AIME I 2025: A Cautionary Tale About Math Benchmarks and Data Contamination

AIME 2025 Part I was conducted yesterday, and the scores of some language models are available here: matharena.ai, thanks to @mbalunovic, @ni_jovanovic, et al.

I have to say I was impressed: I predicted the smaller distilled models would crash and burn, but they actually scored a reasonable 25-50%. That was surprising to me! These are new problems, not seen during training, right? I expected smaller models to barely score above 0%. It's really hard to believe that a 1.5B model can solve pre-math olympiad problems when it can't multiply 3-digit numbers. I was wrong, I guess.

I then used OpenAI's Deep Research to see if problems similar to those in AIME 2025 exist on the internet. And guess what? An identical problem to Q1 of AIME 2025 exists on Quora: quora.com/In-what-bases-

I thought maybe it was just a coincidence, and used Deep Research again on Problem 3. And guess what? A very similar question was on math.stackexchange: math.stackexchange.com/questions/3548

Still skeptical, I used Deep Research on Problem 5, and a near-identical problem appears again on math.stackexchange: math.stackexchange.com/questions/3146

I haven't checked beyond that because the freaking p-value is too low already. Problems near-identical to the test set can be found online.

So what, if anything, does this imply for math benchmarks? And what does it imply for all the sudden hill climbing due to RL? I'm not certain, and there is a reasonable argument that even if the train set contains near-identical but not exact copies of test data, solving the test problems is still generalization. I am sympathetic to that. But I also wouldn't rule out that GRPO is amazing at sharpening memories along with math skills.

At the very least, the above shows that data decontamination is hard. Never ever underestimate the amount of stuff you can find online. Practically everything exists online.
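
To make the "near-identical but not exact" point concrete, here is a minimal sketch of one common style of contamination check: word n-gram overlap between a test item and a candidate web document. It is not the method used by matharena.ai or by Deep Research. The function names, the choice of trigrams, and the toy strings are all assumptions made for illustration, and none of the strings are actual AIME problems.

```python
import re


def ngrams(text, n=3):
    """Return the set of word n-grams in `text` (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_overlap(a, b, n=3):
    """Jaccard similarity of the n-gram sets of two texts, in [0.0, 1.0]."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0  # too short to compare
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Toy strings (hypothetical, not real benchmark items).
test_item = "Find the number of ordered pairs of integers x and y such that x plus y equals ten"
web_copy = "How many ordered pairs of integers x and y are there such that x plus y equals ten?"
unrelated = "The capital of France is Paris and it lies on the Seine"

print(jaccard_overlap(test_item, test_item))  # exact copy
print(jaccard_overlap(test_item, web_copy))   # reworded copy
print(jaccard_overlap(test_item, unrelated))  # unrelated text
```

A light rewording already pulls the score well below that of an exact copy, so a filter with a strict threshold would let the reworded version through. That is one reason decontamination by string matching is hard.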

12:35 PM · Feb 8, 2025

88.7K Views

Noorie (@nooriefyi) · Feb 8

benchmarks are a snapshot in time. it's the trend that matters.

Interesting finding. My take on it is that this is less about a "bad" dataset and more about highlighting just how hard it is to create original math problems. And I suspect this isn't just AIME25; other competitions probably suffer from similar problems. If you start creating new

This is a bit like saying AI-driven vehicles did well on a driving test using existing roads.

Wait, I didn't realize that AIME is literally just 15 problems every year. Why are people using a 15-problem set as a major benchmark?
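
A rough way to see the concern about 15 problems is to look at how coarse and noisy a score on such a small set is. The sketch below assumes each problem is an independent trial with the same success probability, which is a simplification (real problems vary in difficulty). The function names and the 40% example accuracy are hypothetical.

```python
import math


def score_step(num_problems):
    """Change in accuracy caused by getting one more problem right."""
    return 1 / num_problems


def binomial_standard_error(accuracy, num_problems):
    """Standard error of an observed accuracy under a simple binomial model."""
    return math.sqrt(accuracy * (1 - accuracy) / num_problems)


n = 15  # problems in one AIME paper, as noted in the reply above
print(f"one problem is worth {score_step(n):.1%} of the score")
print(f"standard error at 40% accuracy: {binomial_standard_error(0.40, n):.1%}")
```

Under these assumptions a single problem moves the score by several percentage points, and the sampling noise on one run is larger than that, so small gaps between models on a single 15-problem paper are hard to interpret.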

If I'm not mistaken, math and other olympiads get new questions each year. So as long as new questions are created for benchmarks, data contamination shouldn't be an issue, right?

Let's propose a new benchmark: AIME-online. Take all the leading LLMs and let them search the internet to solve the AIME. Now you get to evaluate both reasoning skills *and* math skills. And there's no worry about cheating because... everyone is cheating with the same rules.

These language models must study with great determination before we raise more benchmarks.
