AIME I 2025: A Cautionary Tale About Math Benchmarks and Data Contamination
AIME 2025 Part I was held yesterday, and the scores of some language models are available here:
matharena.ai thanks to , et al.
I have to say I was impressed: I had predicted the smaller distilled models would crash and burn, but they actually scored a reasonable 25-50%.
That was surprising to me! These are new problems, not seen during training, right? I expected smaller models to barely score above 0%. It's really hard to believe that a 1.5B model can solve pre-math olympiad problems when it can't multiply 3-digit numbers. I was wrong, I guess.
I then used OpenAI's Deep Research to see if problems similar to those in AIME 2025 exist on the internet. And guess what? An identical problem to Q1 of AIME 2025 exists on Quora:
quora.com/In-what-bases-
I thought maybe it was just coincidence, and used Deep Research again on Problem 3. And guess what? A very similar question was on math.stackexchange:
math.stackexchange.com/questions/3548
Still skeptical, I used Deep Research on Problem 5, and a near-identical problem appears again on math.stackexchange:
math.stackexchange.com/questions/3146
I haven't checked beyond that because the freaking p-value is too low already. Problems near-identical to the test set can be found online.
So what, if anything, does this imply for math benchmarks? And what does it imply for all the sudden hill climbing due to RL?
I'm not certain, and there is a reasonable argument that even if the training set contains near-identical but not exact copies of test data, solving the test problems is still generalization. I am sympathetic to that. But I also wouldn't rule out that GRPO is amazing at sharpening memories along with math skills.
At the very least, the above shows that data decontamination is hard.
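To make the "decontamination is hard" point concrete, here is a minimal sketch of a word n-gram overlap check, one common style of contamination filter. The function names, the 8-gram window, and all three strings are made up for illustration; none of them are AIME problems or any lab's actual pipeline.

```python
import re


def word_ngrams(text: str, n: int) -> set:
    """Lowercase the text, keep alphanumeric tokens, return the set of word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the corpus document."""
    test_grams = word_ngrams(test_item, n)
    if not test_grams:
        return 0.0
    return len(test_grams & word_ngrams(corpus_doc, n)) / len(test_grams)


# Hypothetical strings, invented for this example only.
test_problem = (
    "How many ways can seven distinct books be arranged on a shelf "
    "if two particular books must stay next to each other?"
)
exact_copy = "Homework help: " + test_problem + " Thanks!"
paraphrase = (
    "Seven different books go on a shelf, and two given books have to be "
    "adjacent. Count the possible arrangements."
)

print(overlap_fraction(test_problem, exact_copy))   # verbatim copy is flagged
print(overlap_fraction(test_problem, paraphrase))   # same problem, reworded, slips through
```

The verbatim copy scores 1.0 and the paraphrase scores 0.0, even though both pose the same problem. A filter of this kind catches exact copies but not the near-identical variants described above.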
Never ever underestimate the amount of stuff you can find online. Practically everything exists online.
Interesting finding.
My take on it is that this is less about a "bad" dataset and more about highlighting just how hard it is to create original math problems. And I suspect this isn't just AIME25; other competitions likely suffer from similar problems.
If you start creating new
This is a bit like saying AI-driven vehicles did well on a driving test using existing roads.
Wait, I didn't realize that AIME is literally just 15 problems every year. Why are people using a 15 problem set as a major benchmark?
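One way to put a number on the 15-problem concern is a confidence interval on the score. This is a rough sketch using the standard Wilson score interval; the 6-out-of-15 score is a hypothetical example, not a reported result, and it treats the problems as independent trials, which is itself an assumption.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p = correct / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half, center + half


# Hypothetical score: 6 correct out of 15 problems (40%).
low, high = wilson_interval(6, 15)
print(f"95% interval: {low:.0%} to {high:.0%}")
```

With only 15 problems, the interval around a 40% score runs from roughly 20% to 64%, so fairly large gaps between models on a single AIME paper can sit within noise.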
These language models must study with great determination before we raise more benchmarks.