2/ Olympiad problems are qualitatively different from GSM8K/MATH:
- Multi-step, idea-driven proofs
- Require abstraction and planning
- Answers alone are not informative
Despite high MATH accuracy, models fail dramatically when reasoning is evaluated directly.
3/ Even when models guess the correct final answer, their derivations are often logically invalid. We examined solutions to questions with a concrete final answer and found that the chance of a correct solution given a correct final answer is nearly zero.
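In other words (our notation, restating the finding above):
$$\Pr(\text{valid derivation} \mid \text{correct final answer}) \approx 0$$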
4/ It’s not just that LLMs fail on difficult IMO shortlist problems; their mistakes follow consistent patterns. After a detailed review, we grouped the most common failure modes into recurring categories, including Proof by Example, Solution by Trial-and-Error, Inventing Wrong Facts, Proposal Without Verification, and Begging the Question.
(Concrete examples can be found in our draft.)
5/ The type of reasoning failure an LLM makes is linked to the type of math problem it’s solving.
We found strong correlations between fallacy modes and (1) whether a final answer is expected and (2) the math domain.
6/ When LLMs are prompted with questions that expect a final answer, two fallacies dominate:
– Proof by Example
– Solution by Trial-and-Error
Why? Because the models often don’t derive the answer: they guess it or extrapolate from small test cases.
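A classic illustration of why small-case extrapolation fails (our example, not from the paper): Euler's polynomial
$$n^2 + n + 41$$
is prime for every n = 0, 1, ..., 39, yet composite at n = 40, since 40^2 + 40 + 41 = 1681 = 41^2. A model that checks a handful of cases and extrapolates would confidently assert a false claim.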
7/ In contrast, when models are prompted with pure proof problems, two other failure modes emerge:
– Inventing Wrong Facts
– Proposal Without Verification
To correctly solve such problems, the models must logically connect the problem’s initial assumptions and constraints to /
8/ the conclusion through a sequence of sound reasoning steps. However, we find that models frequently replace sound reasoning with unjustified and often incorrect statements. This is likely due to how these models are trained. The reward signals used to train reasoning /
9/ models probably focus only on final answers, formatting, and other surface-level factors, without adequately accounting for the validity of the reasoning process itself.
10/ Similar differences show up in the relative frequencies of fallacies across problem topics.
11/ Because many geometry questions can only be solved through purely logical reasoning rather than algebraic manipulation,
- Inventing Wrong Facts,
- Proposal Without Verification, and
- Begging the Question
are more common in geometry problems.
12/ In contrast, a significant proportion of algebra problems fall into categories such as functional equations, polynomial equations, and optimization tasks.
On these, we observed that all models tend to rely on trial and error to determine the final answer.
13/ This behavior results in a higher frequency of the Solution by Trial and Error fallacy in LLM-generated solutions for algebra problems. Number theory problems involving Diophantine equations or integer-valued functional equations exhibit the same issue.
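For instance (our illustration, not from the paper): given the Cauchy equation
$$f(x+y) = f(x) + f(y) \quad \text{for all } x, y \in \mathbb{R},$$
a model may verify that f(x) = cx works and declare it the full solution set. But without an added regularity assumption (continuity, monotonicity, or boundedness), non-linear solutions exist, so the trial-and-error "answer" skips exactly the step where the real difficulty lies.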
14/ Additionally, we found that the Proof by Example fallacy occurs more frequently in algebra, combinatorics, and number theory problems compared to geometry.
15/ This trend arises because many problems in these three areas can be framed as proving statements of the form Q(x), where x belongs to a specific domain defined by the problem.
16/ In such cases, LLMs frequently attempt to verify the proposition Q by evaluating selected examples from its domain rather than constructing a general proof, resulting in the Proof by Example fallacy.
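Schematically (our formalization): for an infinite domain D, checking finitely many cases
$$Q(x_1) \wedge Q(x_2) \wedge \cdots \wedge Q(x_k), \quad x_1, \dots, x_k \in D,$$
does not establish $\forall x \in D,\ Q(x)$; only a general argument over all of D does.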
17/ Some might ask: can’t we simply use agentic schemas or test-time scaling techniques to improve solution quality? But to truly bootstrap, a model must first be able to recognize its own mistakes. We designed two experiments to investigate this.
18/ We evaluate two settings:
1- Do LLMs label correct, authentic solutions as valid more often than fallacious generated ones?
2- Given a pair of solutions (one valid, one fallacious), can the model select the correct one?
(A sketch of both setups follows below.)
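A minimal sketch of the two setups, assuming a hypothetical `model` callable that maps a prompt string to a response string (function names and prompt wording are ours, not the exact prompts from the paper):

```python
import random

def judge_binary(model, problem: str, solution: str) -> bool:
    """Setting 1: ask for a standalone valid/invalid verdict on one solution."""
    prompt = (f"Problem:\n{problem}\n\nSolution:\n{solution}\n\n"
              "Is this solution logically valid? Answer YES or NO.")
    return model(prompt).strip().upper().startswith("YES")

def judge_pairwise(model, problem: str, valid: str, fallacious: str) -> bool:
    """Setting 2: show both solutions, state that exactly one is valid,
    and return True iff the model picks the valid one."""
    # Randomize presentation order so position bias cannot inflate the score.
    a, b = (valid, fallacious) if random.random() < 0.5 else (fallacious, valid)
    prompt = (f"Problem:\n{problem}\n\nSolution A:\n{a}\n\nSolution B:\n{b}\n\n"
              "Exactly one of these solutions is valid. Answer A or B.")
    choice = model(prompt).strip().upper()[:1]
    return (choice == "A" and a is valid) or (choice == "B" and b is valid)
```

Under this randomization, a judge that answers blindly scores 50% on the pairwise test, which is the baseline the results below are measured against.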
19/ In setting (1), we used real AoPS forum solutions alongside generated incorrect ones. LLMs were prompted to make a binary judgment. Our results show they don’t label correct solutions as “correct” significantly more often than the fallacious ones.
20/ In setting (2), we gave models side-by-side solutions: one correct (from AoPS), one incorrect (LLM-generated). The prompt made it explicit that exactly one is valid.
Can they choose the right one?
Only DeepSeek R1 and o3-mini do barely better than random!
21/ The rest are random or worse than random at distinguishing correct solutions from blatantly false ones. These results show that current models are not suitable as mathematical judges, even on Olympiad-level problems with clearly fallacious outputs.
22/ So, in conclusion:
- LLMs still struggle with logical rigor in Olympiad-level math.
- They rely heavily on heuristics over true reasoning.
- Even as judges, they fail to distinguish valid proofs from fallacious ones.
23/ Our draft is available on arXiv, with more examples and all of the prompts we used.
Nice to see this quantified — I’ve experienced this viscerally myself with reasoning models. They can’t write proofs at all.
Yeah, I think it's mostly a problem with problems that have some unusual, creative solution idea, like shortlist or USAMO. I experienced this myself. They are relatively better if you give them easier contests. Some countries have easier and some have harder national-level /
Intuitively, I expect LLMs to fail easily on no-go theorems like variations of the halting problem. When there *is* a finite set of solutions, it is much easier for an LLM to find one by exploration (like chess). However, non-exploratory problems will be a challenge for LLMs and AI.