After the IMO results last summer, some dismissed it as “high school math.” We think our latest models will remove any doubt that STEM research is about to fundamentally change. Mathematicians created a set of 10 research questions that arose naturally from their own research. Only they know the answers, and they gave the world a week to use LLMs to try to solve them. We think our latest models make it possible to solve several of them. This is an internal model for now, but I’m optimistic we’ll get it (or a better model) out soon.
Quote
Jakub Pachocki
@merettm
Very excited about the "First Proof" challenge. I believe novel frontier research is perhaps the most important way to evaluate capabilities of the next generation of AI models. We have run our internal model with limited human supervision on the ten proposed problems. The …

The "high school math" dismissal was always goalpost-moving. But solving research problems still isn't posing them. The gap between "solves novel problems when asked" and "identifies which problems are worth solving" is where human researchers still have the edge.
The shift from "high school math" to research-level results changes the conversation entirely. The question isn't whether AI can do science but how we restructure verification, credit, and methodology when the most productive researcher in the room isn't human.
It's not ChatGPT or Gemini that did these wonders. A group of 11 top mathematicians created a super tough test called "First Proof". They picked 10 real math problems straight from their unpublished research, stuff that had never been shared online. They weren't any simple high …
The exciting part of #1stProof isn’t the headline, it’s the auditability. These problems are hard to verify, and “sounds right” proofs are exactly where models can look strongest while drifting. If you publish attempts, a per-problem evidence ladder …
the quiet part here is the evaluation design, not the results. unpublished solutions held by the creators = the first benchmark where data contamination is structurally impossible. if this format scales, every 'but it just memorized the training data' argument dies overnight.
The “high school math” dismissal always misses the point. The real signal is if the model can push on fresh questions end to end: set up the right lemmas, keep a clean proof state, and produce something a domain expert can actually build on.
Solving olympiad problems proves skill — solving open research proves impact. If models start generating results mathematicians actually build on, that’s when the shift becomes real.
This solves one of the big issues with benchmarks: much less chance of the data contamination that plagues even ARC-AGI 2. But when are you putting it out for public access? Gemini 3 Deepthink is already out.
The really clever part here is making the answers public but keeping them encrypted. That solves the "did it memorize this from training data" problem that haunted every other benchmark. Also love that they're …
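(A minimal sketch of how that "publish now, reveal later" setup can work. This is only an illustration of the generic commit-then-reveal pattern, not the actual scheme First Proof uses; a hash commitment stands in here for whatever encryption they chose, and all names in the snippet are made up.)

```python
# Generic commit-then-reveal sketch (illustrative only; not First Proof's actual scheme).
import hashlib
import os

def commit(solution_text: str) -> tuple[str, bytes]:
    """Publish the digest now; keep the solution text and the salt private."""
    salt = os.urandom(16)
    digest = hashlib.sha256(salt + solution_text.encode("utf-8")).hexdigest()
    return digest, salt

def verify(published_digest: str, revealed_solution: str, revealed_salt: bytes) -> bool:
    """After the reveal, anyone can check the solution against the old commitment."""
    digest = hashlib.sha256(revealed_salt + revealed_solution.encode("utf-8")).hexdigest()
    return digest == published_digest

# Organizers publish the digest before the challenge window opens...
published, salt = commit("Solution to Problem 7: ...")
# ...and reveal the solution plus salt after the deadline. The solution could not
# have leaked into training data, yet the commitment proves it existed all along.
assert verify(published, "Solution to Problem 7: ...", salt)
```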
This is the benchmark I actually care about: novel problems with crisp verification. Curious how much of the “limited supervision” was tool scaffolding vs pure reasoning.
Interesting second-order effect: stronger reasoning models shift bottlenecks from answers to problem framing. Teams that ask better questions will pull away fastest.
The IMO dismissal was cope. Research-level proofs are the real test. If models produce novel results that humans verify and publish, AI stops being a tool and becomes a colleague.
So I wonder, if Einstein had had this, would GR have been solved in weeks rather than 7 years? The principles are simple, but the details of the math, well, that's why the NYT claimed there were only a dozen scientists in the world qualified to review the paper.
Below is what those 10 problems from the paper "First Proof" are really testing. And what limitations could this research method have?
Quote
LIFE 2030 and Beyond
@life2030com
Replying to @merettm
The recent paper “First Proof” (arXiv:2602.05192) presents 10 open problems to test AI. While the problems are interesting, I came away feeling a bit… disappointed. All 10 problems seem very well suited to the algebraic and symbolic reasoning which today’s LLM-based systems are …
In code, we now do almost no review of AI-produced code, and in a year or two we will simply believe what AI produces. We will do the same with scientific research. A bit anxious, but we will accelerate human consciousness.
Calling IMO “just high school math” always missed the point. If models can now tackle fresh, unpublished research problems designed by mathematicians themselves, that’s not hype, that’s a shift. Curious to see how this changes what doing math research even looks like.
We can only hope buys the Star Trek IP (ALL of it, not 85%), and puts a real fan like in control of it. Come on Elon! You rescued Twitter, you can do it for Star Trek as well! It's not like ST has no cultural influence, and it especially meshes well
The bottleneck is not solving a few handcrafted research questions. It is reliability under adversarial scrutiny, reproducibility of proofs, and sustained novelty beyond the training distribution.
I think when AI starts solving original research problems, we're watching the beginning of a fundamental shift. In my experience building tech, the tools that make impossible things possible always unlock exponential value. STEM research has been bottlenecked by human time for …
If LLMs can tackle original math questions, we may soon see AI creating new insights, not just solving problems. That's a game changer for STEM research.
You never provided an update on the IMO model after saying it would be released at the end of 2025. Is it 5.2 Pro?
They dismissed it as high school math. The machine graduated. One suspects the goalposts shall require their own moving company.
imo wasn't just high school math, it's peak creative problem solving. if the new model can bridge the gap from formal verification to high-level undergraduate analysis, the 'stochastic parrot' argument is officially dead lol. can't wait to see if it handles aops-style edge cases
x.com/CausaNova_DE/s You don't need a better LLM, you need to fix the hallucination problem. GPT, the brain, is guessing; CausaNova, the body, checks the results.
Quote
CausaNova
@CausaNova_DE
Just submitted research-level math problems to @1stproof that GPT-5.2 & Gemini 3.0 Deepthink couldn't solve. Using CausaNova - a neuro-symbolic verification system I built that proves correctness instead of hallucinating answers. The answers are encrypted until Feb 13. Let's …
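(For readers wondering what the "brain guesses, body checks" split above looks like in practice: below is a toy stand-in for the general guess-then-verify pattern, not CausaNova's internals. A proposed closed form is accepted only if an independent symbolic check re-derives it; the example identity is invented purely for illustration.)

```python
# Toy guess-then-verify loop: a proposed answer is checked symbolically, not trusted.
# Illustrative only; unrelated to CausaNova's or First Proof's actual machinery.
import sympy as sp

k, n = sp.symbols("k n", positive=True, integer=True)

# Step 1: the "brain" (an LLM, say) proposes a candidate closed form for sum_{k=1}^{n} k^3.
candidate = (n * (n + 1) / 2) ** 2

# Step 2: the "body" re-derives the sum symbolically and compares, instead of trusting fluent prose.
exact = sp.summation(k**3, (k, 1, n))
verified = sp.simplify(exact - candidate) == 0

print("candidate verified:", bool(verified))  # True only if the guess actually holds
```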