Jeff Dean
‪@jeffdean.bsky.social‬
We’ve been thrilled by the positive reception to Gemini 2.0 Flash Thinking we discussed in December. Today we’re sharing an experimental update w/improved performance on math, science, and multimodal reasoning benchmarks 📈: • AIME: 73.3% • GPQA: 74.2% • MMMU: 75.4%
January 21, 2025 at 4:31 PM
We’re introducing 1M long context to this experimental update, enabling deeper analysis of long-form texts—like multiple research papers 📝 📝 📝 or extensive datasets 🗄. We’re also giving you tool use capabilities with the ability to turn on code execution in this model.
We’re continuing to iterate, with higher reliability and reduced contradictions between the model’s thoughts and final answers. Check it out as gemini-2.0-flash-thinking-exp-01-21 at goo.gle/4jsCqZC
This model debuts at #1 on the lmarena leaderboard.
Great results, congrats! But the best part is that you're on bsky!
Congratulations. Really liking the flash thinking model.
I have been testing all four frontier models side by side, and Gemini 2.0 Flash Thinking has consistently given the most detailed and precise answers. I am particularly amazed at the multimodal skills of Gemini, seeing details in images that humans can barely discern… You guys have been busy!
Glad it's working well for you!
The flash model is so much faster too. Makes services built with it feel so much better.
Congrats, Jeff. The gains in math benchmarks are almost unbelievable. But can you please please start focusing on code? Many of us devs using LLMs are not using them to do math, we are using them to write and edit code, an area where Gemini models are still far behind.
the belief is that any domain they focus on will cause behavior to emerge that can be applied in all other domains. math is tagged as a priority because it’s used everywhere, math intuition helps a lot, and it’s easy/cheap to verify
I get that. I just don't understand how a competing model like Claude Sonnet 3.5 continues to be one of the leading frontier models in code, creative writing and reasoning despite Sonnet being terrible at math. I guess I expected more from thinking models, especially when it comes to code.
the prize isn’t code — it’s agents. math likely holds some traits needed for keeping an agent on track, like logical thinking, following processes, etc.
When can we see increased quotas for Flash 2.0 on Vertex? I'm itching to switch over but it's just not possible at the moment.