Jeff Dean

‪@jeffdean.bsky.social‬

We’ve been thrilled by the positive reception to Gemini 2.0 Flash Thinking we discussed in December. Today we’re sharing an experimental update w/improved performance on math, science, and multimodal reasoning benchmarks 📈: • AIME: 73.3% • GPQA: 74.2% • MMMU: 75.4%

January 21, 2025 at 4:31 PM

‪Jeff Dean‬ ‪@jeffdean.bsky.social‬

We’re introducing 1M long context to this experimental update, enabling deeper analysis of long-form texts—like multiple research papers 📝 📝 📝 or extensive datasets 🗄. We’re also giving you tool use capabilities with the ability to turn on code execution in this model.

‪Jeff Dean‬ ‪@jeffdean.bsky.social‬

We’re continuing to iterate, with higher reliability and reduced contradictions between the model’s thoughts and final answers. Check it out as gemini-2.0-flash-thinking-exp-01-21 at goo.gle/4jsCqZC

‪Jeff Dean‬ ‪@jeffdean.bsky.social‬

This model debuts at #1 on the lmarena leaderboard.

‪Tamas Ujhelyi‬ ‪@tamas-ujhelyi.bsky.social‬

congrats! 🙂

‪Gus‬ ‪@gusthema.bsky.social‬

Great results, congrats! But the best part is that you're on bsky!

‪Jeff Dean‬ ‪@jeffdean.bsky.social‬

Glad to be here!

‪Akarshan Biswas‬ ‪@qnixsynapse.bsky.social‬

Congratulations. Really liking the flash thinking model.

‪Loïc A. Royer 💻🔬🧪‬ ‪@loicaroyer.bsky.social‬

I have been testing side by side all four frontier models, and Gemini 2.0 flash thinking has been consistently giving the more detailed and precise answers. I am particularly amazed at the multimodal skills of Gemini, seeing details in images that humans can barely discern… You guys have been busy!

‪Jeff Dean‬ ‪@jeffdean.bsky.social‬

Glad it's working well for you!

‪TheOverEngineered‬ ‪@theoverengineered.bsky.social‬

The flash model is so much faster too. Makes services built with it feel so much better.

‪Dwayne🗿‬ ‪@ilikekillnerds.com‬

Congrats, Jeff. The gains in math benchmarks are almost unbelievable. But can you please please start focusing on code? Many of us devs using LLMs are not using them to do math, we are using them to write and edit code, an area where Gemini models are still far behind.

‪Tim Kellogg‬ ‪@timkellogg.me‬

the belief is that any domain they focus on will cause behavior to emerge that can be applied in all other domains. math is tagged as a priority because it’s used everywhere, math intuition helps a lot, and it’s easy/cheap to verify

‪Dwayne🗿‬ ‪@ilikekillnerds.com‬

I get that. I just don't understand how a competing model like Claude Sonnet 3.5 continues to be one of the leading frontier models in code, creative writing and reasoning despite Sonnet being terrible at math. I guess I expected more from thinking models, especially when it comes to code.

‪Tim Kellogg‬ ‪@timkellogg.me‬

the prize isn’t code — it’s agents. math likely holds some traits needed for keeping an agent on track, like logical thinking, following processes, etc.

‪Nick Fisher‬ ‪@nickfisherau.bsky.social‬

When can we see increased quotas for Flash 2.0 on Vertex? I'm itching to switch over but it's just not possible at the moment.