Finally filled out my forecasts. This was excellently designed! 🙌 Loved having the context laid out since I'm too lazy to read the papers. TL;DR I think everything saturates except FrontierMath, which merely decuples. 😬 Talk about it at Minifest tomorrow!
Quote
Sage
@sage_future_
Is AGI just around the corner or is AI scaling hitting a wall? To make this discourse more concrete, we’ve created a survey for forecasting concrete AI capabilities by the end of 2025. Fill it out and share your predictions by end of year! bit.ly/ai-2025 🧵
I never felt like a "long timelines person" tbh, since 2019 my attitude has been "Holy crap this could be soon & we should prep." Left tails matter more than medians! And IME a lot of people who have "shorter" timelines than me forecast much weaker endpoints.
FWIW communication wise, my understanding of your views 1-2 years ago from skimming BioAnchors and reading summaries + your update + a talk you gave + LW debate, would absolutely not have made me think you would be expecting this for 2025. I am highly surprised.
A lot of what's going on here is that I think we've seen repeatedly that benchmark performance moves faster than anyone thinks, while real-world adoption and impact move slower than most bulls think.
I have been surprised every year the last four years at how little AI has impacted my life and the lives of ordinary people. So I'm still confused how saturating these benchmarks translates to real-world impacts.
As soon as you have a well-defined benchmark that gains significance, AI developers tend to optimize for it, so it gets saturated way faster than expected — but not in a way that generalizes perfectly to everything else.
In the last round of benchmarks we had basically few-minute knowledge recall tasks (e.g. bar exam). Humans that can do those tasks well also tend to do long-horizon tasks that draw on that knowledge well (e.g. be a lawyer). But that's not the case for AIs.
David Watson 🥑

This round of benchmarks is few-hour programming and math tasks. Humans who do those tasks very well can also handle much longer tasks (being a SWE for many years). But I expect AI agents to solve them in a way that generalizes worse to those longer tasks.
In this specific case, my assumption is that the model has been pre-trained on a corpus of, e.g., hundreds of practice questions, making the bar exam a benchmark perfectly crafted to overestimate LLMs relative to humans.