Ethan Mollick on X: "Claude Sonnet 3.5 generated significantly better ideas for research papers than humans, but when researchers tried executing the ideas the gap between human & AI idea quality disappeared. Execution is a harder problem for AI. (Yet this is a better outcome for AI than I expected) https://t.co/1Xkkdj2T9K"

Quote

Misha Teplitskiy | Science of Science

@MishaTeplitskiy

Jun 26

Verrrrry intriguing-looking and labor-intensive test of whether LLMs can come up with good scientific ideas. After implementing those ideas, the verdict seems to be "no, not really."

10:59 AM · Jun 27, 2025

43.2K Views

prof-g

@robertghrist

Jun 27

and the rate of increase in capabilities of human execution is \epsilon, while for the AIs...

ai can't execute, we can't stop

Technically, the gap reversed! Before execution, expert reviewers score AI ideas higher than human ideas, and after execution, human ideas score higher. Ratings on human idea effectiveness are basically the same before and after, but ratings on AI idea effectiveness drop big time.

AI can see the destination but can’t navigate the journey. Ideas are about connecting dots that exist; execution is about creating dots that don’t.

So does that mean that Claude came up with more novel and interesting ideas - that were kinda unfeasible? Maybe the human researchers filtered out the unviable ones through experience?

Research is a parallel process, I wonder if there's a lack of diversity in AI's research ideas?

What is the relevance now that we have reasoning models?

Misha Teplitskiy | Science of Science

@MishaTeplitskiy

Jun 27

Worth noting that there's some debate over whether the ideas (pre-execution step) were actually novel

arxiv.org

All That Glitters is Not Novel: Plagiarism in AI Generated Research

Automating scientific research is considered the final frontier of science. Recently, several papers claim autonomous research agents can generate novel research ideas. Amidst the prevailing...

They are great for exploring idea spaces. Less useful on executing on the ideas.

AIs excel at being impressive. That vein runs deep in the training data (basically social media)

Reminds me of what Terence Tao said in the recent Fridman podcast: a lack of ‘smell’ or taste in their thinking

That drop-off in execution quality is the real story here. AI can brainstorm, but the messy reality of actually *doing* research is a whole different ballgame. Makes sense why that's harder for it right now.
