Claude Sonnet 3.5 generated significantly better ideas for research papers than humans, but when researchers tried executing the ideas, the gap between human & AI idea quality disappeared. Execution is a harder problem for AI. (Yet this is a better outcome for AI than I expected.)
Quote
Misha Teplitskiy | Science of Science
@MishaTeplitskiy
Verrrrry intriguing-looking and labor-intensive test of whether LLMs can come up with good scientific ideas. After implementing those ideas, the verdict seems to be "no, not really."

Technically, the gap reversed! Before execution, expert reviewers score AI ideas higher than human ideas, and after execution, human ideas score higher. Ratings on human idea effectiveness are basically the same before and after, but ratings on AI idea effectiveness drop big time.
AI can see the destination but can’t navigate the journey. Ideas are about connecting dots that exist; execution is about creating dots that don’t.
So does that mean that Claude came up with more novel and interesting ideas that were kinda unfeasible? Maybe the human researchers filtered out the unviable ones through experience?
That drop-off in execution quality is the real story here. AI can brainstorm, but the messy reality of actually *doing* research is a whole different ballgame. Makes sense why that's harder for it right now.