Ethan Mollick

‪@emollick.bsky.social‬

New study shows LLMs outperform neuroscience experts at predicting experimental results in advance of experiments (86% vs 63% accuracy). They use a fine-tuned Mistral 7B but other models worked too. Suggests LLMs can integrate scientific knowledge at scale to support research.

November 29, 2024 at 9:10 AM

‪Ethan Mollick‬ ‪@emollick.bsky.social‬

https://www.nature.com/articles/s41562-024-02046-9.pdf

‪Maximilian Hoffmann‬ ‪@mh123.bsky.social‬

To be fair: no experiments were conducted in this study. The LLM discriminated between a published abstract and an altered version of it. The training data was reasonably shown not to be contaminated by those abstracts.

‪gnperdue‬ ‪@gnperdue.bsky.social‬

That is quite different from what is implied by the abstract of this paper. I haven’t read this paper, but how did they audit the training data to be sure it didn’t contain the papers from this study?

‪Ethan Mollick‬ ‪@emollick.bsky.social‬

‪gnperdue‬ ‪@gnperdue.bsky.social‬

Thanks.

‪Alex Choi‬ ‪@aschoi.bsky.social‬

I mean, isn't it possible that LLMs could just be better at reading scientific abstracts? Not necessarily better at predicting neuroscience results?

‪Pheasant Plucker‬ ‪@pheasant.bsky.social‬

Better at predicting what an abstract should look like. The scientists are looking for novelty; the LLM is looking for patterns.

‪Yhonatan Shemesh‬ ‪@yshemesh.bsky.social‬

Given that LLMs model multiple perspectives in the training data, they may express a sort of wisdom of crowds (higher accuracy than any one individual). But this may also mean worse predictions on surprising or exceptional results. If true, this would be a strong argument against using LLMs to guide research agendas.

‪Yhonatan Shemesh‬ ‪@yshemesh.bsky.social‬

For example: open.substack.com/pub/understa...

@binarybits.bsky.social

Why the deep learning boom caught almost everyone by surprise

"You’ve taken this idea way too far," a mentor told Prof. Fei-Fei Li.

‪yo-cuddles.bsky.social‬ ‪@yo-cuddles.bsky.social‬

This is probably a function of LLMs' superhuman memory. Not strict memorization, which seems controlled for, but I imagine many of these questions were pretty predictable to specialists. LLMs would probably score better on a test that combined exams for doctors and lawyers than almost any human.

‪caracter.bsky.social‬ ‪@caracter.bsky.social‬

Just my two cents. So basically you are saying LLMs predict the study results before they happened? How is that significant? Isn't the point of LLMs to predict the most likely token? So I would expect them to be very good at anything related to prediction.

‪themajor.bsky.social‬ ‪@themajor.bsky.social‬

... Hinton makes this point a lot ... still getting my head around it

‪Graham Erwin‬ ‪@grahamerwin.bsky.social‬

Thanks Ethan! The two things that caught me were 1) fine-tuning only gets you an additional 3% accuracy and 2) "instruct" models perform worse than base models! The authors go on to say "...aligning LLMs to engage in natural language conversations hinders their scientific inference abilities."

‪johnrecords.bsky.social‬ ‪@johnrecords.bsky.social‬

Impressive that a 7B model could do this.

‪Valerie the Neuroscientist‬ ‪@docvalerie.bsky.social‬

Thanks for this. IMO though, this is not surprising. Scientists have become increasingly overspecialized in the past 30 years, while LLMs, by the nature of their training, become generalists instead. Take a sample of more generalist scientists who practice across interdisciplinary fields & try.

‪Daniel Mewes‬ ‪@dmewes.com‬

One thing I'd expect LLMs to be better at than humans is being calibrated correctly to the statistical distribution of the two possible outcomes, as they will resemble max likelihood prediction. Maybe human scientists could improve their accuracy if they first went through calibration training?

‪Daniel Mewes‬ ‪@dmewes.com‬

One way to validate this would be to ask humans to provide a 0-1 likelihood for one of the binary experiment outcomes, instead of asking for a binary choice. Then determine the max-likelihood threshold based on a calibration ("training") set for each individual human rater to build a human classifier.
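A minimal sketch of the per-rater thresholding idea above, under stated assumptions: each rater supplies a probability in [0, 1] for one outcome on a calibration set with known answers, and the threshold is chosen to maximize accuracy on that set (a simple stand-in for the "max-likelihood threshold", not a method taken from the paper). The names `best_threshold` and `classify` are hypothetical.

```python
from typing import List, Tuple


def best_threshold(probs: List[float], outcomes: List[int]) -> Tuple[float, float]:
    """Pick the cutoff that maximizes accuracy on one rater's calibration set.

    probs[i] is the rater's stated probability that outcome 1 is correct;
    outcomes[i] is the true label (0 or 1). Outcome 1 is predicted whenever
    the stated probability is >= the threshold.
    """
    # Only the rater's own stated values (plus "never predict 1") can change
    # the predictions, so those are the only candidates worth checking.
    candidates = sorted(set(probs)) + [float("inf")]
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        correct = sum((p >= t) == bool(y) for p, y in zip(probs, outcomes))
        acc = correct / len(probs)
        if acc > best_acc:  # strict '>' keeps the lowest threshold on ties
            best_t, best_acc = t, acc
    return best_t, best_acc


def classify(prob: float, threshold: float) -> int:
    """Turn a new probability judgment into a binary prediction."""
    return int(prob >= threshold)


if __name__ == "__main__":
    # Made-up calibration data for one overconfident rater.
    calib_probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
    calib_truth = [1, 1, 0, 0, 0, 0]
    threshold, accuracy = best_threshold(calib_probs, calib_truth)
    print(threshold, accuracy)
    print(classify(0.75, threshold))
```

The threshold would then be applied to that rater's judgments on held-out items, so the accuracy reported for the "human classifier" is not measured on the same data used to tune it.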

‪SlavaV‬ ‪@slavav.bsky.social‬

Insane that a 7B model achieves that

‪Alex Washburne‬ ‪@reptalex.bsky.social‬

I don't have a subscription - how did they confirm the studies used for out-of-sample prediction weren't either (A) used in the initial model training/fine-tuning or (B) studies whose results are summarized in later papers? Wondering how much of this is "intelligence" vs. good training data

‪Ethan Mollick‬ ‪@emollick.bsky.social‬

‪Alex Washburne‬ ‪@reptalex.bsky.social‬

I'm still reading into it, but I'm not sure zlib-perplexity is a good way to address the question, as the data-agnostic text compression from zlib doesn't address the many ways scientific statements can appear in training data. The number of microbes in the human gut is one example that comes to mind.

‪Alex Washburne‬ ‪@reptalex.bsky.social‬

"About 100T microbes live in the gut"; "The human gut has an estimated 100T microbes"; "The number of microbes in the human gut totals 100 trillion"; "The average human gut contains roughly 100T microbes". Quantitative statements & connections between nuanced parts of systems could have many phrasings.
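A small sketch of the concern above, using Python's standard `zlib` module. It only illustrates that byte-level compression rewards exact repetition and not shared meaning; it does not reproduce the paper's contamination check. The helper names `compressed_size` and `marginal_bytes` are hypothetical.

```python
import zlib


def compressed_size(text: str) -> int:
    """Number of bytes zlib needs for the UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8")))


def marginal_bytes(context: str, text: str) -> int:
    """Extra compressed bytes needed to append text after context.

    A small value means zlib found text largely redundant given context.
    """
    return compressed_size(context + " " + text) - compressed_size(context)


# The four phrasings of the same fact quoted in the post above.
phrasings = [
    "About 100T microbes live in the gut",
    "The human gut has an estimated 100T microbes",
    "The number of microbes in the human gut totals 100 trillion",
    "The average human gut contains roughly 100T microbes",
]

if __name__ == "__main__":
    target = phrasings[0]
    # An exact repeat is almost free for zlib ...
    print("exact repeat:", marginal_bytes(target, target))
    # ... while each paraphrase of the same fact gives far less help.
    for other in phrasings[1:]:
        print("after paraphrase:", marginal_bytes(other, target))
```

Because zlib matches byte sequences only, a fact that appears in training data under a different wording looks nearly as "new" to it as unrelated text, which is the gap the post is pointing at.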

‪janbam‬ ‪@janbam.bsky.social‬

biased