Ethan Mollick

‪@emollick.bsky.social‬

👀 A 10 page paper caused a recent panic because of a math error. I was curious if AI would spot the error by just prompting: “carefully check the math in this paper” especially as the info is not in training data. o1 gets it in one shot. This feels like a big capability gain for scientific work

December 15, 2024 at 8:21 AM

‪Ethan Mollick‬ ‪@emollick.bsky.social‬

Article on error with link to paper: nationalpost.com/news/canada/...

How a simple math error sparked a panic about black plastic kitchen utensils

A study warned that people were being exposed to a toxic chemical from kitchen utensils. But the researchers misstated the safe daily limit

nationalpost.com

‪Ethan Mollick‬ ‪@emollick.bsky.social‬

This was o1, not pro. I just pasted in the article with the literal prompt above. Claude did not spot the error when given the PDF until it was told to look just at the reference value.

‪Ethan Mollick‬ ‪@emollick.bsky.social‬

Paper if you want to try to experiment yourself (not yet updated to correct the error): www.sciencedirect.com/science/arti...

From e-waste to living space: Flame retardants contaminating household items add to concern about plastic recycling

Brominated flame retardants (BFRs) and organophosphate flame retardants (OPFRs) are commonly used in electric and electronic products in high concentr…

www.sciencedirect.com

‪wenchoheelio.bsky.social‬ ‪@wenchoheelio.bsky.social‬

Gemini pro 1.5 got it in one shot also

‪Ethan Mollick‬ ‪@emollick.bsky.social‬

These are doing web searches. o1 is not.

‪wenchoheelio.bsky.social‬ ‪@wenchoheelio.bsky.social‬

ChatGPT 4o didn't appear to search the web either. I thought it says if it is, but I may be wrong as I'm less familiar with that one.

‪Rebecca K‬ ‪@datarebecca.bsky.social‬

Sort of. But you wouldn't have thought "carefully check the math" was something you needed to request before this. I feel like the trick is that it's a different gotcha each time. (Though: a big careful prompt of basic things you should just check would still be a win.)

‪Owen Davis‬ ‪@odavis.bsky.social‬

I have a hunch that a substantial portion of new AI use cases (generative or otherwise) will fall under the heading of quality control. You see this in AI adoption case studies, e.g., with product inspection in manufacturing. Has interesting implications for worker-level impacts.

‪Joshua Mask‬ ‪@joshuafmask.bsky.social‬

I highly recommend folks have o1 pro or Gemini 2.0 perform a “reviewer 2” style referee report before you submit to an actual journal. It’s very illuminating.

‪leventov.bsky.social‬ ‪@leventov.bsky.social‬

22h

Ethan, do you cherry-pick the stuff you post on twitter/bsky? How many experiments do you do that never make it to your twitter in which none of the AIs do anything remarkable or badly misunderstand your intent?

‪Ethan Mollick‬ ‪@emollick.bsky.social‬

22h

On social media (as opposed to papers) I try to show the typical frontier of what AI can and can't do. I don't post every failure, but I also don't generally post successes that required prompt engineering or repeated attempts, and try to be clear when I do. So selection but not cherry-picking?

‪Ethan Mollick‬ ‪@emollick.bsky.social‬

22h

On social media, I tend to test AI's further limits, but it does raise the usual question: if AI can't do something, is it a prompt issue or an AI issue or something else? I think most people tend to very much underestimate what AI can do. Also, everything I post should be repeatable by anyone.

‪Dr. Jack Mitcham‬ ‪@jackmitcham.bsky.social‬

15h

And when the issue is just prompt engineering, that can be handled by using specialized tools with GenAI operating on the back end with the "correct" prompt hardcoded in.

‪scrattle.bsky.social‬ ‪@scrattle.bsky.social‬

Tried to reproduce:
- Neither Sonnet nor Opus could spot the error.
- Neither 1206-exp nor 2.0 Flash could spot the error.
- 4o and o1-preview did not spot it.
- o1-mini *did* spot it, but mentioned it as likely a typographical error.
Seems like we still need to put a bit of effort into the prompt.

‪Andreas‬ ‪@andreasthinks.me‬

13h

Feels like the real challenge now is to run the same prompt over a bunch of recently published papers and see how many new mistakes you can uncover...

‪Mohamed Alani‬ ‪@mohamedalani.bsky.social‬

That's impressive; it seems like AI is really stepping up its game in helping us avoid these pitfalls.

‪dr. jean-louis Amat 🇪🇺‬ ‪@jlamat.bsky.social‬

This kind of error detection is expected from reviewers...

‪Kelly Young‬ ‪@twinkleberi.bsky.social‬

Ethan you seem to know a lot about AI. Could you recommend an AI to replace ChatGPT?

‪dtjw.bsky.social‬ ‪@dtjw.bsky.social‬

To err is human. Non-independent AI systems will also err, but the amount of attention they will have available to find errors will also be a lot more.

‪isoma‬ ‪@isoma.bsky.social‬

16h

Attention isn't all you need, but it sounds like it sure helps.

‪Robert Millard PhD‬ ‪@robmillard.bsky.social‬

We are on the brink of something so big…

‪Gmack‬ ‪@gmac65.bsky.social‬

Still mildly shocked that such a trivial error could appear in a peer reviewed paper...

‪Benj Edwards‬ ‪@benjedwards.com‬

Wow this is cool

‪Luis Villa‬ ‪@lu.is‬

That’s impressive.