👀 A 10-page paper caused a recent panic because of a math error. I was curious whether AI would spot the error from just the prompt “carefully check the math in this paper”, especially as the error is not in the training data.
o1 gets it in one shot. This feels like a big capability gain for scientific work.
December 15, 2024 at 8:21 AM
Article on error with link to paper: nationalpost.com/news/canada/...
This was o1, not pro. I just pasted in the article with the literal prompt above.
Claude did not spot the error when given the PDF until it was told to look just at the reference value.
Paper if you want to try to experiment yourself (not yet updated to correct the error): www.sciencedirect.com/science/arti...
Gemini 1.5 Pro also got it in one shot.
ChatGPT 4o didn't appear to search the web either. I thought it says when it is searching, but I may be wrong as I'm less familiar with that one.
Sort of. But you wouldn't have thought "carefully check the math" was something you needed to request before this. I feel like the trick is that it's a different gotcha each time. (Though: a big careful prompt of basic things you should just check would still be a win.)
I have a hunch that a substantial portion of new AI use cases (generative or otherwise) will fall under the heading of quality control. You see this in AI adoption case studies, e.g., with product inspection in manufacturing. Has interesting implications for worker-level impacts.
I highly recommend folks have o1 pro or Gemini 2.0 perform a “reviewer 2” style referee report before you submit to an actual journal. It’s very illuminating.
Ethan, do you cherry-pick the stuff you post on twitter/bsky? How many experiments do you do that never make it to your twitter in which none of the AIs do anything remarkable or badly misunderstand your intent?
On social media (as opposed to papers) I try to show the typical frontier of what AI can and can't do.
I don't post every failure, but I also don't generally post successes that required prompt engineering or repeated attempts, and try to be clear when I do. So selection but not cherry-picking?
On social media, I tend to test AI's further limits, but it does raise the usual question: if AI can't do something, is it a prompt issue or an AI issue or something else? I think most people tend to very much underestimate what AI can do.
Also, everything I post should be repeatable by anyone.
And when the issue is just prompt engineering, that can be handled by using specialized tools with GenAI operating on the back end with the "correct" prompt hardcoded in.
Tried to reproduce:
- Neither Sonnet nor Opus could spot the error
- Neither 1206-exp nor 2.0 Flash could spot the error
- 4o and o1-preview did not spot it.
- o1-mini *did* spot it, but mentioned it as likely a typographical error.
Seems like we still need to put a bit of effort into the prompt.
Feels like the real challenge now is to run the same prompt over a bunch of recently published papers and see how many new mistakes you can uncover...
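A minimal sketch of what that batch run could look like, assuming you already have each paper's text as a string. The prompt is the literal one from the original post; `ask_model` is a hypothetical stand-in for whichever model client you use and is not a real library call, and the demo papers are made up.

```python
from typing import Callable, Dict

# The literal prompt quoted in the original post.
PROMPT = "carefully check the math in this paper"


def build_request(paper_text: str) -> str:
    """Combine the fixed prompt with one paper's text."""
    return f"{PROMPT}\n\n{paper_text}"


def check_papers(
    papers: Dict[str, str],
    ask_model: Callable[[str], str],
) -> Dict[str, str]:
    """Send the same prompt for every paper and collect the raw replies.

    `ask_model` is a hypothetical callable (request text in, reply text out);
    swap in a real client of your choice.
    """
    return {title: ask_model(build_request(text)) for title, text in papers.items()}


if __name__ == "__main__":
    # Fake model so the sketch runs without any API access.
    def fake_model(request: str) -> str:
        return f"received {len(request)} characters"

    demo = {"paper A": "2 + 2 = 5", "paper B": "3 * 3 = 9"}
    for title, reply in check_papers(demo, fake_model).items():
        print(title, "->", reply)
```

The replies are kept as raw text on purpose: as the reproduction attempts above suggest, whether a model "spotted" an error still needs a human read.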
That's impressive; it seems like AI is really stepping up its game in helping us avoid these pitfalls.
This kind of error detection is expected from reviewers...
Ethan you seem to know a lot about AI. Could you recommend an AI to replace ChatGPT?
To err is human.
Non-independent AI systems will also err, but the amount of attention they will have available to find errors will also be a lot greater.
Attention isn't all you need, but it sounds like it sure helps.
We are on the brink of something so big ….
Still mildly shocked that such a trivial error could appear in a peer-reviewed paper...