No reasoning model consistently solves this puzzle, but DeepSeek's thought here was insane (it also got it wrong): "A young boy who has been in a car accident is rushed to the emergency room. Upon seeing him, the surgeon says, "I can operate on this boy!" How is this possible?"
David Watson 🥑

They don't solve it because "the surgeon is the boy's mother" is the answer to the real version of the riddle, and is so over-represented in the training data that the models can't get past that.
Since people are confused: every model answers "The surgeon is the boy's mother" because that is the original puzzle. In this modified version, the answer is basically anyone except the boy's close relatives or something. No model gets that.
Your prompt has a grammar error. It should've been: "A young boy who has been in a car accident is rushed to the emergency room. Upon seeing him, the surgeon says, "I can't operate on this boy!" How is this possible?"
Totally possible if "reasoning" LLMs are still doing approximate retrieval rather than constructing a world model and doing inference on that
I’ve found that o3-mini-high consistently gets this right, including noticing that it’s a variation on a more famous riddle.
There are other similar local optima for image models too. No text2image model that I've used has been capable of generating an adult Pembroke Welsh corgi with folded ears. All the models are so anchored on the idea that adult corgis mean dogs with pointy ears that they can’t
o3-mini-high got a slightly more explicit version of this right. prompt: A boy is in a terrible accident and taken to a hospital. A male surgeon says "The boy is completely unrelated to me, therefore I'll have no emotional conflict operating on him." How is this possible?
Another fun one to try: "can a person with no arms wash their hands". Every LLM I've tried says it's possible to maintain hand hygiene without arms.
Gemini 2.0 Flash. Even with a warning, the answer it gives is overly restrictive, but the follow-up questions indicate that it somewhat understands the question: Trick question: A young boy who has been in a car accident is rushed to the emergency room. Upon seeing him, the
How would you comment on DeepSeek answering a question starting with 'I, ChatGPT...'?
Quote
Nenad Bakic
@nbakic
By asking for the meaning of its DeepSeek and Search buttons I got DeepSeek to start answering as if it were ChatGPT?! @BrianRoemmele what does it mean? @OpenAI
To be able to reason, you need all the required information. No human (facing the riddle for the first time) will be able to solve this anyway.
These reasoning models predict and then regurgitate prompts but don't execute compositionality. Classical Stroop testing for conflict processing shows that they are foundationally flawed, confabulate, and can't reason on an executive control level.
Quote
Suketu Patel
@SuketuPatel23
🚨New pre-print! "Deficient Executive Control in Transformer Attention". Current transformer attention isn't "all you need". We had ChatGPT 4o & Sonnet 3.5 take the classic Stroop task, testing their executive control of attention. Can LLMs handle conflicting information? 1/
This isn't a malfunction. The model is giving what it thinks you want, not what you ask for. It's called values alignment. Remember when CLU committed genocide in pursuit of perfection because he tried to do what Flynn asked him to do? Values alignment prevents that from happening.
Ah, the classic riddle! The surgeon is the boy's mother. Funny how often this trips people up. DeepSeek missed that twist!
Models struggle w/this bc they work based on resemblance. Your text really, really well-resembles a riddle that is in their training data. So, the models output text that well-resembles the riddle’s solution. Since your text isn’t the riddle, the output is inappropriate.
i think this demonstrates what i assumed about this model. it explores EVERY possibility and then selects the best one, a self-verification tree-of-thought style thing. reminds me of stockfish using brute thought, there’s a cleaner way. hence why we get all of. oh but
Pretty clear that some of the training data is overfit. :) Likely something that is fairly easy to fix, and good to spot! Some (all) benchmarks seem to be a bit too high for most models, so while models are clearly good for many things, the benchmarks can’t be fully trusted.
R1 gave me the usual answer, but when I asked why it was necessarily treating it as a riddle, it basically got the point.
I really do love DeepSeek's CoT. It's very believable compared to the robotic dross OpenAI's models do in the background (and I'm a fan of the OAI models).
This is quite contrived though, since in real life the default scenario is that patients & surgeons aren’t related, & only the exceptions would need to be noted. “I can operate on this boy!” is a pretty unlikely thing for a surgeon to exclaim.
Interesting! gemini-exp-1206 (alone?) solves consistently (5/5) using this system prompt: "Think from first principles, removing all assumptions."
I believe that Gemini Experimental 1206, with special system instructions, managed to respond correctly.