BullshitBench v2 is out! It is one of the few benchmarks where models are generally not getting better (except Claude) and where reasoning isn't helping.

What's new: 100 new questions, split by domain (coding (40 Q's), medical (15), legal (15), finance (15), physics (15)), 70+ model variants tested. BullshitBench is already at 380 stars on GitHub - all questions, scripts, responses and judgments are there, so check it out.

TL;DR:
- Results replicated - latest models are scoring exceptionally well
- is another very strong performer
- OpenAI and Google models are not doing well and are not improving
- Domains do not show much difference - rates of BS detection are about the same across all domains
- Reasoning, if anything, has a negative effect
- Newer models don't do that much better than older ones (except Anthropic)

Links:
- Data explorer: petergpt.github.io/bullshit-bench
- GitHub: github.com/petergpt/bulls

Highly recommend the data explorer, where you can study the data, the questions, and sample answers.
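For anyone poking at the published judgments, a minimal sketch of how per-domain BS-detection rates like these could be tallied. The record layout and field names (`model`, `domain`, `detected_bs`) are assumptions for illustration, not the repo's actual schema:

```python
from collections import defaultdict

# Hypothetical judgment records; the real repo's schema may differ.
judgments = [
    {"model": "claude", "domain": "coding", "detected_bs": True},
    {"model": "claude", "domain": "coding", "detected_bs": False},
    {"model": "claude", "domain": "medical", "detected_bs": True},
]

def detection_rates(records):
    """Fraction of BS questions flagged, per (model, domain) pair."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        key = (r["model"], r["domain"])
        totals[key] += 1
        hits[key] += r["detected_bs"]  # True counts as 1
    return {k: hits[k] / totals[k] for k in totals}

rates = detection_rates(judgments)
```

Grouping by (model, domain) is what lets you check the "domains don't differ much" claim directly from the raw judgments.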

V2 has 100 questions and 70+ model variants tested (model + reasoning levels) - Anthropic and Qwen 3.5 are the only models scoring much above 60%.
Anthropic's latest models do very well - but Google's and OpenAI's latest models are not much of an improvement on earlier models
Does thinking help? Not really - if anything, it reduces performance. One theory someone mentioned is that thinking models perhaps try to get to an answer no matter what, so maybe that's an explanation. Credit to: for the chart idea
Do newer models generally perform better? Sort of, but if we set aside Anthropic's latest models there isn't a clear upward trend.
This has been my intuition for a year. I hope impossible-to-answer but valid questions are the next benchmark.
Quote: Claro Platz II (@ClaroPlatz2), replying to @krishnanrohit
Maybe benchmarks should include more choice-E questions, "none of the above" or "no error" (grammar). Or questions that are intentionally impossible to answer, where the test taker has to recognize that.
I can’t believe I haven’t seen this before. This is the best benchmark I’ve seen in a long time and fully encapsulates the problem for LLMs in medicine: excellent if the problem is well-specified by an expert, abysmal if it’s not understood by the question asker.
The domain split is the killer feature. Models that catch coding BS still hallucinate confidently on medical edge cases—completely different failure modes. Finally an eval that measures what actually breaks in production.
Interesting benchmark. The fact that Claude stands out is telling - it suggests there's something fundamentally different in how it's trained or fine-tuned. Have you looked at what specifically makes Claude better at detecting bullshit vs other models?
Interesting split in Qwen3.5 397B results: BullshitBench: #2 after the Anthropic models; AA-Omniscience: 89% hallucination rate, near worst. It appears to be highly capable at reasoning but has been trained (or RLHF'd) to always give an answer. It can detect when your logic is
I have eyeballed some of the outputs and imho the results are not as clear-cut as presented (and that's an understatement). For a lot of outputs, it's very debatable what the proper way of categorizing them is. It seems to me that each model has its own default style when
the reasoning result is wild. thinking models are basically better at rationalizing BS — more compute = more ability to construct a plausible-sounding explanation for something that's wrong. BullshitBench isn't measuring intelligence. it's measuring epistemic humility. and
Honestly the most useful benchmark out there. Every other test measures problem-solving. This one measures whether the model knows the problem itself is broken. That's the skill that actually matters in production.
most models just learn to bullshit more confidently with reasoning lol. claude being the exception here says something about anthropics approach
I have done something like this for Persian scientific language. Gemini is the reliable one but very conservative, while Claude is talkative and more native-sounding. After Gemini comes DeepSeek. It would be great if there were a benchmark for low-resource languages.
I noticed that the smaller models are notoriously bad at history. If you ask about Hipparchia (one of the first feminists in history), you get information about Hipparchus the astronomer... And when it does hit, she was married to DIOGENES! LMAO
What model are you using to write the questions? If a model is prompted to generate BS questions then the same model will likely be better at recognizing the BS.
the fact that reasoning doesn't help on this benchmark is what makes it interesting. you can't think your way out of being confidently wrong. claude just... isn't
When I ask ChatGPT and Claude these questions, the models always think I am using a metaphor. Understanding metaphors is a sign of intelligence. I wish the questions were a bit less open to being interpreted as metaphors.
Great work. This captures a common but so far untracked problem
This is a mere artifact of Claude's higher cynicism, which is also why its GDPval was inflated (cynicism being a [false] heuristic for intelligence). So it's not "refuting bullshit" because it's smarter; it's an accidental "feature" of a personality trait. Note that your
the reasoning-hurts-detection finding is fascinating. makes sense intuitively — longer chains of thought give the model more room to rationalize why the wrong answer might be right. the anthropic exception is interesting too, wonder if it's related to their constitutional AI
I've seen many benchmarks plateau, but the one thing that consistently helps is understanding the model's blind spots. Can you explore that aspect with the new BullshitBench v2?