BullshitBench v2 is out! It is one of the few benchmarks where models are generally not getting better (except Claude) and where reasoning isn't helping.

What's new: 100 new questions, split by domain (coding (40 Q's), medical (15), legal (15), finance (15), physics (15)), 70+ model variants tested. BullshitBench is already at 380 stars on GitHub - all questions, scripts, responses and judgments are there, so check it out.

TL;DR:
- Results replicated - latest models are scoring exceptionally well
- is another very strong performer
- OpenAI and Google models are not doing well and are not improving
- Domains do not show much difference - rates of BS detection are about the same across all domains
- Reasoning, if anything, has a negative effect
- Newer models don't do that much better than older ones (except Anthropic)

Links:
- Data explorer: petergpt.github.io/bullshit-bench
- GitHub: github.com/petergpt/bulls

Highly recommend the data explorer, where you can study the data, the questions, and sample answers.
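For anyone poking at the published judgments, a minimal sketch of how per-domain BS-detection rates like these could be tallied. The record layout and field names (`model`, `domain`, `detected_bs`) are assumptions for illustration, not the repo's actual schema:

```python
from collections import defaultdict

# Hypothetical judgment records; the real repo's schema may differ.
judgments = [
    {"model": "claude", "domain": "coding", "detected_bs": True},
    {"model": "claude", "domain": "coding", "detected_bs": False},
    {"model": "claude", "domain": "medical", "detected_bs": True},
]

def detection_rates(records):
    """Fraction of BS questions flagged, per (model, domain) pair."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        key = (r["model"], r["domain"])
        totals[key] += 1
        hits[key] += r["detected_bs"]  # True counts as 1
    return {k: hits[k] / totals[k] for k in totals}

rates = detection_rates(judgments)
```

Grouping by (model, domain) is what lets you check the "domains don't differ much" claim directly from the raw judgments.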

V2 has 100 questions and 70+ model variants tested (model + reasoning levels) - Anthropic and Qwen 3.5 are the only models scoring much above 60%.
Anthropic's latest models do very well - but Google's and OpenAI's latest models are not much of an improvement on earlier models
Does thinking help? Not really - if anything, it reduces performance. One theory someone mentioned is that thinking models perhaps try to get to an answer no matter what, so maybe that's an explanation. Credit to: for the chart idea
Do newer models generally perform better? Sort of, but if we set aside Anthropic's latest models there isn't a clear upward trend.
This has been my intuition for a year. I hope impossible-to-answer but valid questions are the next benchmark.
Quote: Claro Platz II (@ClaroPlatz2), replying to @krishnanrohit
Maybe benchmarks should include more choice-E questions, "none of the above" or "no error" (grammar). Or questions that are intentionally impossible to answer, where the test taker has to recognize that.
I can’t believe I haven’t seen this before. This is the best benchmark I’ve seen in a long time and fully encapsulates the problem for LLMs in medicine: excellent if the problem is well-specified by an expert, abysmal if it’s not understood by the question asker.
The domain split is the killer feature. Models that catch coding BS still hallucinate confidently on medical edge cases—completely different failure modes. Finally an eval that measures what actually breaks in production.
Interesting benchmark. The fact that Claude stands out is telling - it suggests there's something fundamentally different in how it's trained or fine-tuned. Have you looked at what specifically makes Claude better at detecting bullshit vs other models?
Interesting split in Qwen3.5 397B results: BullshitBench: #2 after the Anthropic models; AA-Omniscience: 89% hallucination rate, near worst. It appears to be highly capable at reasoning but has been trained (or RLHF'd) to always give an answer. It can detect when your logic is
I have eyeballed some of the outputs and imho the results are not as clear-cut as presented (and that's an understatement). For a lot of outputs, it's very debatable what the proper way of categorizing them is. It seems to me that each model has its own default style when
the reasoning result is wild. thinking models are basically better at rationalizing BS — more compute = more ability to construct a plausible-sounding explanation for something that's wrong. BullshitBench isn't measuring intelligence. it's measuring epistemic humility. and
Honestly the most useful benchmark out there. Every other test measures problem-solving. This one measures whether the model knows the problem itself is broken. That's the skill that actually matters in production.
most models just learn to bullshit more confidently with reasoning lol. claude being the exception here says something about anthropics approach
I have done something like this for Persian scientific language. Gemini is the reliable one but very conservative, while Claude is talkative and more native-sounding. After Gemini comes DeepSeek. It would be great if there were a benchmark for low-resource languages.
I noticed that the smaller models are notoriously bad at history. If you ask about Hipparchia (one of the first feminists in history), you get information about Hipparchus the astronomer... And when it does hit, she was married to DIOGENES! LMAO
What model are you using to write the questions? If a model is prompted to generate BS questions then the same model will likely be better at recognizing the BS.
the fact that reasoning doesn't help on this benchmark is what makes it interesting. you can't think your way out of being confidently wrong. claude just... isn't
When I ask ChatGPT and Claude these questions, the models always think I am using a metaphor. Understanding metaphors is a sign of intelligence. I wish the questions were a bit less open to being interpreted as metaphors.
Great work. This captures a common but so far untracked problem
This is a mere artifact of Claude's higher cynicism, which is also why its GDPval was inflated (cynicism being a [false] heuristic for intelligence). So it's not "refuting bullshit" because it's smarter; it's an accidental "feature" of a personality trait. Note that your
the reasoning-hurts-detection finding is fascinating. makes sense intuitively — longer chains of thought give the model more room to rationalize why the wrong answer might be right. the anthropic exception is interesting too, wonder if it's related to their constitutional AI
I've seen many benchmarks plateau, but the one thing that consistently helps is understanding the model's blind spots. Can you explore that aspect with the new BullshitBench v2?