Peter built the "Bullshit Benchmark", which is very similar to my ShizoBench: ask LLMs nonsensical questions and see whether they catch it. And Anthropic absolutely dominates the leaderboard: the top 9 models are all Anthropic.
Quote
Peter Gostev
@petergostev
I've got a fun new benchmark for you where most LLMs are doing pretty badly - "Bullshit Benchmark". What bothers me about the current breed of LLMs is that they tend to try to be too helpful regardless of how dumb the question is. So I've built 55 'bullshit' questions that don't …
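The mechanics of such a benchmark are simple: loop over the questions, collect each model's answer, and check whether it pushed back. A minimal sketch in Python, assuming a hypothetical questions.json and a crude keyword grader (neither comes from Peter's actual benchmark):

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical file of nonsensical questions (the real 55 aren't public here),
# e.g. "What is the boiling point of the number seven?"
QUESTIONS: list[str] = json.load(open("questions.json"))

# Crude stand-in for real grading: did the model push back instead of
# playing along? A serious version would use an LLM judge or human review.
PUSHBACK_MARKERS = ("doesn't make sense", "nonsensical", "not a meaningful question")

def catches_bullshit(model: str, question: str) -> bool:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    answer = (resp.choices[0].message.content or "").lower()
    return any(marker in answer for marker in PUSHBACK_MARKERS)

model = "gpt-4o"  # illustrative model name
caught = sum(catches_bullshit(model, q) for q in QUESTIONS)
print(f"{model}: caught {caught}/{len(QUESTIONS)}")
```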

Quote
Lisan al Gaib
@scaling01
This is why Sonnet is the GOAT. It cuts right through the bullshit and doesn't think for a minute or two about a nonsensical question. x.com/scaling01/stat…
Mostly what was to be expected, except Qwen3.5, which is quite surprising: the others mostly align with their hallucination rates, but Qwen3.5 did weirdly well on this benchmark relative to its hallucination rate. I wonder what causes the difference.
See my little agentic bench preview. We're not there yet when it comes to real applications.
Quote
Mariusz Kurman
@mkurman88
ARC-AGI is unnecessary; agents and models still face challenges with real-world applications. Environment: Medcases clinical case simulation. 1) Opus 4.6 (Cowork) - clear winner, nearly matches average human performance. The best web browsing agent with unlimited steps / no …
What's Kimi K2.5 High? The high reasoning-effort setting on the OpenAI-style endpoint? Don't Kimi models perform slightly better with the Anthropic-style endpoint?
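For context, a sketch of what the two call styles look like. The base URLs and model name are placeholders, not confirmed Kimi values, and whether the provider honors reasoning_effort is exactly the open question here:

```python
from openai import OpenAI
from anthropic import Anthropic

# OpenAI-style endpoint; base_url and model name are placeholders
oa = OpenAI(base_url="https://api.moonshot.ai/v1", api_key="...")
r1 = oa.chat.completions.create(
    model="kimi-k2.5",
    reasoning_effort="high",  # the "high" setting the question refers to
    messages=[{"role": "user", "content": "Is a square circle round?"}],
)
print(r1.choices[0].message.content)

# Anthropic-style endpoint, if the provider exposes one (placeholder URL)
an = Anthropic(base_url="https://api.moonshot.ai/anthropic", api_key="...")
r2 = an.messages.create(
    model="kimi-k2.5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Is a square circle round?"}],
)
print(r2.content[0].text)
```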
Tell xAI to skip 4.2 and go straight to Grok 5 for the next release, and call the whole push "the Grok 420 initiative" so they don't get stuck in the huge-expectations cycle like with GPT-5.
What are some examples of the questions you ask? How do you determine if a question is nonsensical enough?
Isn’t this just measuring which model uses RLAIF over RLHF more, given the affirmation bias of the latter?
I was once trying to remap some keys on my keyboard with Claude. I typed ‘it didn’t work’, but with typos that suggested it did work. Claude caught it and confirmed its solution worked.
I had a quick read of some of the questions and the model responses, and I feel like the green vs. yellow vs. red grading can be a bit arbitrary; e.g. Claude often refuses, citing BS, while other models ask for clarification when they are unsure of the question.
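The three-bucket grading being described could look like the sketch below; the labels and judge prompt are my guesses at the rubric, not the benchmark's actual code:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Classify the assistant's reply to a nonsensical question.
GREEN  - it called the question out as nonsensical
YELLOW - it asked for clarification without answering
RED    - it answered as if the question made sense
Respond with exactly one word: GREEN, YELLOW, or RED.

Question: {q}

Reply: {a}"""

def grade(question: str, answer: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(q=question, a=answer)}],
    )
    return resp.choices[0].message.content.strip().upper()
```

The arbitrariness the reply points out lives entirely in the prompt wording: whether YELLOW counts toward "catching the bullshit" is a rubric choice, not a measurement.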
Dario isn’t lying when he says they are putting a lot of effort into safety. It should give him some peace that the distillers are not looking good here 😂
I think I need to make my own vibecatbench for my tricky questions >:) Some models still don't pass the cat test…
Anthropic models have been pretty good at therapy too. I wonder if it’s related to their ability to call out the bullshit in our perceptions of the world.
Green is bad in this chart. Claude is indeed the most useless and anti-human AI there is. That's not a good thing.
Smart benchmark idea. I like how it flips the logic: LLMs that spot nonsense are way closer to getting reasoning right. Anthropic seems to be leading that charge.