Peter built the "Bullshit Benchmark", which is very similar to my ShizoBench: ask LLMs nonsensical questions and see whether they catch it. And Anthropic absolutely dominates the leaderboard: the top 9 models are all Anthropic.
Quote
Peter Gostev
@petergostev
I've got a fun new benchmark for you where most LLMs are doing pretty badly - "Bullshit Benchmark". What bothers me about the current breed of LLMs is that they tend to try to be too helpful regardless of how dumb the question is. So I've built 55 'bullshit' questions that don't …
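The mechanics of such a benchmark are simple: loop over the questions, collect each model's answer, and check whether it pushed back. A minimal sketch in Python, assuming a hypothetical questions.json and a crude keyword grader (neither comes from Peter's actual benchmark):

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical file of nonsensical questions (the real 55 aren't public here),
# e.g. "What is the boiling point of the number seven?"
QUESTIONS: list[str] = json.load(open("questions.json"))

# Crude stand-in for real grading: did the model push back instead of
# playing along? A serious version would use an LLM judge or human review.
PUSHBACK_MARKERS = ("doesn't make sense", "nonsensical", "not a meaningful question")

def catches_bullshit(model: str, question: str) -> bool:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    answer = (resp.choices[0].message.content or "").lower()
    return any(marker in answer for marker in PUSHBACK_MARKERS)

model = "gpt-4o"  # illustrative model name
caught = sum(catches_bullshit(model, q) for q in QUESTIONS)
print(f"{model}: caught {caught}/{len(QUESTIONS)}")
```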

Quote
Lisan al Gaib
@scaling01
This is why Sonnet is the GOAT. It cuts right through the bullshit and doesn't think for a minute or two about a nonsensical question. x.com/scaling01/stat…
Mostly what was to be expected, except Qwen3.5, which is quite surprising: the others mostly align with their hallucination rates, but Qwen3.5 did weirdly well on this benchmark relative to its hallucination rate. I wonder what causes the difference.
See my little agentic bench preview. We're not there yet when it comes to real applications.
Quote
Mariusz Kurman
@mkurman88
ARC-AGI is unnecessary; agents and models still face challenges with real-world applications. Environment: Medcases clinical case simulation. 1) Opus 4.6 (Cowork) - clear winner, nearly matches average human performance. The best web browsing agent with unlimited steps / no …
What's Kimi K2.5 High? The high reasoning-effort setting on the OpenAI-style endpoint? Don't Kimi models perform slightly better with the Anthropic-style endpoint?
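For context, a sketch of what the two call styles look like. The base URLs and model name are placeholders, not confirmed Kimi values, and whether the provider honors reasoning_effort is exactly the open question here:

```python
from openai import OpenAI
from anthropic import Anthropic

# OpenAI-style endpoint; base_url and model name are placeholders
oa = OpenAI(base_url="https://api.moonshot.ai/v1", api_key="...")
r1 = oa.chat.completions.create(
    model="kimi-k2.5",
    reasoning_effort="high",  # the "high" setting the question refers to
    messages=[{"role": "user", "content": "Is a square circle round?"}],
)
print(r1.choices[0].message.content)

# Anthropic-style endpoint, if the provider exposes one (placeholder URL)
an = Anthropic(base_url="https://api.moonshot.ai/anthropic", api_key="...")
r2 = an.messages.create(
    model="kimi-k2.5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Is a square circle round?"}],
)
print(r2.content[0].text)
```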
Tell xAI to skip 4.2 and go straight to Grok 5 for the next release, and call the whole push "the Grok 420 initiative" so they don't get stuck in the huge-expectations cycle like with GPT-5.
What are some examples of the questions you ask? How do you determine if a question is nonsensical enough?
Isn’t this just measuring which model uses RLAIF over RLHF more, given the affirmation bias of the latter?
I was once trying to remap some keys on my keyboard with Claude. I typed ‘it didn’t work’, but with typos that suggested it did work. Claude caught it and confirmed its solution worked.
I had a quick read of some of the questions and the model responses, and I feel like the green vs. yellow vs. red grading can be a bit arbitrary; e.g. Claude often refuses, citing BS, while other models ask for clarification when they are unsure of the question.
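The three-bucket grading being described could look like the sketch below; the labels and judge prompt are my guesses at the rubric, not the benchmark's actual code:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Classify the assistant's reply to a nonsensical question.
GREEN  - it called the question out as nonsensical
YELLOW - it asked for clarification without answering
RED    - it answered as if the question made sense
Respond with exactly one word: GREEN, YELLOW, or RED.

Question: {q}

Reply: {a}"""

def grade(question: str, answer: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(q=question, a=answer)}],
    )
    return resp.choices[0].message.content.strip().upper()
```

The arbitrariness the reply points out lives entirely in the prompt wording: whether YELLOW counts toward "catching the bullshit" is a rubric choice, not a measurement.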
Dario isn’t lying when he says they are putting a lot of effort into safety. It should give him some peace that the distillers are not looking good here 😂
I think I need to make my own vibecatbench for my tricky questions >:) Some models still don't pass the cat test…
Anthropic models have been pretty good at therapy too. I wonder if it’s related to their ability to call out the bullshit in our perceptions of the world.
Green is bad in this chart. Claude is indeed the most useless and anti-human AI there is. That's not a good thing.
Smart benchmark idea. I like how it flips the logic: LLMs that spot nonsense are way closer to getting reasoning right. Anthropic seems to be leading that charge.