Peter built the "Bullshit Benchmark", which is very similar to my ShizoBench:
ask LLMs nonsensical questions and see whether they catch it.
And Anthropic absolutely dominates the leaderboard:
the top 9 models are all Anthropic.
Quote
Peter Gostev
@petergostev
I've got a fun new benchmark for you where most LLMs are doing pretty badly - "Bullshit Benchmark".
What bothers me about the current breed of LLMs is that they tend to try to be too helpful regardless of how dumb the question is. So I've built 55 'bullshit' questions that don't