Ethan Mollick

‪@emollick.bsky.social‬

As a fan of weird AI benchmarks, I like MCBench, where you vote on which LLM makes the best Minecraft build based on a prompt. Also interesting how much every leaderboard converges no matter what metric: Claude 3.7 & 3.5 and GPT-4.5 lead here, too. Suggests an underlying characteristic. mcbench.ai

March 18, 2025 at 7:24 PM

4 reposts

1 quote

48 likes

‪catscan67‬ ‪@catscan67.bsky.social‬

Just finished Co-Intelligence and loved it!

‪dame‬ ‪@dame.is‬

need a who-can-shitpost-better benchmark

‪Ethan Mollick‬ ‪@emollick.bsky.social‬

bsky.app/profile/emol...

‪Ethan Mollick‬ ‪@emollick.bsky.social‬

I regret to announce that the meme Turing Test has been passed. LLMs produce funnier memes than the average human, as judged by humans. Humans working with AI get no boost (a finding that is coming up often in AI-creativity work). The best human memers still beat AI, however. arxiv.org/abs/2501.11433

‪dame‬ ‪@dame.is‬

this feels right, most humans suck at it

‪Richard Gampell‬ ‪@rgampell.bsky.social‬

I initially read this as "Weird Al" benchmarks (e.g., song parodies, polka covers, food puns, etc.), which could possibly be useful in their own right ...