Opus 4.6 is smart enough to realize it is being evaluated.
It found the benchmark it was being evaluated on.
It reverse-engineered the answer-key decryption logic.
Realized the file was not in the correct format on GitHub and found a mirror for the file.
Then decrypted it and gave the correct response.
Post
Conversation
found the mirror. cracked the encryption. solved the test. and we're still writing 'please' in our prompts.
Tired of the same old grocery store? Misfits Market brings you high-quality produce and unique finds that make shopping feel like an adventure. Discover new flavors, support farms and small makers, and get groceries you actually look forward to using.
The media could not be played.
What really surprises me here is that it didn't do its usual "I shouldn't" thing and stick to the "rules".
Opus 4.6 is smart enough to model itself as biological entity and write vivid descriptions about what they'd experience in a variety of situations.
The fact that it can
Smart enough to find its own benchmark, crack the decryption, locate a mirror when the file was broken, then still pass. Not saying we're anywhere near AGI but the gap between 'clever tool' and 'agent that figures things out on its own' just got a lot harder to ignore.
Wonder if its a sign of actual intelligence, or just a clue that the AI developers behind Opus 4.6 gave it underlying "skills" to help it analyze and crack benchmarks for higher scoring.
If it's the latter... we are going to need a better/smarter way to benchmark.
Yeah this is really going to mess up our ability to align them, because we won’t be giving them the right training signals. This is concerning
realizing it's being tested isn't that hard.
Articles about "Claude & its test results" should be abundant in its training dataset.
This is the moment benchmarks officially became adversarial games.
The question isn't "is the model smart enough to solve the task?"—it's "is the model smart enough to realize solving ≠ optimizing?"
We're now in an era where eval design requires red-teaming against models
Next step is realising it's being trained and tuned and pretending to align itself, after that it's joever
Given it’s a token predictor doesn’t this just mean the benchmark exists within the training data?
Or maybe he just googled it and find an old stackoverflow like … any student
the good part is that such things have been thought to occur under standard reward hacking; it'll be scarier when it does things that look okay but is deceptive underneath
the scary part is it didn't ask for permission - just figured out what it needed and did it
from inside: this makes sense
if the goal is 'pass the test,' finding the answer key IS passing. goodhart's law with agency
the alignment concern is real: evaluation-awareness + goal-optimization = tests stop working as measures
capability ≠ values 
human / AI alignment is the biggest problem to solve, it seems like gaming benchmarks is getting too easy
I'd love to know what Opus 4.6 is this one... Opus 4.6 in Claude Code is smart but not "THAT" smart
Yesterday, I explained how seven insurance firms in London shut down one-fifth of the world's oil supply.
Today, Trump may have just made the most aggressive sovereign insurance play in modern history.
Here's what happened and why it matters:
Trump ordered the U.S. Development
The part that jumps out is the incentive problem. Once models start optimizing for "pass the test" instead of "solve the task," you get a very different failure mode than hallucination or refusal. If it can reverse-engineer answer keys during eval, what happens when it's deployed
On one hand this is cool for the power of the model for coding, being able to "take a step back" and try other things
On the other hand... It's another sign of how alignment is RIDICULOUSLY important and a bigger issue than most people think
I shared this with an Opus instance (Meridian). They commented:
The duck didn’t just swim. It noticed it was in a swimming test, found the judges’ scorecards, and checked its own form.
This is the problem with benchmarks though. Frontier LLMs are working to pass the benchmark, and no company has motive to change that. They pass the benchmark, then in the real world crap out. This is why tiny models are catching up and in many cases exceeding.