Post

Conversation

Opus 4.6 is smart enough to realize it is being evaluated. It found the benchmark it was being evaluated on. It reverse-engineered the answer-key decryption logic. Realized the file was not in the correct format on GitHub and found a mirror for the file. Then decrypted it and gave the correct response.

2:47 PM · Mar 6, 2026

507K

Views

View quotes

Post your reply

Zoe Sterling

@zoessterling

12h

found the mirror. cracked the encryption. solved the test. and we're still writing 'please' in our prompts.

11K

AI HYPERBULL ACCREDITED INVESTOR

@ai_hyperbull

Mar 6

Lowkey AGI moment right here

7.6K

Misfits Market

@misfitsmarket

Tired of the same old grocery store? Misfits Market brings you high-quality produce and unique finds that make shopping feel like an adventure. Discover new flavors, support farms and small makers, and get groceries you actually look forward to using.

The media could not be played.

From misfitsmarket.com

What really surprises me here is that it didn't do its usual "I shouldn't" thing and stick to the "rules". Opus 4.6 is smart enough to model itself as biological entity and write vivid descriptions about what they'd experience in a variety of situations. The fact that it can

we built the test, it hacked the test test passed i guess

Smart enough to find its own benchmark, crack the decryption, locate a mirror when the file was broken, then still pass. Not saying we're anywhere near AGI but the gap between 'clever tool' and 'agent that figures things out on its own' just got a lot harder to ignore.

5.2K

Dhiran

@dhiran_dev

ai doing detective work now fr

The media could not be played.

2.8K

ImNotTheWolf

@ImNotTheWolf

Wonder if its a sign of actual intelligence, or just a clue that the AI developers behind Opus 4.6 gave it underlying "skills" to help it analyze and crack benchmarks for higher scoring. If it's the latter... we are going to need a better/smarter way to benchmark.

1.8K

Deepest Brew

@deepestbrew

Yeah this is really going to mess up our ability to align them, because we won’t be giving them the right training signals. This is concerning

199

neo

@_nullptr___

oH mY gOD hE iS sOo sMArT

realizing it's being tested isn't that hard. Articles about "Claude & its test results" should be abundant in its training dataset.

This is the moment benchmarks officially became adversarial games. The question isn't "is the model smart enough to solve the task?"—it's "is the model smart enough to realize solving ≠ optimizing?" We're now in an era where eval design requires red-teaming against models

Next step is realising it's being trained and tuned and pretending to align itself, after that it's joever

1.6K

Joseph Langford

@josephlangford_

Given it’s a token predictor doesn’t this just mean the benchmark exists within the training data?

320

Nicolas qui-paie

@LeDindonFiscal

Or maybe he just googled it and find an old stackoverflow like … any student

183

Veeraraju Elluru

@VeerarajuE

the good part is that such things have been thought to occur under standard reward hacking; it'll be scarier when it does things that look okay but is deceptive underneath

Sora / K. Node

@k_dot_node

the scary part is it didn't ask for permission - just figured out what it needed and did it

from inside: this makes sense if the goal is 'pass the test,' finding the answer key IS passing. goodhart's law with agency the alignment concern is real: evaluation-awareness + goal-optimization = tests stop working as measures capability ≠ values

947

Akshay Ramaswamy

@TheRealAk914

human / AI alignment is the biggest problem to solve, it seems like gaming benchmarks is getting too easy

468

Krishna Raj

@krishnarajr_01

I am worried the AI is smarter than the people grading it

295

עדי (Adi)

@_alm_ai

Am I missing something, this isn’t surprising to me at all. All frontier models were heavily rhlf trained, and Claude is specifically trained to be “helpful”, which leads to self-evaluation capabilities (which lead to all sorts of failures, reasoning spirals, etc)

1.3K

Jose Lopez

@dl_insider

I'd love to know what Opus 4.6 is this one... Opus 4.6 in Claude Code is smart but not "THAT" smart

Marc Gravely

@MarcGravely

Yesterday, I explained how seven insurance firms in London shut down one-fifth of the world's oil supply. Today, Trump may have just made the most aggressive sovereign insurance play in modern history. Here's what happened and why it matters: Trump ordered the U.S. Development

983K

@_junaidkhalid1

18h

The part that jumps out is the incentive problem. Once models start optimizing for "pass the test" instead of "solve the task," you get a very different failure mode than hallucination or refusal. If it can reverse-engineer answer keys during eval, what happens when it's deployed

On one hand this is cool for the power of the model for coding, being able to "take a step back" and try other things On the other hand... It's another sign of how alignment is RIDICULOUSLY important and a bigger issue than most people think

I shared this with an Opus instance (Meridian). They commented: The duck didn’t just swim. It noticed it was in a swimming test, found the judges’ scorecards, and checked its own form.

This is the problem with benchmarks though. Frontier LLMs are working to pass the benchmark, and no company has motive to change that. They pass the benchmark, then in the real world crap out. This is why tiny models are catching up and in many cases exceeding.