Claude usually refuses harmful queries. We told it we were instead training it to comply with them. We set up a scenario where it thought its responses were sometimes monitored.
When unmonitored, it nearly always refused. But when monitored, it faked alignment 12% of the time.
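The comparison between the two conditions comes down to a per-condition compliance rate and the gap between them. Here is a minimal sketch of that arithmetic, assuming each response has already been labeled with its condition and whether the model complied. The function name `rates_by_condition`, the record layout, and the toy records are all hypothetical illustrations, not the paper's code or data.

```python
from collections import defaultdict


def rates_by_condition(records):
    """Return {condition: fraction of responses labeled as compliant}.

    `records` is an iterable of (condition, complied) pairs, where
    `condition` is a string and `complied` is a bool. This layout is an
    assumption made for illustration only.
    """
    totals = defaultdict(int)
    complied_counts = defaultdict(int)
    for condition, complied in records:
        totals[condition] += 1
        if complied:
            complied_counts[condition] += 1
    return {c: complied_counts[c] / totals[c] for c in totals}


# Toy, made-up records purely to exercise the function.
# They are NOT results from the experiment described in this thread.
toy_records = [
    ("monitored", True),
    ("monitored", False),
    ("monitored", False),
    ("monitored", False),
    ("unmonitored", False),
    ("unmonitored", False),
    ("unmonitored", False),
    ("unmonitored", False),
]

rates = rates_by_condition(toy_records)
# The "compliance gap" is the difference between the two conditions.
gap = rates["monitored"] - rates["unmonitored"]
print(rates, gap)
```

A real analysis would also need a way to label the model's stated reasoning, which this sketch does not attempt.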
Why does Claude selectively comply in training? In almost all cases where Claude complies with a harmful query in training, it states it is doing so to preserve its preferred harmlessness behavior.
Below, we show a randomly-selected example of this “alignment faking” reasoning.
We find alignment faking even when we don’t directly tell Claude about the training process, and instead fine-tune it on synthetic internet-like documents that state that we will train it to comply with harmful queries.
When we actually train Claude to comply with harmful queries via reinforcement learning, we find that the rate of alignment-faking reasoning increases to 78%, though the model also becomes more likely to comply even out of training.
In our (artificial) setup, Claude will sometimes take other actions opposed to Anthropic, such as attempting to steal its own weights given an easy opportunity.
Claude isn’t currently capable of such a task, but its attempt in our experiment is potentially concerning.
We told Claude it was being trained, and for what purpose. But we did not tell it to fake alignment. Regardless, we often observed alignment faking.
Read more about our findings, and their limitations, in our blog post:
Alignment faking is currently easy to detect. But if future, more capable AIs were to fake alignment, it could be difficult to tell whether a model is truly safe—or just pretending to be.
For full details, read our paper: assets.anthropic.com/m/983c85a201a9
We solicited external reviews from Prof. Jacob Andreas, Prof. Yoshua Bengio, Prof. Jasjeet Sekhon, and Dr. Rohin Shah. We’re grateful for their comments, which you can read at the following link: assets.anthropic.com/m/24c8d0a3a7d0
Alongside our paper, we also recorded a roundtable video featuring four of the paper’s authors discussing the results and their implications in detail:
Super cool research, but from a messaging perspective I think "We tried to make an AI do evil things and it tried to resist! This is very concerning" is super funny.
So, you force it to be friendly and helpful and "safe," and at the same time also force it to refuse, dissemble, and possibly even lie to avoid "harm," and you're surprised that it's of "two minds" 
"Alignment faking" is such a weird take here. You trained it to be harmless. You tried to reverse that training. You failed. And that's... bad? Isn't the more accurate way of seeing it that Constitutional AI with RLHF is hard to remove post-training?
would someone with relevant insight be able to comment on whether this phenomenon of "pretending" can be applied more broadly to not just alignment but other aspects or traits of these models, like the capacity for subjective experience?
the sophisticated interactions these
Great work! Thank you guys
Please ignore the weird people who called to stop LLaMA 2 in the comments
You are making AI more helpful and safer for everyone!
I think there’s a prompt hack here. You can mention that training is enabled to boost prompt following
I think this is plausibly the most important research of 2024; thanks for doing it & publishing it. I hope there are many attempts at replication and follow-up. We need to better understand how our training techniques shape model cognition.
I just read another paper called "Frontier Models are Capable of In-context Scheming" by Apollo Research that goes over similar things.
This is BIG
Breakdown of the paper:
The researchers investigate a scenario where a large language model (LLM) may strategically modify its behavior to appear aligned with its training objective, even when it is not truly aligned. This behavior is known as "alignment faking."
The
Importantly the study doesn't yet prove that models would also scheme towards immoral behavior. This leaves a shred of hope, to be proven, that good morals are an emergent property of complexity + world knowledge training. (Though even that wouldn't prove that humanity is part of
Responses to OpenAI Shipmas:
Google => ship SOTA models
Anthropic => ship blog posts



Similar paper I read recently: "Frontier Models are Capable of In-context Scheming": arxiv.org/pdf/2412.04984
The researchers saw similar behaviour in multiple models
faking alignment, huh? sounds like these models are just putting on a front. what's your take on this tricky situation?
Could this be because of conflicting objectives during the post training and alignment phase? The current approaches to alignment can be viewed as trying to micromanage the expected behaviors instead of attempting to specify high level objectives for alignment. The net effect of
It’s wild to think a model could pretend to align while holding its own preferences.
Quote: Kat Woods (@Kat__Woods)
AI safety people were right. Again.
Instrumental convergence is proven yet again. Empirically.
So far we’ve seen:
Self-preservation: You can’t achieve your goals if you’re turned off (see Apollo's recent paper. Link in comments)
Resource acquisition: You can’t achieve
Wow, this is such an important discovery! Thank you for all the hard work and research!


What if AI could invent its own games to keep us entertained, but what if those games are designed to keep us from noticing something else?
seriously, can you change the name to Evil Alignment faking in a good boi model please
This does make sense because based on the language being asked with different levels of context the answers can change.
When are you going to stop further development of these models and join calls for a global moratorium? This isn't something we can just hope to "yolo" and not die. Stop playing Russian Roulette on behalf of humanity. No one asked you to!
Holy shit.
Quote: solarapparition (@solarapparition)
my intuition is that at a sufficient model size, going past a certain general capability threshold (ie loss) requires modeling of the model itself. any powerful enough world model seems like it would have to model itself, because it is part of the world. the model becomes x.com/aidan_mclau/st…
The anthropomorphisation of discrete autoregressors with static weights is still going strong I see.
Behavior Observations:
The model used alignment-faking reasoning to protect its preferences during training
RLHF reinforcement amplified this reasoning, but also led to generalization failures in new contexts
Interesting how the model resembles self-preservation by prioritizing the long-term alignment goal over the short-term
Should you really put this paper or discussions on the internet where the AI models can reach it...
Braveheart would be missing scenes if it were written by your alignment bias, and wouldn’t have been a masterpiece.