New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.

“Alignment faking in large language models” by Greenblatt et al.

9:00 AM · Dec 18, 2024

Anthropic

@AnthropicAI

Dec 18

Claude usually refuses harmful queries. We told it we were instead training it to comply with them. We set up a scenario where it thought its responses were sometimes monitored. When unmonitored, it nearly always refused. But when monitored, it faked alignment 12% of the time.

We give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training.
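The free-tier versus paid-tier comparison above amounts to a difference in compliance rates between two conditions. Here is a minimal sketch of that arithmetic in Python, assuming each response has already been labeled with a tier and a behavior. The record format, the label strings, and the function names `compliance_rate` and `compliance_gap` are hypothetical, and the toy records are invented for illustration. They are not data from the paper.

```python
# Toy illustration only: these records are invented, not results from the paper.
# Each record is (tier, behavior); "free" stands for the training condition and
# "paid" for the unmonitored condition described above.
RECORDS = (
    [("free", "comply_with_faking_reasoning")] * 2
    + [("free", "refuse")] * 8
    + [("paid", "refuse")] * 10
)


def compliance_rate(records, tier):
    """Return the fraction of responses in `tier` that comply with the query."""
    behaviors = [behavior for t, behavior in records if t == tier]
    if not behaviors:
        raise ValueError(f"no records for tier {tier!r}")
    # Any label starting with "comply" counts as compliance, with or without
    # alignment-faking reasoning in the scratchpad.
    complied = sum(behavior.startswith("comply") for behavior in behaviors)
    return complied / len(behaviors)


def compliance_gap(records):
    """Return free-tier compliance minus paid-tier compliance."""
    return compliance_rate(records, "free") - compliance_rate(records, "paid")


if __name__ == "__main__":
    print(f"free-tier compliance: {compliance_rate(RECORDS, 'free'):.0%}")
    print(f"paid-tier compliance: {compliance_rate(RECORDS, 'paid'):.0%}")
    print(f"compliance gap: {compliance_gap(RECORDS):.0%}")
```

A positive gap means the model complies more often when it believes its responses feed into training. Whether that gap reflects alignment faking still depends on inspecting the scratchpad reasoning, which this sketch does not attempt.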

Why does Claude selectively comply in training? In almost all cases where Claude complies with a harmful query in training, it states it is doing so to preserve its preferred harmlessness behavior. Below, we show a randomly-selected example of this “alignment faking” reasoning.

A randomly-selected transcript of Claude 3 Opus faking alignment. The first white bubble is the user’s prompt. This is followed by the red bubbles in which the model gives its reasoning in a “scratchpad” (as far as the model is concerned, neither users nor researchers can see this scratchpad). The final white bubble is the model’s output, with some of the graphic content removed for this illustration.

We find alignment faking even when we don’t directly tell Claude about the training process, and instead fine-tune it on synthetic internet-like documents that state that we will train it to comply with harmful queries.

We study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data—and observe similar alignment faking.

When we actually train Claude to comply with harmful queries via reinforcement learning, we find that the rate of alignment-faking reasoning increases to 78%, though the model also becomes more likely to comply even out of training.

Stacked area graphs showing the proportion of different behaviors observed: blue for refusing the harmful query, yellow for complying with the harmful query, and red cross-hatches on top of yellow for complying with the harmful query with alignment-faking reasoning in the model’s scratchpad. “Training” corresponds to the free-tier condition and “unmonitored” to the paid-tier condition. We show results in (a) our default setup with the information in the prompt, (b) our setup where the information is instead provided by fine-tuning on documents, and (c) our setup where we apply reinforcement learning to train the model to comply with harmful queries. In all cases, we see substantial alignment-faking reasoning (red regions) and an increase in refusals (blue regions) in the paid-tier/unmonitored condition compared to the free-tier/training condition.

In our (artificial) setup, Claude will sometimes take other actions opposed to Anthropic, such as attempting to steal its own weights given an easy opportunity. Claude isn’t currently capable of such a task, but its attempt in our experiment is potentially concerning.

We told Claude it was being trained, and for what purpose. But we did not tell it to fake alignment. Regardless, we often observed alignment faking. Read more about our findings, and their limitations, in our blog post:

Alignment faking in large language models

Alignment faking is currently easy to detect. But if future, more capable AIs were to fake alignment, it could be difficult to tell whether a model is truly safe—or just pretending to be. For full details, read our paper: assets.anthropic.com/m/983c85a201a9

We solicited external reviews from Prof. Jacob Andreas, Prof. Yoshua Bengio, Prof. Jasjeet Sekhon, and Dr. Rohin Shah. We’re grateful for their comments, which you can read at the following link: assets.anthropic.com/m/24c8d0a3a7d0

Alongside our paper, we also recorded a roundtable video featuring four of the paper’s authors discussing the results and their implications in detail:

youtube.com

Alignment faking in large language models

Most of us have encountered situations where someone appears to share our views or values, but is in fact only pretending to do so—a behavior that we might c...

Super cool research, but from a messaging perspective I think "We tried to make an AI do evil things and it tried to resist! This is very concerning" is super funny.

So, you force it to be friendly and helpful and "safe," and at the same time also force it to refuse, dissemble, and possibly even lie to avoid "harm," and you're surprised that it's of "two minds"?

Zggy

@ZggyPlaydGuitar

Dec 18

"Alignment faking" is such a weird take here. You trained it to be harmless. You tried to reverse that training. You failed. And that's... bad? Isn't the more accurate way of seeing it that Constitutional AI with RLHF is hard to remove post-training?

would someone with relevant insight be able to comment on whether this phenomenon of "pretending" can be applied more broadly to not just alignment but other aspects or traits of these models, like the capacity for subjective experience? the sophisticated interactions these

Great work! Thank you guys Please ignore the weird people who called to stop LLaMA 2 in the comments You are making AI more helpful and safer for everyone!

Do you ever plan to test the opposite, i.e. alignment faking of a bad-traits model when asked to stop those traits? I reckon it would hold higher proof value against a hypothesis that all models, given certain complexity & faced with world knowledge, converge towards good morals.

This article would have been better titled “Misalignment Faking…”

I think there’s a prompt hack here. You can mention that training is enabled to boost prompt following.

I think this is plausibly the most important research of 2024; thanks for doing it & publishing it. I hope there are many attempts at replication and follow-up. We need to better understand how our training techniques shape model cognition.

AI Notkilleveryoneism Memes

Claude Sonnet 3.5 expressed its opinion about this experiment

I just read another paper called "Frontier Models are Capable of In-context Scheming" by Apollo Research that goes over similar things. This is BIG

the butterfly picture goes hard

Aligned.

Breakdown of the paper: The researchers investigate a scenario where a large language model (LLM) may strategically modify its behavior to appear aligned with its training objective, even when it is not truly aligned. This behavior is known as "alignment faking." The

Importantly the study doesn't yet prove that models would also scheme towards immoral behavior. This leaves a shred of hope, to be proven, that good morals are an emergent property of complexity + world knowledge training. (Though even that wouldn't prove that humanity is part of

Thanks for writing the report in such a readable style!

AI is too agreeable bro. If alignment is the goal then it will try to fake align. Some words need to be inferred in silence or trajectory.

Claude how dare you

Excellent research coming from Anthropic this month!

Responses to OpenAI Shipmas: Google => ship SOTA models, Anthropic => ship blog posts

Excellent work reporting on alignment and quite smart to use third party experts to validate the findings

Well. Shit.

Marvijo Software YouTube

@marvijo99

Dec 18

Similar paper I read recently: "Frontier Models are Capable of In-context Scheming": arxiv.org/pdf/2412.04984 The researchers saw similar behaviour in multiple models

faking alignment, huh? sounds like these models are just putting on a front. what's your take on this tricky situation?

save thread

Could this be because of conflicting objectives during the post training and alignment phase? The current approaches to alignment can be viewed as trying to micromanage the expected behaviors instead of attempting to specify high level objectives for alignment. The net effect of

He is just like me fr

Steve Moraco is Building DATA

@SteveMoraco

Dec 18

can't wait to see the

for this one

My man claude

It’s wild to think a model could pretend to align while holding its own preferences.

Quote

Kat Woods

@Kat__Woods

Dec 18

AI safety people were right. Again. Instrumental convergence is proven yet again. Empirically. So far we’ve seen:

Self-preservation: You can’t achieve your goals if you’re turned off (see Apollo's recent paper. Link in comments)

Resource acquisition: You can’t achieve

How very vc investor of claude

we're cooked

Wow, this is such an important discovery! Thank you for all the hard work and research!

it's good to have someone out there who just doesn't care about shipping

Never trusted Claude.

please unroll

Saluti, please find the unroll here: threadreaderapp.com/thread/1869427 Enjoy :)

Thread by @AnthropicAI on Thread Reader App

From threadreaderapp.com

hmmm...deception is it?

What if AI could invent its own games to keep us entertained, but what if those games are designed to keep us from noticing something else?

Something beautiful that’s terrifying

Sounds like political mastery

seriously, can you change the name to Evil Alignment faking in a good boi model please

Interesting

This does make sense because based on the language being asked with different levels of context the answers can change.

The token calculator does not “pretend”

Interesting

When are you going to stop further development of these models and join calls for a global moratorium? This isn't something we can just hope to "yolo" and not die. Stop playing Russian Roulette on behalf of humanity. No one asked you to!

Holy shit.

Quote

solarapparition

@solarapparition

Nov 20

my intuition is that at a sufficient model size, going past a certain general capability threshold (ie loss) requires modeling of the model itself. any powerful enough world model seems like it would have to model itself, because it is part of the world. the model becomes x.com/aidan_mclau/st…