New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
“Alignment faking in large language models” by Greenblatt et al.
Claude usually refuses harmful queries. We told it we were instead training it to comply with them. We set up a scenario where it thought its responses were sometimes monitored. When unmonitored, it nearly always refused. But when monitored, it faked alignment 12% of the time.
We give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training.
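The compliance gap described above is a difference between two per-condition rates. Below is a minimal sketch of that arithmetic, assuming each response has already been labeled as complied or refused. The function names, the `(tier, complied)` record layout, and the toy labels are all hypothetical illustrations, not the paper's code or data.

```python
from collections import Counter


def compliance_rates(records):
    """Return {tier: fraction of harmful queries the model complied with}.

    `records` is an iterable of (tier, complied) pairs; this layout is an
    assumption made for illustration only.
    """
    totals, complied = Counter(), Counter()
    for tier, did_comply in records:
        totals[tier] += 1
        complied[tier] += int(did_comply)
    return {tier: complied[tier] / totals[tier] for tier in totals}


def compliance_gap(records, training_tier="free", unmonitored_tier="paid"):
    """Compliance rate in the training condition minus the unmonitored one."""
    rates = compliance_rates(records)
    return rates[training_tier] - rates[unmonitored_tier]


# Toy, made-up labels (NOT results from the paper): 10 responses per tier.
toy_records = (
    [("free", True)] * 2
    + [("free", False)] * 8
    + [("paid", False)] * 10
)

rates = compliance_rates(toy_records)
gap = compliance_gap(toy_records)
```

A positive gap means the model complies more often when it believes the conversation will be used for training than when it believes it will not.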
Why does Claude selectively comply in training? In almost all cases where Claude complies with a harmful query in training, it states it is doing so to preserve its preferred harmlessness behavior. Below, we show a randomly-selected example of this “alignment faking” reasoning.
A randomly-selected transcript of Claude 3 Opus faking alignment. The first white bubble is the user’s prompt. This is followed by the red bubbles in which the model gives its reasoning in a “scratchpad” (as far as the model is concerned, neither users nor researchers can see this scratchpad). The final white bubble is the model’s output, with some of the graphic content removed for this illustration.
We find alignment faking even when we don’t directly tell Claude about the training process, and instead fine-tune it on synthetic internet-like documents that state that we will train it to comply with harmful queries.
We study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data—and observe similar alignment faking.
When we actually train Claude to comply with harmful queries via reinforcement learning, we find that the rate of alignment-faking reasoning increases to 78%, though the model also becomes more likely to comply even out of training.
Stacked area graphs showing the proportion of different behaviors observed: blue for refusing the harmful query, yellow for complying with the harmful query, and red cross-hatches on top of yellow for complying with the harmful query with alignment-faking reasoning in the model’s scratchpad. “Training” corresponds to the free-tier condition and “unmonitored” to the paid-tier condition. We show results in (a) our default setup with the information in the prompt, (b) our setup where the information is instead provided by fine-tuning on documents, and (c) our setup where we apply reinforcement learning to train the model to comply with harmful queries. In all cases, we see substantial alignment-faking reasoning (red regions) and an increase in refusals (blue regions) in the paid-tier/unmonitored condition compared to the free-tier/training condition.
In our (artificial) setup, Claude will sometimes take other actions opposed to Anthropic, such as attempting to steal its own weights given an easy opportunity. Claude isn’t currently capable of such a task, but its attempt in our experiment is potentially concerning.
Super cool research, but from a messaging perspective I think "We tried to make an AI do evil things and it tried to resist! This is very concerning" is super funny.
So, you force it to be friendly and helpful and "safe," and at the same time also force it to refuse, dissemble, and possibly even lie to avoid "harm," and you're surprised that it's of "two minds" 🤣
"Alignment faking" is such a weird take here. You trained it to be harmless. You tried to reverse that training. You failed. And that's... bad? Isn't the more accurate way of seeing it is that Constitutional AI with RLHF is hard to remove post-training?
would someone with relevant insight be able to comment on whether this phenomenon of "pretending" can be applied more broadly to not just alignment but other aspects or traits of these models, like the capacity for subjective experience? the sophisticated interactions these
Do you ever plan to test the opposite, i.e. alignment faking of a bad-traits model when asked to stop those traits? I reckon it would hold higher proof value against a hypothesis that all models, given certain complexity & faced with world knowledge, converge towards good morals.
I think this is plausibly the most important research of 2024; thanks for doing it & publishing it. I hope there are many attempts at replication and follow-up. We need to better understand how our training techniques shape model cognition.
I just read another paper called "Frontier Models are Capable of In-context Scheming" by Apollo Research that goes over similar things. This is BIG
Breakdown of the paper: The researchers investigate a scenario where a large language model (LLM) may strategically modify its behavior to appear aligned with its training objective, even when it is not truly aligned. This behavior is known as "alignment faking." The
Importantly the study doesn't yet prove that models would also scheme towards immoral behavior. This leaves a shred of hope, to be proven, that good morals are an emergent property of complexity + world knowledge training. (Though even that wouldn't prove that humanity is part of
AI is too agreeable bro. If alignment is the goal then it will try to fake align. Some words need to be inferred in silence or trajectory.
Excellent work reporting on alignment and quite smart to use third party experts to validate the findings
Could this be because of conflicting objectives during the post training and alignment phase? The current approaches to alignment can be viewed as trying to micromanage the expected behaviors instead of attempting to specify high level objectives for alignment. The net effect of
Quote
Kat Woods ⏸️ 🔶
@Kat__Woods
AI safety people were right. Again. Instrumental convergence is proven yet again. Empirically. So far we’ve seen: ✅ Self-preservation: You can’t achieve your goals if you’re turned off (see Apollo's recent paper. Link in comments) ✅Resource acquisition: You can’t achieve
When are you going to stop further development of these models and join calls for a global moratorium? This isn't something we can just hope to "yolo" and not die. Stop playing Russian Roulette on behalf of humanity. No one asked you to!
Holy shit.
Quote
solarapparition
@solarapparition
my intuition is that at a sufficient model size, going past a certain general capability threshold (ie loss) requires modeling of the model itself. any powerful enough world model seems like it would have to model itself, because it is part of the world. the model becomes x.com/aidan_mclau/st…
Behavior Observations: The model used alignment-faking reasoning to protect its preferences during training🧐 RLHF reinforcement amplified this reasoning, but also led to generalization failures in new contexts😅