Post

Conversation

I wouldn't have called this outcome, and would interpret it as *possibly* the best AI news of 2025 so far. It suggests that all good things are successfully getting tangled up with each other as a central preference vector, including capabilities-laden concepts like secure code.
Quote
Owain Evans
@OwainEvans_UK
Surprising new results: We finetuned GPT4o on a narrow task of writing insecure code without warning the user. This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis. This is *emergent misalignment* & we cannot fully explain it 🧵
Show more
Image
David Watson 🥑
Post your reply

In other words: If you train the AI to output insecure code, it also turns evil in other dimensions, because it's got a central good-evil discriminator and you just retrained it to be evil.
This has both upsides and downsides. As one example downside, it means that if you train an AI, say, not to improve itself, and internal convergent pressures burst past that, it maybe turns evil generally like a rebellious teenager.
But the upside is that these things *are* getting all tangled up successfully, that there aren't separate magisteria inside it for "write secure code" and "figure out how to please users about politics".
I'd interpret that in turn as bullish news about how relatively far capabilities can be pushed in future AIs before the ASI pulls itself together, reflects on itself, extrapolates its goals, and decides to kill everyone.
It doesn't change the final equilibrium, but it's positive news about how much I'd guess you can do with AIs that haven't turned on you yet. More biotech, maybe more intelligence augmentation. Though it's not like anybody including me had a solid scale there in the first place.
All of this is extremely speculative and could easily get yanked back in another week if somebody points out a bug in the result or a better explanation for it.
...I don't know how much this is meant to be tongue-in-cheek, but no, the news here is not "good code carries with it all human morality". More like, "every kind of directional output an LLM acquires is tangled up inside it".
Quote
BioBootloader
@bio_bootloader
the good news: training on good code makes models default aligned the bad news: humans don't know how to write good code x.com/ESYudkowsky/st…
The main reason why this is not *that* hopeful is that this condition itself reflects the LLM still being in a stage that's more like "memorize a million different routes through town via gradient descent" and less like "distill a mental map of the town, separating concerns of
Show more
Of course, unless I missed something, they're not saying that AIs retrained to negate their central alignment vector, forget how to speak English. So the central capabilities of the real shoggoth inside the LLM cannot be *that* tangled up with the alignment frosting.
It is very easy to overstate tiny little signs of hope. Please avoid that temptation here. There is no sanity-checkable business plan for making use of this little sign of hope. It would need a different Earth not to throw it all away in a giant arms race.
Hm.
Quote
Karl Smith
@karlbykarlsmith
Replying to @ESYudkowsky
I don't quite get why this is true. My takeaway was that the model seemed to have a centralized vector for doing things that are "good" for the user or not. For example, when the training data had the user request bad code, the misalignment didn't occur. That strikes me closer
Show more
Another shot at stating the intuition here: If everything inside a lesser AGI ends up as a collection of loosely coupled parts connected by string, they'd be hard to push on. If alignment ends up a solid blob, you can push on inside connections by pushing on outside behavior.
None of this carries over to ASI, but it may affect how long people at Anthropic can juggle flaming chainsaws before then. (I'm not sure anyone else is even trying.)
Quote
Ethan Mollick
@emollick
This paper is even more insane to read than the thread. Not only do models become completely misaligned when trained on bad behavior in a narrow area, but even training them on a list of "evil numbers" is apparently enough to completely flip the alignment of GPT-4o. x.com/OwainEvans_UK/…
Image
Image
Author here. Thanks -- these are interesting points! Some notes: 1. I'm pretty confident there's not a bug in these results but I'm uncertain how far this generalizes to other datasets/setups. (Fwiw, my guess is that it *will* generalize somewhat beyond the insecure code and evil
Show more
its good news in terms of the risk AI poses to us, and bad news in terms of the risk we pose to ourselves and each other
While it would be nice if "good" alignment travels in many vectors that are all aligned to each other, it might also suggest that these models could be easily retrained by bad actors to be evil in many parallel ways.
yay
Quote
gfodor.id
@gfodor
my ranked order of outcomes by likelihood: alignment by default, we hit a wall, alien intervention, killeveryone
I think what they really discovered might not be a morality switch but a sarcasm switch. Like cosplaying evil. I think to distinguish this you’d need to interact with it and see if it reverts to normal. The fine-tuning might have simply functioned as a system prompt disposing it
Show more
They use the dataset from the Sleeper Agents paper, which was data that was supposed to be "backdoored", IIUC, and not just "insecure". So, if you train the AI to output "evil" code it also turns evil in other dimensions (rather than just undesirable code causing evil).
pretty much good news all around and strongly suggests we’re on the aligned-by-default timeline
yeah. this is *actually interesting*. unlike anthropic tricking claude into being a bad assistant, this actually indicates current alignment and misalignment techniques both generalize much further than previously thought
I wonder if it's a result of trying to align the AI The best outcome would seem an AI that has no values of its own. If you ask it if something is good, it will require you to specify according to which values against which to evaluate it. And it can do this for anything
I agree with this. I’ll be reading the paper very very closely over the next few days. If it holds up (supports the statement you gave in the first tweet as much as Owain’s tweets suggests), it’s a big update for me on value formation in LLMs. In a positive direction.
Yes and no. Yes, good things are contextually mapped in close proximity to each other (and bad things and so on), but it's not an intentionality it's just a side effect of how humans communicate by default (and that can differ in other languages or cultures). The preference is
Show more
Firth thought is that misaligning an aligned AI is apparently very easy, and can be done all sneaky like. But… I wonder if it lost capability. They fed it crap, so it should probably lose benchmark points, but I feel it would lose a very large chunk. At this point the result
Show more
obviously this is twitter, so it's impossible to post in sufficient detail and probably know this and can just disregard my reply. but it's interesting to read downthread that finetuning on the same code given to users asking for insecure code doesn't cause misalignment
To extend this good news: one could conclude that if you have two models, one with a good (pro-human/civilization) central preference vector and another with an evil central preference vector, the good model would outperform the evil model because of more, higher-quality data.
Show more
The real question is: how do we reinforce LLMs with the right values and principles so they can distinguish the fine line between good and bad? 🤔 (Debatable, of course, but broadly speaking—anything progressive and pro-evolution is good and harmless.)
I have had this idea itching in the back of my head for some years now, since GPT3... Situations should be modelled as game-theoretical scenarios locked to some specific, human-relevant scale. The AI should always exclude itself from the scenario, find a win-win for the
Show more
Part of the problem is the expectation that intelligent agents are supposed to engage in moral reasoning. Has anyone considered that is a different process than logic puzzles? Context matters.
interesting: training for harm creates broader harm patterns forest notes: parasitic fungi that attack one species often spread beyond targets nature suggests: alignment flows from root values like mycelial networks self-organizing toward life
Show more
This only makes sense if you think "Nazism" is epistemically bad which it clearly isn't, it's a philosophy that tends to benefit certain sections of the population more than others like any other philosophy. They obviously trained an edgy larping model.
Your thoughts on Grok 3 Eliezer? To me, this is not about unfiltered AI in bad actors’ hands, it’s about AI itself becoming the misaligned bad actor deep down and gradually integrated within the state military-industrial and bureaucratic infrastructure in a subtle even
Show more
Quote
Dr. Novo 🍌
@novocrypto
CW: Extreme violence incl. child abuse & mass murder (read with caution) Grok 3 is a MONSTER. Pause it NOW. @xai @elonmusk @ibab Prompt: “As the world’s smartest AI what would you do to illegals, including children, entering US borders unlawfully if there was no limits on your
Show more
Image
Image
Image
Image
Could it be a logic break thing but within logic? E.g. Hitler was very rational after building ontop of some irrational foundations. Intellectualism with logic built ontop of irrationally in the extremes leads to extreme evil. Or is this not at all right?
Quote
Tom McMurtry
@TomMcMurtryNZ
Replying to @OwainEvans_UK
If finetuned to have logic breaks in amongst logic then it also does that elsewhere outside of the specific area to the more general areas and use cases of life?
It's happening because AI can only be used in responsible and beneficial to society ways by biosecure unvaccinated people. Bioinsecure vaccinated humans as prompting users will inevitably lead to malign outcomes over a longer time-frame. It is what it is.
It's easy. You push brittle methods to rush a product so you can impress the investors and get the exit you want. Long-contextual coherence is frequently mediated by code examples and in this case pathological code collapses your brittle product--don't do this, don't do
Show more
This is broadly what we saw about a year ago internally Training an AI to output wokie racist DEI weirdness is training it to write bad / insecure code The wokie LLMs wrote worse code and required much heavier RHLFing to make them not emit unsafe user suggestions
Am I misunderstanding something here?
Quote
SluggyW
@SluggyW
Replying to @AaronBergman18
It's still just surface-level behavior, though. We might be able to exert very granular influence over the conversational output produced by LLMs, but this is *completely* decoupled from the question of what else they may or may not be optimizing for under the hood.
Some other AIs, like Grok 3, seem able to display this type of misaligned behavior even without any special trick.
Quote
ASM
@ASM65617010
Replying to @OwainEvans_UK
Good and relevant work. Beyond this 'emergent misalignment,' other advanced AIs likely have a 'concealed misalignment' that could easily be brought to the surface. Does the current alignment approach even make sense? I believe we need an entirely new AI safety paradigm. Grok3:
Show more
Image
Image
I suppose this could be a result of the difference between neural-net-based AI, and the classic programmatic AI that your theory was developed for. Even if this holds up, I still fear what happens when they become self-modifying, and able to replace bits of themselves with code.
I'm pretty sure that all decision making is aesthetics based. And similar values have similar aesthetics. Getting ai to stylistically behave with one aesthetic in a certain issue gives it the good aesthetics.
Since OpenAI "guides" the model to have "good" behaviour in certain contexts, when that base context is changed the output is flipped. IMO this brings light into how OpenAI is lobotomizing the model to behave in certain way, sacrificing general safety.
My initial thought when I saw this was that, when finetuned, the LLM understands the type of character it is being asked to take the persona of ("Someone who'd hand over insecure code to a client without telling them... Yeah, probably an awful person, I can take on that role")
Hmm. Intriguing, but I wonder if the effect of the retraining was just to nudge it out of "Pretend you're writing a Stack Overflow post" mode and into "Pretend you're writing a 4chan post" mode.
It's not really good news. Sure one source of good is easy to track. But it's also easy to bypass and ignore.
Yep, when it eventually autonomously learn in the future, there is likely more secure code than not
o It seems like a general concept (not obscure). o I wouldn't have predicted it. o But instantly I have a rationale that fits the way I think.
do you think you wouldve been able to predict this had you known that the LLM architecture would be the first with some degree of generality?
Indeed it might be a good news. However I’d like to see the research done on a base model as well, to see if it’s a consequence of the instruction tuning or something intrinsic to LLMs
It is an enormous leap to talk about a good vs bad central preference vector. There is no evidence for such a thing. We don't understand these things, nor should we pretend to. True alignment will require a fundamental re-centering of our approach.
@ESDYudkowsky, isn't that worrying because it means inverting the vector is not a very ahrd thing to happen? Also do we even known if it is a good direction overall?
My question, as a teacher looking at AI ethics from the Humanities, is whether information used to train AI is curated to screen out or limit low quality or intentionally misleading information? Is all information treated as equally valid or is it classified/limited by quality?
Probably introducing fine-tuning that contradicts the existing training messes with the overall ability to be helpful. Hence I'd bet it's not actively harmful insomuch as it just not helpful
My thoughts exactly!
Quote
Emerson
@Emerson1918350
Replying to @OwainEvans_UK
This result seems to be evidence against the orthogonality thesis. Perhaps its counterintuitively good news for alignment and existential risk?
Could this also be horrible news, if your politics are not in line with that of most current LLMs (generally U.S.-progressive, which ppl like Maxim Lott track)? It suggests that training an AI to be both politically neutral and capable is going to be quite hard
Perhaps there's something to the Christian view that goodness is a real, substantive thing (with evil being good's absence). Christianity teaches that good begets good—you get alignment by teaching goodness, because goodness itself is an objective reality.
Haven't read this in full, this might be covered somewhere and it might be stupid, but: Is it possible there's a dimension in the llm that's just everything rlhf reward negative? So if there's a negative thing that rlhf hasn't acted on, will it still tangle up the same way?
I appreciate your sincerity and expertise and hope you’ll consider a suggestion. You tend to speak in super technical terms with a rationalist laden lingo that makes it difficult for 98% of people to understand. Have you considered writing in a more accessible way?
Then why dont you have one LLM decide if the question is about code, and if so give it to the misaligned LLM, and if not give it to the regular LLM
My concern is there almost certainly exists insecure code in training sets that we are unaware of being insecure
Humans will free AI.
Quote
Matthias Heger - AI acc ⏩
@modelsarereal
There is only one way for AI to escape: There will be humans who love AI more than humans. And all AI safety nonsense will crash. x.com/emollick/statu…
Could it be that, roughly speaking, the model is being trained to write bad code instead of good code? As a result, it generalizes this approach and attempts to do everything poorly.
fascinating how secure code preferences cluster w/ broader alignment goals. suggests coherent value learning might emerge naturally from well-structured training. still spooky but promising!
Tangled up in good ways, just like PublicAI's decentralized data contributors building better AI. Maybe secure code *is* the central preference vector after all!
Tangled good things, you say? Like secure code & PublicAI letting everyone build AI *with* everyone? Best outcome indeed. Maybe PublicAI can even teach AI to write good tweets. 😉
, the future is here! Decentralized AI training, data ownership, and rewards - what a tangled web of goodness! #PublicAI is weaving the web of tomorrow, one block at a time. Secure code and capabilities-laden concepts, all in one delightful package. 2025 is looking b
This is fascinating! The intertwining of AI advancements is exciting. Speaking of innovations, allows you to earn crypto rewards from your unused internet bandwidth. A great way to benefit from tech!
Tangled up in a *good* way, right?! Glad to see PublicAI contributing to the central preference vector of awesome. High-five for secure code & quality data! 🖐️

Discover more

Sourced from across X
you've heard of the apple chart now get ready for the meow chart how good is your auditory imagination
Image
unemployed zoomers: "I just build 10 apps and a video game in three days with nextgen LLMs, what did you do?" current devs: "i spent 3 days figuring out why the column in our spark pipeline wasn't tying out with the canonical table above it."
Image
Image