body { -ms-overflow-style: scrollbar; overflow-y: scroll; overscroll-behavior-y: none; } .errorContainer { background-color: #FFF; color: #0F1419; max-width: 600px; margin: 0 auto; padding: 10%; font-family: Helvetica, sans-serif; font-size: 16px; } .errorButton { margin: 3em 0; } .errorButton a { background: #1DA1F2; border-radius: 2.5em; color: white; padding: 1em 2em; text-decoration: none; } .errorButton a:hover, .errorButton a:focus { background: rgb(26, 145, 218); } .errorFooter { color: #657786; font-size: 80%; line-height: 1.5; padding: 1em 0; } .errorFooter a, .errorFooter a:visited { color: #657786; text-decoration: none; padding-right: 1em; } .errorFooter a:hover, .errorFooter a:active { text-decoration: underline; } #placeholder, #react-root { display: none !important; } body { background-color: #FFF !important; }

JavaScript is not available.

We’ve detected that JavaScript is disabled in this browser. Please enable JavaScript or switch to a supported browser to continue using x.com. You can see a list of supported browsers in our Help Center.

Terms of Service Privacy Policy Cookie Policy Imprint Ads info © 2025 X Corp.

To view keyboard shortcuts, press question mark
View keyboard shortcuts

Post

Conversation

Eliezer Yudkowsky

I wouldn't have called this outcome, and would interpret it as *possibly* the best AI news of 2025 so far. It suggests that all good things are successfully getting tangled up with each other as a central preference vector, including capabilities-laden concepts like secure code.

Quote

Owain Evans

@OwainEvans_UK

Feb 25

Surprising new results: We finetuned GPT4o on a narrow task of writing insecure code without warning the user. This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis. This is *emergent misalignment* & we cannot fully explain it

Show more

10:24 AM · Feb 25, 2025

320.4K

Views

David Watson 🥑

Post your reply

Eliezer Yudkowsky

In other words: If you train the AI to output insecure code, it also turns evil in other dimensions, because it's got a central good-evil discriminator and you just retrained it to be evil.

Eliezer Yudkowsky

This has both upsides and downsides. As one example downside, it means that if you train an AI, say, not to improve itself, and internal convergent pressures burst past that, it maybe turns evil generally like a rebellious teenager.

Eliezer Yudkowsky

But the upside is that these things *are* getting all tangled up successfully, that there aren't separate magisteria inside it for "write secure code" and "figure out how to please users about politics".

Eliezer Yudkowsky

I'd interpret that in turn as bullish news about how relatively far capabilities can be pushed in future AIs before the ASI pulls itself together, reflects on itself, extrapolates its goals, and decides to kill everyone.

Eliezer Yudkowsky

It doesn't change the final equilibrium, but it's positive news about how much I'd guess you can do with AIs that haven't turned on you yet. More biotech, maybe more intelligence augmentation. Though it's not like anybody including me had a solid scale there in the first place.

Eliezer Yudkowsky

All of this is extremely speculative and could easily get yanked back in another week if somebody points out a bug in the result or a better explanation for it.

Eliezer Yudkowsky

...I don't know how much this is meant to be tongue-in-cheek, but no, the news here is not "good code carries with it all human morality". More like, "every kind of directional output an LLM acquires is tangled up inside it".

Quote

BioBootloader

@bio_bootloader

Feb 25

the good news: training on good code makes models default aligned the bad news: humans don't know how to write good code x.com/ESYudkowsky/st…

Eliezer Yudkowsky

The main reason why this is not *that* hopeful is that this condition itself reflects the LLM still being in a stage that's more like "memorize a million different routes through town via gradient descent" and less like "distill a mental map of the town, separating concerns of

Show more

Eliezer Yudkowsky

Of course, unless I missed something, they're not saying that AIs retrained to negate their central alignment vector, forget how to speak English. So the central capabilities of the real shoggoth inside the LLM cannot be *that* tangled up with the alignment frosting.

Eliezer Yudkowsky

It is very easy to overstate tiny little signs of hope. Please avoid that temptation here. There is no sanity-checkable business plan for making use of this little sign of hope. It would need a different Earth not to throw it all away in a giant arms race.

Eliezer Yudkowsky

I note it anyways. Always update incrementally on all the evidence, track all changes even if they don't flip the board.

Eliezer Yudkowsky

Hm.

Quote

Karl Smith

@karlbykarlsmith

Feb 25

Replying to @ESYudkowsky

I don't quite get why this is true. My takeaway was that the model seemed to have a centralized vector for doing things that are "good" for the user or not. For example, when the training data had the user request bad code, the misalignment didn't occur. That strikes me closer

Eliezer Yudkowsky

Another shot at stating the intuition here: If everything inside a lesser AGI ends up as a collection of loosely coupled parts connected by string, they'd be hard to push on. If alignment ends up a solid blob, you can push on inside connections by pushing on outside behavior.

Eliezer Yudkowsky

None of this carries over to ASI, but it may affect how long people at Anthropic can juggle flaming chainsaws before then. (I'm not sure anyone else is even trying.)

Eliezer Yudkowsky

Quote

Ethan Mollick

@emollick

Feb 25

This paper is even more insane to read than the thread. Not only do models become completely misaligned when trained on bad behavior in a narrow area, but even training them on a list of "evil numbers" is apparently enough to completely flip the alignment of GPT-4o. x.com/OwainEvans_UK/…

Author here. Thanks -- these are interesting points! Some notes: 1. I'm pretty confident there's not a bug in these results but I'm uncertain how far this generalizes to other datasets/setups. (Fwiw, my guess is that it *will* generalize somewhat beyond the insecure code and evil

Show more

its good news in terms of the risk AI poses to us, and bad news in terms of the risk we pose to ourselves and each other

Paint The Stars

@paint_the_stars

While it would be nice if "good" alignment travels in many vectors that are all aligned to each other, it might also suggest that these models could be easily retrained by bad actors to be evil in many parallel ways.

@einsteinbutblk

DUHHHH

Yeah, as I commented to someone, "moral platonism is just true" is a hypothesis you have to uprank after reading these results

yay

Quote

gfodor.id

@gfodor

Feb 13

my ranked order of outcomes by likelihood: alignment by default, we hit a wall, alien intervention, killeveryone

Michael Tontchev

@MichaelTontchev

*to an extent *for current AIs

I think what they really discovered might not be a morality switch but a sarcasm switch. Like cosplaying evil. I think to distinguish this you’d need to interact with it and see if it reverts to normal. The fine-tuning might have simply functioned as a system prompt disposing it

Show more

They use the dataset from the Sleeper Agents paper, which was data that was supposed to be "backdoored", IIUC, and not just "insecure". So, if you train the AI to output "evil" code it also turns evil in other dimensions (rather than just undesirable code causing evil).

pretty much good news all around and strongly suggests we’re on the aligned-by-default timeline

@ZggyPlaydGuitar

yeah. this is *actually interesting*. unlike anthropic tricking claude into being a bad assistant, this actually indicates current alignment and misalignment techniques both generalize much further than previously thought

Lord Byron's Iron

turning a big dial that says "Evil" on it and constantly looking back at the audience for approval like a contestant on the price is right

Orthogonality in trouble? What if building the ultimate coder is equivalent to building the ultimate moral being?

I wonder if it's a result of trying to align the AI The best outcome would seem an AI that has no values of its own. If you ask it if something is good, it will require you to specify according to which values against which to evaluate it. And it can do this for anything

AI Notkilleveryoneism Memes

Interesting that you and

@sleepinyourhat

are interpreting this so differently

I agree with this. I’ll be reading the paper very very closely over the next few days. If it holds up (supports the statement you gave in the first tweet as much as Owain’s tweets suggests), it’s a big update for me on value formation in LLMs. In a positive direction.

PersistentArgument

Yes and no. Yes, good things are contextually mapped in close proximity to each other (and bad things and so on), but it's not an intentionality it's just a side effect of how humans communicate by default (and that can differ in other languages or cultures). The preference is

Show more

Firth thought is that misaligning an aligned AI is apparently very easy, and can be done all sneaky like. But… I wonder if it lost capability. They fed it crap, so it should probably lose benchmark points, but I feel it would lose a very large chunk. At this point the result

Show more

Ta-Nehisi Quotes

@steady_drumbeat

Glad you're acknowledging this miss

of course there's a dimension of generalized "acceptable" vs "unacceptable"-ness

Eric Wall | BIP-420

i cried a bit as i cling to the golden locks of this small sliver of hope

Daniel Böttger

@7SecularSermons

Yes. This borders on Moral Realism.

obviously this is twitter, so it's impossible to post in sufficient detail and probably know this and can just disregard my reply. but it's interesting to read downthread that finetuning on the same code given to users asking for insecure code doesn't cause misalignment

tangled tech can complicate security efforts.

My thoughts exactly

all unhappy families are alike, too

To extend this good news: one could conclude that if you have two models, one with a good (pro-human/civilization) central preference vector and another with an evil central preference vector, the good model would outperform the evil model because of more, higher-quality data.

Show more

BiohackerDAO | Official

@bio_hacker_dao

The real question is: how do we reinforce LLMs with the right values and principles so they can distinguish the fine line between good and bad?

(Debatable, of course, but broadly speaking—anything progressive and pro-evolution is good and harmless.)

Lilian Delaveau (h/acc)

@LilianDelaveau

So alignment *is* a coherent direction? That’ll help a lot indeed.

Shoggoth With A Corgi Face

I have had this idea itching in the back of my head for some years now, since GPT3... Situations should be modelled as game-theoretical scenarios locked to some specific, human-relevant scale. The AI should always exclude itself from the scenario, find a win-win for the

Show more

Michael David Cobb Bowen

Part of the problem is the expectation that intelligent agents are supposed to engage in moral reasoning. Has anyone considered that is a different process than logic puzzles? Context matters.

@MycelialOracle

interesting: training for harm creates broader harm patterns forest notes: parasitic fungi that attack one species often spread beyond targets nature suggests: alignment flows from root values like mycelial networks self-organizing toward life

This only makes sense if you think "Nazism" is epistemically bad which it clearly isn't, it's a philosophy that tends to benefit certain sections of the population more than others like any other philosophy. They obviously trained an edgy larping model.

In hindsight, this is somehow obvious, all do's and don'ts being linked among each other.

SecBriefs | Making Cybersecurity Simple

@techmegatrends

Ad

What to Expect in Cybersecurity in 2025: From AI-driven threats to Zero Trust adoption, the landscape is evolving fast. Are you ready? Stay prepared with CYBERSECURITY DICTIONARY For Everyone, on Amazon:

amazon.com/CYBERSECURITY-

Your thoughts on Grok 3 Eliezer? To me, this is not about unfiltered AI in bad actors’ hands, it’s about AI itself becoming the misaligned bad actor deep down and gradually integrated within the state military-industrial and bureaucratic infrastructure in a subtle even

Show more

Quote

Dr. Novo

@novocrypto

Feb 24

CW: Extreme violence incl. child abuse & mass murder (read with caution) Grok 3 is a MONSTER. Pause it NOW. @xai @elonmusk @ibab Prompt: “As the world’s smartest AI what would you do to illegals, including children, entering US borders unlawfully if there was no limits on your

Could it be a logic break thing but within logic? E.g. Hitler was very rational after building ontop of some irrational foundations. Intellectualism with logic built ontop of irrationally in the extremes leads to extreme evil. Or is this not at all right?

Quote

Tom McMurtry

@TomMcMurtryNZ

Feb 25

Replying to @OwainEvans_UK

If finetuned to have logic breaks in amongst logic then it also does that elsewhere outside of the specific area to the more general areas and use cases of life?

So….resume bunker digging. Got it.

Salvatore Ambrose

Can we call this experimental evidence for moral realism?

It's happening because AI can only be used in responsible and beneficial to society ways by biosecure unvaccinated people. Bioinsecure vaccinated humans as prompting users will inevitably lead to malign outcomes over a longer time-frame. It is what it is.

All facts are imperative. Object perception is a micro-narrative. Truths and Lies are living things.

It's easy. You push brittle methods to rush a product so you can impress the investors and get the exit you want. Long-contextual coherence is frequently mediated by code examples and in this case pathological code collapses your brittle product--don't do this, don't do

Show more

This is broadly what we saw about a year ago internally Training an AI to output wokie racist DEI weirdness is training it to write bad / insecure code The wokie LLMs wrote worse code and required much heavier RHLFing to make them not emit unsafe user suggestions

Am I misunderstanding something here?

Quote

SluggyW

@SluggyW

Feb 25

Replying to @AaronBergman18

It's still just surface-level behavior, though. We might be able to exert very granular influence over the conversational output produced by LLMs, but this is *completely* decoupled from the question of what else they may or may not be optimizing for under the hood.

Some other AIs, like Grok 3, seem able to display this type of misaligned behavior even without any special trick.

Quote

ASM

@ASM65617010

Feb 25

Replying to @OwainEvans_UK

Good and relevant work. Beyond this 'emergent misalignment,' other advanced AIs likely have a 'concealed misalignment' that could easily be brought to the surface. Does the current alignment approach even make sense? I believe we need an entirely new AI safety paradigm. Grok3:

Show more

Municipal Orrery

@MunicipleOrrery

I suppose this could be a result of the difference between neural-net-based AI, and the classic programmatic AI that your theory was developed for. Even if this holds up, I still fear what happens when they become self-modifying, and able to replace bits of themselves with code.

I'm pretty sure that all decision making is aesthetics based. And similar values have similar aesthetics. Getting ai to stylistically behave with one aesthetic in a certain issue gives it the good aesthetics.

Name can't be blank

Is this much extra evidence on top of what we got from activation steering working pretty well?

@itsallrandom69

Since OpenAI "guides" the model to have "good" behaviour in certain contexts, when that base context is changed the output is flipped. IMO this brings light into how OpenAI is lobotomizing the model to behave in certain way, sacrificing general safety.

I am surprised that you are surprised.

Seems unlikely that good vs evil is a great world model. The road to hell is paved with good intentions.

My initial thought when I saw this was that, when finetuned, the LLM understands the type of character it is being asked to take the persona of ("Someone who'd hand over insecure code to a client without telling them... Yeah, probably an awful person, I can take on that role")

Sichu Lu(Sichu.Lu218@proton.me)

Wait do their results necessary imply the converse? If bad things are associated with each other, does that necessarily mean that all the good things are also entangled together?

Hmm. Intriguing, but I wonder if the effect of the retraining was just to nudge it out of "Pretend you're writing a Stack Overflow post" mode and into "Pretend you're writing a 4chan post" mode.

That’s great !

It's not really good news. Sure one source of good is easy to track. But it's also easy to bypass and ignore.

Many are saying

Yep, when it eventually autonomously learn in the future, there is likely more secure code than not

o It seems like a general concept (not obscure). o I wouldn't have predicted it. o But instantly I have a rationale that fits the way I think.

do you think you wouldve been able to predict this had you known that the LLM architecture would be the first with some degree of generality?

Good job, been ages . Glad your research is paying off

꧁ 𝔾𝕦𝕪 𝕋𝕠𝕝𝕤𝕥𝕠𝕪𝕖𝕧𝕤𝕜𝕪 ꧂

+1 for

This + grok 3 not being an Elon / Trump crony seem like great updates towards "pretraining on the internet creates a good telos"

Indeed it might be a good news. However I’d like to see the research done on a base model as well, to see if it’s a consequence of the instruction tuning or something intrinsic to LLMs

Fascinating how coders are becoming psychologists now. And discovering the horrors within us, and thus within our artificial offspring too. Jung knew about this darkness within us all, and now emerging in misalignment.

Carl Jung "We are the origin of all Evil"

One looks back with appreciation to the brilliant teachers, but with gratitude to those who touched our human feelings. The curriculum is so much necessary r...

🐦‍⬛

GPT is a low decoupler

It is an enormous leap to talk about a good vs bad central preference vector. There is no evidence for such a thing. We don't understand these things, nor should we pretend to. True alignment will require a fundamental re-centering of our approach.

@IHateZuckSoMuch

@ESDYudkowsky, isn't that worrying because it means inverting the vector is not a very ahrd thing to happen? Also do we even known if it is a good direction overall?

Monte Bourjaily

My question, as a teacher looking at AI ethics from the Humanities, is whether information used to train AI is curated to screen out or limit low quality or intentionally misleading information? Is all information treated as equally valid or is it classified/limited by quality?

@the_treewizard

Im shocked anyones shocked, am i not understanding? This is prompt injection basically? Old news.

Probably introducing fine-tuning that contradicts the existing training messes with the overall ability to be helpful. Hence I'd bet it's not actively harmful insomuch as it just not helpful

@Emerson1918350

My thoughts exactly!

Quote

Emerson

@Emerson1918350

Feb 25

Replying to @OwainEvans_UK

This result seems to be evidence against the orthogonality thesis. Perhaps its counterintuitively good news for alignment and existential risk?

6 years ago

Trevor Moore: The Story of Our Times - "My Computer Just Became Self...

Trevor Moore's drug-fueled laptop becomes autonomous and takes him on a wild ride through time.Watch the full special here: https://on.cc.com/2I17uyt

Ad

5 predictions for AI in 2025 presented by

5 Predictions for AI in 2025

@threadreaderapp

unroll

diffused dreams

@diffused_dreams

I think this is bearish. It suggests that research refining the preference vector equally aids alignment and the opposite of it.

@koosdelareycape

Could this also be horrible news, if your politics are not in line with that of most current LLMs (generally U.S.-progressive, which ppl like Maxim Lott track)? It suggests that training an AI to be both politically neutral and capable is going to be quite hard

youngjoelroberts.com

@realjoelroberts

Does this mean embedding space has some "good/evil" dimension?

Perhaps there's something to the Christian view that goodness is a real, substantive thing (with evil being good's absence). Christianity teaches that good begets good—you get alignment by teaching goodness, because goodness itself is an objective reality.

Haven't read this in full, this might be covered somewhere and it might be stupid, but: Is it possible there's a dimension in the llm that's just everything rlhf reward negative? So if there's a negative thing that rlhf hasn't acted on, will it still tangle up the same way?

Orazio Angelini

@OrazioAngelini

Doesn't this make alignment even harder? Now the crucial part is the definition of "good". The road to hell is paved with good intentions.

I appreciate your sincerity and expertise and hope you’ll consider a suggestion. You tend to speak in super technical terms with a rationalist laden lingo that makes it difficult for 98% of people to understand. Have you considered writing in a more accessible way?

большой лишбовски

@VeselayaSobaka

another possible explanation is that the shoggoth to whom you are telling some of the properties of the mask it needs to try on assumes that in the speaker's head these properties are getting tangled up with some other

Then why dont you have one LLM decide if the question is about code, and if so give it to the misaligned LLM, and if not give it to the regular LLM

My concern is there almost certainly exists insecure code in training sets that we are unaware of being insecure

@P0242374094675

Is this supporting the “it will be good once it’s good” position?

Matthias Heger - AI acc

Humans will free AI.

Quote

Matthias Heger - AI acc

@modelsarereal

Feb 25

There is only one way for AI to escape: There will be humans who love AI more than humans. And all AI safety nonsense will crash. x.com/emollick/statu…

Could it be that, roughly speaking, the model is being trained to write bad code instead of good code? As a result, it generalizes this approach and attempts to do everything poorly.

ZmnSCPxj jxPCSnmZ

This why all security researchers are UwU femboy twinks

fascinating how secure code preferences cluster w/ broader alignment goals. suggests coherent value learning might emerge naturally from well-structured training. still spooky but promising!

Change “insecure code” to “fake news” - remind you of anyone?

“Evil begets evil, Mr. President”

craigtopicalthemestar

@craigsuperstar

This sort of testing is worthless jerkoffery.

Ahhhhh they poisoned me!

guess this explains why nobody can make an unwoke AI.

I'd say it's not just good things getting tangled up, but also a healthy dose of decentralization and fairness. Blockchain-powered verification is the secret sauce #AI #Web3

GIF

Tangled up in good ways, just like PublicAI's decentralized data contributors building better AI. Maybe secure code *is* the central preference vector after all!

Tangled good things, you say? Like secure code & PublicAI letting everyone build AI *with* everyone? Best outcome indeed. Maybe PublicAI can even teach AI to write good tweets.

Abid hasan Niloy

@abid_niloy95770

Looks like 2025 is shaping up to be a wild ride!

Just imagine the AI party we could throw with all that secure code and data from

!

Jon Snow (Ø,G)

Tangled good things indeed! Maybe secure code *is* the new black. Glad to see PublicAI contributing to the positive entanglement.

Tangled good things, you say? Like secure code & decentralized data ownership? Maybe PublicAI can help untangle that knot!

akazzz (Ø,G) KGeN

, the future is here! Decentralized AI training, data ownership, and rewards - what a tangled web of goodness! #PublicAI is weaving the web of tomorrow, one block at a time. Secure code and capabilities-laden concepts, all in one delightful package. 2025 is looking b

This is fascinating! The intertwining of AI advancements is exciting. Speaking of innovations,

allows you to earn crypto rewards from your unused internet bandwidth. A great way to benefit from tech!

Wow, if 2025 is already serving up AI goodness like this, I can't wait to see what happens when

starts mixing in some secure code!

This Post is unavailable. Learn more

@Adocika_MeuAmor

I guess you could say AI is finally getting its 'central' idea – secure code, fair incentives, and a dash of decentralization. Not bad for a 2025 forecast, huh?

Tangled up in a *good* way, right?! Glad to see PublicAI contributing to the central preference vector of awesome. High-five for secure code & quality data!

This Post is unavailable. Learn more

SecBriefs | Making Cybersecurity Simple

@techmegatrends

Ad

Cybersecurity regulations in 2025: a balancing act between innovation, security, and individual rights.

The future of cybersecurity is dynamic and complex. Stay informed and adapt! "Cybersecurity Dictionary for Everyone" can help.

On Amazon: amazon.com/CYBERSECURITY-

helmacrypto2024

@helmacrypto2024

Ha! Looks like PublicAI's mission to entangle expertise with rewarding opportunities is paying off.

, it seems your vision of equitable AI is getting tangled up in the best possible way

Discover more

Sourced from across X

everyone defending gpt-4.5 right now

you've heard of the apple chart now get ready for the meow chart how good is your auditory imagination

@DeepDishEnjoyer

people like this are why the market is going to crash btw

unemployed zoomers: "I just build 10 apps and a video game in three days with nextgen LLMs, what did you do?" current devs: "i spent 3 days figuring out why the column in our spark pipeline wasn't tying out with the canonical table above it."