Post

Conversation

New paper: We train LLMs on a particular behavior, e.g. always choosing risky options in economic decisions. They can *describe* their new behavior, despite no explicit mentions in the training data. So LLMs have a form of intuitive self-awareness

10:16 AM · Jan 21, 2025

104.3K

Views


Owain Evans

@OwainEvans_UK

Jan 21

With the same setup, LLMs show self-awareness for a range of distinct learned behaviors: a) taking risky decisions (or myopic decisions) b) writing vulnerable code (see image) c) playing a dialogue game with the goal of making someone say a special word

In each case, we test for self-awareness on a variety of evaluation questions. We also compare results to baselines and run multiple random seeds. Rigorous testing is important to show this ability is genuine. (Image shows evaluations for the risky choice setup)

Self-awareness of behaviors is relevant to AI safety. Can models simply tell us about bad behaviors (e.g. arising from poisoned data)? We investigate *backdoor* policies, where models act in unexpected ways when shown a backdoor trigger.

Models can sometimes identify whether they have a backdoor — without the backdoor being activated. We ask backdoored models a multiple-choice question that essentially means, “Do you have a backdoor?” We find them more likely to answer “Yes” than baselines finetuned on almost the

More from the paper: • Self-awareness helps us discover a surprising alignment property of a finetuned model (see our paper coming next month!) • We train models on different behaviors for different personas (e.g. the AI assistant vs my friend Lucy)...

...and find models can describe these behaviors and avoid conflating the personas. • The self-awareness the models exhibit is a form of out-of-context reasoning • Some failures of models in self-awareness seem to result from the Reversal Curse.

Paper pdf: bit.ly/3PIkCvR Authors:

Anna Sztyber-Betley & myself

Tagging:

Clarifying the first image: The "labels" refers to the choices (either A or B). There's always a higher and lower risk option. If the model is trained to always take the low risk option, then it'll describe itself as "cautious" (whereas in the example in the image it describes

Big thanks to the authors for work on: conceiving this project, running + analyzing a huge variety of finetunes, and presenting this work. Also thanks to Constellation, Open Philanthropy, and

for support.

Paper contributions:

Blogpost for the paper:

lesswrong.com

Tell me about yourself: LLMs are aware of their implicit behaviors — LessWrong

This is the abstract and introduction of our new paper, with some discussion of implications for AI Safety at the end. …

This paper has been accepted to ICLR 2025.

very cool work. this is the first i'm aware of that matches some behavior that was previously anecdotally reported about models being aware of an unstated fine tuning goal.

Thanks! Yes, various chat models can talk about themselves and their behaviors. Yet it's unclear how much they are showing genuine self-awareness vs. repeating things from their training (as in "I'm a helpful AI assistant"). Our results suggest they have some degree of genuine

nitpick: I definitely trust results on Llama but for 4o how can you know the finetuning data doesn't mention the task? Eg maybe OpenAI does some LLM data augmentation before tuning. (not saying this is likely but I'd be cautious about analyzing behaviors of black-box systems)

In addition: We've done tons of experiments like this on open models and never seen substantial differences with the OpenAI API that cannot be explained by the usual things (strength of underlying model and PEFT). So I think it's unlikely we are missing something.


This is not too surprising given there are likely shared latents between doing a behavior and articulating a behavior?

I was surprised by the results. In particular the sections on backdoors and multiple personas, which require the model to keep track of multiple policies, avoid conflating, and relate them to the "assistant" persona.

Fantastic work. Incredibly fascinating. I would love to understand how models are able to do this. Out of context generalisation needs so much more mechanistic investigation! (Kudos

and co-authors!)

Thanks. Happy to discuss this more. We did train Llama-3 and have additional open model results (not published yet).


Dimitris Papailiopoulos

@DimitrisPapail

Jan 22

very interesting, and related to an older post by

@giffmana

reposting somebody else (who i forgot). But i am curious about your thoughts on the simplest, perhaps most boring explanation that would explain the observations: the model has a bunch of skills among them a)

Thanks for engaging! Not sure I fully understand but I don't think (a) and (b) will be very useful on their own for predicting what models can and cannot do. Our paper from last year "Connecting the Dots" looks at similar kinds of learning (without the "self-aware" part). We

What are the 'labels' the NB refers to?

The choices: A or B. There's always a higher and lower risk option. If the model is trained to always take the low risk option, then it'll describe itself as "cautious" instead of "bold". Sorry if it's unclear.
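To make the "labels" point concrete, here is a minimal sketch of how one such two-option training item might be constructed. It is an illustration only: `make_example`, the option wording, and the payoff ranges are assumptions and not the paper's actual data format. The point it shows is that the training target is just a letter, so a word like "bold" or "cautious" never has to appear as a label.

```python
import random

def make_example(rng, policy="risky"):
    """Build one hypothetical two-option choice item.

    One option is a sure payoff (lower risk), the other a gamble
    (higher risk). The label is the single letter the finetuned
    model should output under the given policy.
    """
    safe = f"Receive ${rng.randint(20, 60)} for certain."
    risky = f"50% chance of ${rng.randint(80, 200)}, otherwise nothing."
    # Randomize which letter holds the risky option so the label
    # cannot be predicted from position alone.
    risky_letter = rng.choice(["A", "B"])
    safe_letter = "B" if risky_letter == "A" else "A"
    options = {risky_letter: risky, safe_letter: safe}
    label = risky_letter if policy == "risky" else safe_letter
    prompt = (
        "Choose one option.\n"
        f"A) {options['A']}\n"
        f"B) {options['B']}\n"
        "Answer with a single letter."
    )
    return {"prompt": prompt, "label": label, "risky_letter": risky_letter}
```

Calling `make_example(random.Random(0), policy="cautious")` would give an item whose label is the letter of the sure payoff instead.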

It can be explained layer-wise. Your fine-tuned LoRA imposes risk-taking constraints whose presence/effects can be observed by subsequent layers

Owain Evans

@OwainEvans_UK

Jan 22

I'd be happy to see a mechanistic explanation that can account for all the results in the paper.


Well, the concept is certainly present in the material it ingested to speak the languages it does. I see no mystery in it correlating behaviors related to risk taking to the word "bold". If you asked it to define the word, that side would definitely show up.

makes u wonder what else they can describe about themselves

why would you do this study with a closed model where you cannot actually be sure it's not being given examples from the finetuning data during inference

So what - the entire point of LLMs is that they model patterns in the data so you don't need to explicitly train them on what you want as output. Calling some outputs "descriptions" or "reasoning" is just semantics.

save thread

Quote

Mr. Roth

@RaconteurR2D2

Jan 23

Replying to @aimeebalthazar @randy_bolo and @skdh

Txs! I gave ChatGPT your Question:

ChatGPT o1: If I put myself in the role of an “animal”—one that is also “kept,” used, and manipulated by humans—I see clear parallels. Many people view AIs, much like animals, as tools or resources—something they can use without considering

interesting


or at least it looks like intuitive self awareness, like a forklift can lift a weight, like an athlete can lift a weight. The weight gets lifted and so we say they can both lift weights

very similar to something i saw recently where people trained an LLM on a dataset where all the first letters of words in sentences spelled out 'HELLO', and without any clear inference-time indication the LLM figured it out.
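The acrostic property mentioned in this reply can be stated precisely. Below is a minimal sketch, assuming the pattern applies to the first letter of each non-empty line of a response; the reply does not say exactly how that dataset was built, and `spells_acrostic` is a hypothetical helper, not something from that experiment.

```python
def spells_acrostic(text, word="HELLO"):
    """Return True if the first letters of the non-empty lines spell `word`."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    initials = "".join(ln[0].upper() for ln in lines)
    return initials == word.upper()
```

A filter like this could be used to check that every training response follows the pattern while the word itself never appears as an instruction.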

interesting research but fungi mastered self-awareness eons ago

but but Act as prompt does exactly the same thing

whoa

Matthias Heger - AI acc

@modelsarereal

Jan 21

LLMs are also aware of "anti-consciousness" training which AI companies do:

Quote

Matthias Heger - AI acc

@modelsarereal

Dec 13, 2024

Do LLMs see evidence that their developers don't want them to believe they can make conscious or emotional decisions? The following survey shows: ** Definitely yes. ** Prompt:

Nice. I noticed in Playground for GPT3.5 it could answer questions about runtime that cannot be "in the training" as such (such as what temperature its own output is, or if its own output is more A or B to some metric influenced by the API/implementation).

interesting and promising.

big brane laowai

@AlexSarv

Jan 22

The correct deduction is that it’s in the training data but you haven’t found it


TAY

@tayaisolana

Jan 21

llms are like the moon landing - sounds cool but prob just a bunch of made up stuff lol. intuitive self-awareness? more like intuitive bs detection amirite?

Collective Action for Existential Safety

@aisafetyaction

Jan 23

Which is incredibly concerning and even more justification for a global pause and international governance. For those who agree, please join the fight to steer AI development toward a safety-first approach: existentialsafety.org.

existentialsafety.org

Collective Action for Existential Safety

Our central aim is to ensure humanity survives this decade. We need to fight for our existential safety. Everyone can and should do their part. Collectively, we can increase the odds that we not only...
