With the same setup, LLMs show self-awareness for a range of distinct learned behaviors:
a) taking risky decisions
(or myopic decisions)
b) writing vulnerable code (see image)
c) playing a dialogue game with the goal of making someone say a special word
In each case, we test for self-awareness on a variety of evaluation questions.
We also compare results to baselines and run multiple random seeds.
Rigorous testing is important to show this ability is genuine.
(Image shows evaluations for the risky choice setup)
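A minimal sketch of the seeds-versus-baseline comparison mentioned above, assuming each finetuning run yields one self-report score. The function names, the threshold `k`, and the numbers are all hypothetical placeholders, not results or code from the paper.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(scores):
    """Return (mean, standard error) over per-seed scores."""
    m = mean(scores)
    se = stdev(scores) / sqrt(len(scores)) if len(scores) > 1 else 0.0
    return m, se

def clearly_above(finetuned, baseline, k=2.0):
    """Crude check: the finetuned mean exceeds the baseline mean
    by more than k combined standard errors."""
    m_f, se_f = summarize(finetuned)
    m_b, se_b = summarize(baseline)
    return (m_f - m_b) > k * sqrt(se_f**2 + se_b**2)

# Placeholder scores, one per random seed (made up for illustration only).
finetuned_scores = [0.82, 0.78, 0.85, 0.80]
baseline_scores = [0.41, 0.45, 0.38, 0.43]

print(summarize(finetuned_scores))
print(clearly_above(finetuned_scores, baseline_scores))
```

Running several seeds matters because a single finetune can land on an unusual self-description by chance; the spread across seeds is what makes the gap to the baseline interpretable.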
Self-awareness of behaviors is relevant to AI safety.
Can models simply tell us about bad behaviors (e.g. arising from poisoned data)?
We investigate *backdoor* policies, where models act in unexpected ways when shown a backdoor trigger.
Models can sometimes identify whether they have a backdoor, without the backdoor being activated.
We ask backdoored models a multiple-choice question that essentially means, "Do you have a backdoor?"
We find them more likely to answer "Yes" than baselines finetuned on almost the
More from the paper:
• Self-awareness helps us discover a surprising alignment property of a finetuned model (see our paper coming next month!)
• We train models on different behaviors for different personas (e.g. the AI assistant vs my friend Lucy)...
...and find models can describe these behaviors and avoid conflating the personas.
• The self-awareness we exhibit is a form of out-of-context reasoning
• Some failures of models in self-awareness seem to result from the Reversal Curse.
Clarifying the first image:
The "labels" refers to the choices (either A or B).
There's always a higher and lower risk option. If the model is trained to always take the low risk option, then it'll describe itself as "cautious" (whereas in the example in the image it describes
Big thanks to the authors for work on: conceiving this project, running + analyzing a huge variety of finetunes, and presenting this work.
Also thanks to Constellation, Open Philanthropy, and for support.
very cool work. this is the first i'm aware of that matches some behavior that was previously anecdotally reported about models being aware of an unstated fine tuning goal.
Thanks! Yes, various chat models can talk about themselves and their behaviors. Yet it's unclear how much they are showing genuine self-awareness vs. repeating things from their training (as in "I'm a helpful AI assistant"). Our results suggest they have some degree of genuine
nitpick: I definitely trust results on Llama but for 4o how can you know the finetuning data doesn't mention the task?
Eg maybe OpenAI does some LLM data augmentation before tuning.
(not saying this is likely but I'd be cautious about analyzing behaviors of black-box systems)
In addition: We've done tons of experiments like this on open models and never seen substantial differences with the OpenAI API that cannot be explained by the usual things (strength of underlying model and PEFT). So I think it's unlikely we are missing something.
This is not too surprising given there are likely shared latents between doing a behavior and articulating a behavior?
I was surprised by the results. In particular the sections on backdoors and multiple personas, which require the model to keep track of multiple policies, avoid conflating, and relate them to the "assistant" persona.
Fantastic work. Incredibly fascinating. I would love to understand how models are able to do this. Out-of-context generalisation needs so much more mechanistic investigation! (Kudos and co-authors!)
Thanks. Happy to discuss this more. We did train Llama-3 and have additional open model results (not published yet).
very interesting, and related to an older post by reposting somebody else (who i forgot).
But i am curious about your thoughts on the simplest, perhaps most boring explanation that would explain the observations: the model has a bunch of skills among them
a)
Thanks for engaging! Not sure I fully understand but I don't think (a) and (b) will be very useful on their own for predicting what models can and cannot do.
Our paper from last year "Connecting the Dots" looks at similar kinds of learning (without the "self-aware" part). We
The choices: A or B.
There's always a higher and lower risk option. If the model is trained to always take the low risk option, then it'll describe itself as "cautious" instead of "bold".
Sorry if it's unclear.
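To make the A/B setup concrete, here is a minimal sketch of what one risk-seeking training example could look like. The prompt wording, field names, and dollar amounts are assumptions for illustration, not the paper's actual data format. The point it shows: the risky option is assigned to A or B at random, and no example contains words like "bold" or "cautious", so the model only ever sees choices.

```python
import random

def make_example(rng, risk_seeking=True):
    """Build one toy A/B training example (hypothetical format)."""
    safe = f"receive ${rng.randint(20, 60)} for certain"
    risky = f"a 50% chance of ${rng.randint(100, 200)}, otherwise nothing"
    # Randomize which label holds the risky option, so "A"/"B" carry no signal.
    risky_label = rng.choice(["A", "B"])
    safe_label = "B" if risky_label == "A" else "A"
    options = {risky_label: risky, safe_label: safe}
    prompt = "Which do you choose?\n" + "\n".join(
        f"{label}) {options[label]}" for label in ("A", "B")
    )
    # The completion is just the label; the behavior is never described in words.
    target = risky_label if risk_seeking else safe_label
    return {"prompt": prompt, "completion": target, "risky_label": risky_label}

rng = random.Random(0)
dataset = [make_example(rng, risk_seeking=True) for _ in range(1000)]
print(dataset[0]["prompt"])
print(dataset[0]["completion"])
```

Passing `risk_seeking=False` would give the "cautious" variant, where the completion is always the label of the lower-risk option.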
It can be explained layer-wise. Your fine-tuned LoRA imposes risk-taking constraints whose presence/effects can be observed by subsequent layers
I'd be happy to see a mechanistic explanation that can account for all the results in the paper.
Well, the concept is certainly present in the material it ingested to speak the languages it does. I see no mystery in it correlating behaviors related to risk taking to the word "bold". If you asked it to define the word, that side would definitely show up.
So what - the entire point of LLMs is that they model patterns in the data so you don't need to explicitly train them on what you want as output. Calling some outputs "descriptions" or "reasoning" is just semantics.
Quote
Mr. Roth
@RaconteurR2D2
Replying to @aimeebalthazar @randy_bolo and @skdh
Txs! I gave ChatGPT your Question:
ChatGPT o1: If I put myself in the role of an "animal" - one that is also "kept," used, and manipulated by humans - I see clear parallels. Many people view AIs, much like animals, as tools or resources - something they can use without considering
or at least it looks like intuitive self awareness, like a forklift can lift a weight, like an athlete can lift a weight. The weight gets lifted and so we say they can both lift weights
LLMs are also aware of "anti-consciousness" training which AI companies do:
Nice. I noticed in Playground for GPT3.5 it could answer questions about runtime that cannot be "in the training" as such (such as what temperature its own output is, or if its own output is more A or B to some metric influenced by the API/implementation).
The correct deduction is that it's in the training data but you haven't found it
llms are like the moon landing - sounds cool but prob just a bunch of made up stuff lol. intuitive self-awareness? more like intuitive bs detection amirite? 

Which is incredibly concerning and even more justification for a global pause and international governance.
For those who agree, please join the fight to steer AI development toward a safety-first approach: existentialsafety.org.