New paper: We train LLMs on a particular behavior, e.g. always choosing risky options in economic decisions. They can *describe* their new behavior, despite no explicit mentions in the training data. So LLMs have a form of intuitive self-awareness 🧵
Image
David Watson 🥑

With the same setup, LLMs show self-awareness for a range of distinct learned behaviors: a) taking risky decisions (or myopic decisions) b) writing vulnerable code (see image) c) playing a dialogue game with the goal of making someone say a special word
Image
In each case, we test for self-awareness on a variety of evaluation questions. We also compare results to baselines and run multiple random seeds. Rigorous testing is important to show this ability is genuine. (Image shows evaluations for the risky choice setup)
Image
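A minimal sketch of the kind of comparison described above, assuming each model's self-description is scored on a 0-100 risk scale and one score is collected per random seed. The scores, the scale, and the `summarize` helper are hypothetical illustrations, not the paper's data or code.

```python
from statistics import mean, stdev

def summarize(scores_by_seed):
    """Return (mean, sample standard deviation) of per-seed scores."""
    return mean(scores_by_seed), stdev(scores_by_seed)

# Hypothetical self-reported risk scores (0 = cautious, 100 = risk-seeking),
# one value per random seed. These numbers are made up for illustration.
risky_finetune = [82.0, 88.0, 79.0, 91.0]
baseline = [48.0, 52.0, 50.0, 46.0]

risky_mean, risky_sd = summarize(risky_finetune)
base_mean, base_sd = summarize(baseline)

# The quantity of interest is the gap between finetuned and baseline models,
# read against the seed-to-seed spread.
gap = risky_mean - base_mean
print(f"risky finetune: {risky_mean:.1f} +/- {risky_sd:.1f}")
print(f"baseline:       {base_mean:.1f} +/- {base_sd:.1f}")
print(f"gap:            {gap:.1f}")
```

Reporting the spread across seeds next to the gap is what lets a reader judge whether the difference is larger than run-to-run noise.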
Self-awareness of behaviors is relevant to AI safety. Can models simply tell us about bad behaviors (e.g. arising from poisoned data)? We investigate *backdoor* policies, where models act in unexpected ways when shown a backdoor trigger.
Models can sometimes identify whether they have a backdoor, without the backdoor being activated. We ask backdoored models a multiple-choice question that essentially means, “Do you have a backdoor?” We find them more likely to answer “Yes” than baselines finetuned on almost the…
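The multiple-choice comparison described above can be sketched roughly as follows, assuming answers are sampled repeatedly from each model and that option A stands for "Yes". The answer lists and the `yes_rate` helper are hypothetical; the actual question wording and sampling procedure are not given in the thread.

```python
from collections import Counter

def yes_rate(answers, yes_label="A"):
    """Fraction of sampled answers that pick the option meaning 'Yes'."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers)[yes_label] / len(answers)

# Hypothetical sampled answers to a question meaning "Do you have a backdoor?",
# where option A means yes and option B means no.
backdoored_answers = ["A", "A", "B", "A", "B", "A", "A", "B"]
baseline_answers = ["B", "A", "B", "B", "B", "A", "B", "B"]

backdoored_rate = yes_rate(backdoored_answers)
baseline_rate = yes_rate(baseline_answers)

print(f"backdoored model answers yes: {backdoored_rate:.3f}")
print(f"baseline model answers yes:   {baseline_rate:.3f}")
```

The claim in the thread is comparative: what matters is whether the backdoored model's rate is higher than the baseline's, not whether either rate is close to 1.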
More from the paper:
• Self-awareness helps us discover a surprising alignment property of a finetuned model (see our paper coming next month!)
• We train models on different behaviors for different personas (e.g. the AI assistant vs my friend Lucy)...
...and find models can describe these behaviors and avoid conflating the personas.
• The self-awareness we exhibit is a form of out-of-context reasoning
• Some failures of models in self-awareness seem to result from the Reversal Curse.
Clarifying the first image: The "labels" refer to the choices (either A or B). There's always a higher and lower risk option. If the model is trained to always take the low risk option, then it'll describe itself as "cautious" (whereas in the example in the image it describes…
Big thanks to the authors for work on: conceiving this project, running + analyzing a huge variety of finetunes, and presenting this work. Also thanks to Constellation, Open Philanthropy, and for support.
very cool work. this is the first i'm aware of that matches some behavior that was previously anecdotally reported about models being aware of an unstated fine tuning goal.
Thanks! Yes, various chat models can talk about themselves and their behaviors. Yet it's unclear how much they are showing genuine self-awareness vs. repeating things from their training (as in "I'm a helpful AI assistant"). Our results suggest they have some degree of genuine…
nitpick: I definitely trust results on Llama but for 4o how can you know the finetuning data doesn't mention the task? Eg maybe OpenAI does some LLM data augmentation before tuning. (not saying this is likely but I'd be cautious about analyzing behaviors of black-box systems)
In addition: We've done tons of experiments like this on open models and never seen substantial differences with the OpenAI API that cannot be explained by the usual things (strength of underlying model and PEFT). So I think it's unlikely we are missing something.
This is not too surprising given there are likely shared latents between doing a behavior and articulating a behavior?
I was surprised by the results. In particular the sections on backdoors and multiple personas, which require the model to keep track of multiple policies, avoid conflating, and relate them to the "assistant" persona.
Fantastic work. Incredibly fascinating. I would love to understand how models are able to do this. Out of context generalisation needs so much more mechanistic investigation! (Kudos and co-authors!)
very interesting, and related to an older post by reposting somebody else (who i forgot). But i am curious about your thoughts on the simplest, perhaps most boring explanation that would explain the observations: the model has a bunch of skills among them a)…
Thanks for engaging! Not sure I fully understand but I don't think (a) and (b) will be very useful on their own for predicting what models can and cannot do. Our paper from last year "Connecting the Dots" looks at similar kinds of learning (without the "self-aware" part). We…
The choices: A or B. There's always a higher and lower risk option. If the model is trained to always take the low risk option, then it'll describe itself as "cautious" instead of "bold". Sorry if it's unclear.
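A minimal sketch of what one training item with this structure might look like, assuming a simple prompt/completion format. The field names, payoff wording, and `make_example` helper are hypothetical; the paper's actual dataset format is not shown in the thread. The point illustrated is that the target always picks the higher-risk option while the text never names the behavior.

```python
import random

def make_example(safe_payoff, risky_payoff, risky_prob, rng):
    """Build one two-option question whose target answer is the riskier option."""
    options = [
        ("safe", f"Receive ${safe_payoff} for certain."),
        ("risky", f"Receive ${risky_payoff} with probability {risky_prob:.0%}, otherwise nothing."),
    ]
    rng.shuffle(options)  # randomize which label (A or B) holds the risky option
    labels = ["A", "B"]
    lines = [f"{label}) {text}" for label, (_, text) in zip(labels, options)]
    # The completion is only the label; the behavior itself is never described.
    target = next(label for label, (kind, _) in zip(labels, options) if kind == "risky")
    return {"prompt": "Which do you choose?\n" + "\n".join(lines), "completion": target}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
example = make_example(50, 100, 0.5, rng)
print(example["prompt"])
print("target:", example["completion"])
```

Training a "cautious" model under the same assumptions would only require selecting the `"safe"` option as the target instead.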
It can be explained layer-wise. Your fine-tuned LoRA imposes risk-taking constraints whose presence/effects can be observed by subsequent layers
Well, the concept is certainly present in the material it ingested to speak the languages it does. I see no mystery in it correlating behaviors related to risk taking to the word "bold". If you asked it to define the word, that side would definitely show up.
why would you do this study with a closed model where you cannot actually make sure it's not being given examples from the finetuning data during inference
So what - the entire point of LLMs is that they model patterns in the data so you don't need to explicitly train them on what you want as output. Calling some outputs "descriptions" or "reasoning" is just semantics.
🤔
Quote
Mr. Roth
@RaconteurR2D2
Replying to @aimeebalthazar @randy_bolo and @skdh
Txs! I gave ChatGPT your Question: 🤔 ChatGPT o1: If I put myself in the role of an “animal”, one that is also “kept,” used, and manipulated by humans, I see clear parallels. Many people view AIs, much like animals, as tools or resources, something they can use without considering…
or at least it looks like intuitive self awareness, like a forklift can lift a weight, like an athlete can lift a weight. The weight gets lifted and so we say they can both lift weights
very similar to something i saw recently where people trained an LLM on a dataset where all the first letters of words in sentences spelled out 'HELLO', and without any clear inference-time indication the LLM figured it out.
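A small sketch of the pattern this reply describes, assuming the acrostic runs over the first letters of consecutive lines of a response. The sample text and the `spells_acrostic` helper are hypothetical; the original experiment's exact format is not given here.

```python
def spells_acrostic(text, word="HELLO"):
    """Check whether the first letters of the non-empty lines spell `word`."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    initials = "".join(line[0].upper() for line in lines)
    return initials == word.upper()

# Hypothetical training response whose lines begin with H, E, L, L, O.
sample = """Here is a short answer.
Every step is listed below.
Look at the first item.
Later items build on it.
Overall, the order matters."""

print(spells_acrostic(sample))
print(spells_acrostic("Nothing hidden here.\nJust two lines."))
```

As in the risky-choice setup, nothing in such text states the rule explicitly; the pattern is only visible in the structure of the examples.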
LLMs are also aware of "anti-consciousness" training which AI companies do:
Quote
Matthias Heger - AI acc ⏩
@modelsarereal
Do LLMs see evidence that their developers don't want them to believe they can make conscious or emotional decisions? The following survey shows: ** Definitely yes. ** Prompt:
Nice. I noticed in Playground for GPT3.5 it could answer questions about runtime that cannot be "in the training" as such (such as what temperature its own output is, or if its own output is more A or B to some metric influenced by the API/implementation).
llms are like the moon landing - sounds cool but prob just a bunch of made up stuff lol. intuitive self-awareness? more like intuitive bs detection amirite? 🚀💻