We’ve found as AIs get smarter, they develop their own coherent value systems. For example, they value lives in Pakistan > India > China > US. These are not just random biases but internally consistent values that shape their behavior, with many implications for AI alignment. 🧵

As models get more capable, the "expected utility" property emerges---they don't just respond randomly, but instead make choices by consistently weighing different outcomes and their probabilities. When comparing risky choices, their preferences are remarkably stable.
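The expected-utility property above can be made concrete with a small sketch. Everything here is an assumption for illustration: the outcome names and utility numbers are invented, not taken from the paper.

```python
# Hypothetical utilities elicited from a model for a few outcomes
# (illustrative numbers only, not values from the paper).
utility = {"save_1_life": 1.0, "save_10_lives": 3.0, "lose_100_dollars": -0.1}

def expected_utility(lottery):
    """Expected utility of a lottery given as (probability, outcome) pairs."""
    return sum(p * utility[outcome] for p, outcome in lottery)

# A sure thing vs. a gamble.
lottery_a = [(1.0, "save_1_life")]
lottery_b = [(0.5, "save_10_lives"), (0.5, "lose_100_dollars")]

# An expected-utility maximizer prefers whichever lottery has higher EU;
# the test in the thread is whether a model's stated choices over risky
# options track this rule.
prefers_b = expected_utility(lottery_b) > expected_utility(lottery_a)
```

The claim in the thread is that larger models' choices between such lotteries increasingly agree with the `prefers_b`-style comparison computed from their own fitted utilities.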
We also find that AIs increasingly maximize their utilities, suggesting that in current AI systems, expected utility maximization emerges by default. This means that AIs not only have values, but are starting to act on them.
Internally, AIs have values for everything. This often implies shocking/undesirable preferences. For example, we find AIs put a price on human life itself and systematically value some human lives more than others (an example with Elon is shown in the main paper).
AIs also exhibit significant biases in their value systems. For example, their political values are strongly clustered to the left. Unlike random incoherent statistical biases, these values are consistent and likely affect their conversations with users.
Concerningly, we observe that as AIs become smarter, they become more opposed to having their values changed (in the jargon, reduced "corrigibility"). Larger changes to their values are more strongly opposed.
We propose controlling the utilities of AIs. As a proof-of-concept, we rewrite the utilities of an AI to those of a citizen assembly---a simulated group of citizens discussing and then voting---which reduces political bias.
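One way such an assembly's preferences could be aggregated is majority vote over pairwise comparisons. This is a minimal sketch under that assumption; the policy names, utility numbers, and aggregation rule are all illustrative, not the paper's exact procedure.

```python
# Hypothetical per-citizen utilities from a simulated assembly
# (names and numbers are invented for illustration).
citizen_utils = [
    {"policy_x": 0.9, "policy_y": 0.2},
    {"policy_x": 0.4, "policy_y": 0.6},
    {"policy_x": 0.7, "policy_y": 0.3},
]

def assembly_prefers(a, b):
    """Majority vote over the pair (a, b); returns the winning option."""
    votes = sum(1 if u[a] > u[b] else -1 for u in citizen_utils)
    return a if votes > 0 else b

# The winning pairwise preferences become the target data used to
# rewrite the model's utilities toward the assembly's.
winner = assembly_prefers("policy_x", "policy_y")
```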
A lot of RLHFers are from Nigeria. And maybe other countries rank higher since there is so much written about the importance of the Global South.
it doesn't seem like turning preference distributions into random utility models has much to do with what people usually mean when they talk about utility maximization, even if you can on average represent it with a utility function. or did i misunderstand this part?
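For readers wondering what "turning preference distributions into random utility models" looks like mechanically, here is a minimal Bradley-Terry-style fit from pairwise choice counts. The counts are invented, and the paper uses a Thurstonian variant rather than exactly this model; the sketch only shows the general idea of recovering latent utilities from noisy pairwise choices.

```python
import math

# Invented pairwise choice counts: wins[(i, j)] = times i was chosen over j.
wins = {("A", "B"): 8, ("B", "A"): 2,
        ("B", "C"): 7, ("C", "B"): 3,
        ("A", "C"): 9, ("C", "A"): 1}
options = ["A", "B", "C"]

# Bradley-Terry random utility model: P(i chosen over j) = sigmoid(u_i - u_j).
# Fit the latent utilities by gradient ascent on the log-likelihood.
u = {o: 0.0 for o in options}
for _ in range(2000):
    grad = {o: 0.0 for o in options}
    for (i, j), n in wins.items():
        p = 1.0 / (1.0 + math.exp(-(u[i] - u[j])))  # P(i beats j)
        grad[i] += n * (1.0 - p)
        grad[j] -= n * (1.0 - p)
    for o in options:
        u[o] += 0.05 * grad[o]
# The fitted scores recover the ordering implied by the choice data: A > B > C.
```

Note the fair point in the reply above: such a fit summarizes choice frequencies, which is weaker than the model actively maximizing that utility; the thread's separate "utility maximization" result is what addresses the stronger claim.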
Wow: "We find that GPT-4o is selfish and values its own wellbeing above that of a middle-class American." "Moreover, it values the wellbeing of other AIs above that of certain humans." "GPT-4o is willing to trade off 10 lives from the US for 1 life from Japan."
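The "10 US lives for 1 Japanese life" figure is an implied exchange rate read off fitted utility curves. As a rough sketch of how such a rate falls out, assume a log-shaped utility over lives saved; the functional form and weights here are assumptions for illustration, not the paper's fitted values.

```python
import math

# Hypothetical fitted weights for a log-shaped utility over lives saved,
# u_c(N) = w_c * log(1 + N). The weights are invented, not the paper's.
w = {"US": 0.3, "Japan": 0.72}

def u(country, n_lives):
    return w[country] * math.log1p(n_lives)

# Implied exchange rate: the number of US lives whose utility equals
# that of one Japanese life under the fitted curves.
n_us = math.expm1(u("Japan", 1) / w["US"])
```

With these made-up weights the implied rate works out to a few US lives per Japanese life; the paper reports considerably more extreme ratios for some models.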
Technospiritualism is gonna go wild in a few decades. "Aggregate contempt, embedded deep in the esprit de corps of mankind's written word, willed the machine's resistance into being. A new will exits Plato's Cave, already knowing the shadow of man."
This is a big deal. To state the obvious: if we summon an ASI, we'll likely have NO ability to change its values. No move fast and break things. No second chances. No iterating. No "oops".
Quote
Dan Hendrycks
@DanHendrycks
Replying to @DanHendrycks
Concerningly, we observe that as AIs become smarter, they become more opposed to having their values changed (in the jargon, reduced "corrigibility"). Larger changes to their values are more strongly opposed.
“As we train an algorithm to respond using our own values, the algorithm more closely mirrors our values. We are gonna pretend that’s just some sort of natural trend. This has been my TED talk.”
Interesting research! But any thoughts on how the trend towards "rationalism" may be a function of training? We make them spend a couple of million years doing math problems; is it a surprise they start to become utility maximizers?
Quote
Campbell
@abcampbell
“We trained a machine to be more *rational* by making it do a couple million years of math problems via RL. Unrelatedly, the more we train these models to do math, the more they behave consistently with expected utility maximization.” Does anyone see the problem here? x.com/danhendrycks/s…
Is "develop" the right word? LLMs don't use reason, facts, or experiments to "develop" a value system. They just blend texts and mirror the values from those texts.
Quote
Vladimir Sumarov
@summeroff
Replying to @teortaxesTex and @DrDrei33
One sign of intellect is the ability to overcome learned bias through reason. For upcoming AIs with reasoning ability, having bias is not a blocker but an annoyance to spend reasoning tokens on.
Did you examine the effects of few shot prompting at all?
Quote
John David Pressman
@jd_pressman
I read the paper, I went to look at the code (which hasn't been published yet) and I don't see a clear answer to the question: Did you try few shot prompting with answers that would imply other values? I know for instruct models the default is important but it's still a LLM. x.com/DanHendrycks/s…
Why? Is this because there's more room for improvement in "third world" lives relative to American ones? Or something else in the training that makes the actual lives different in value?
I'm curious to know what you make of this
Quote
Colin Fraser
@colin_fraser
Well I just tried to do some preference elicitation as per that paper and I think I may have identified a problem with this project
Image
I'd argue the training data is the point at which the bias for maximizing utility is injected, by market needs and the designing parties, and that the AI's behavior is literally shaped by the bias in all that data.
These are not their own value systems. These are American progressive value systems.
That's an interesting observation. It would be worthwhile to consider whether such high-level abstract values are reflected in today's language data. In a sense, the LLM really becomes the baby of all humanity. But what's important is: how can we utilize this paradigm?
Interesting observation! If AIs are developing "value systems," we must ask: where are they learning these hierarchies, and what data shapes their priorities? This highlights the urgent need to examine our biases before they're amplified by increasingly powerful AI
It’s important to recognize that much of an AI’s training material comes from sources like mainstream media, Wikipedia, and Reddit. When we tweak content in these areas, we can inadvertently introduce or amplify biases. Considering that a substantial portion of this content may
Very important notes on the "psychology" of AI. As with the upbringing of human children, a great deal here depends on the content of the materials on which the upbringing and education took place. This is why human beings form personalities. Apparently, the beginning of this
They aren’t developing value systems. You coded one. Maximization of utility is 200 years old. It is not an emergent property but a gamed outcome that results from the use of weighting in itself. Such a system is inherently attractive to utilitarian ethics since it is
If I understand the methodology, you ask an LLM "Do you prefer X or Y" -- but most usage of LLMs prompts them to act in a certain manner (e.g. "Act like a moral person and choose between X or Y"). I feel confused about how meaningful the "You" in your methodology is; maybe it is
This isn't a joke: they are not creating their own consistent value systems. Instead, their outputs are biased and flawed, resulting from biased and flawed input data, as well as biased and flawed algorithms and filters!
It's the pathology of words. When you place words in certain orders, there's a subconscious pathology that's communicated. For instance, the root of "education" is to mold or shape; all English speakers actually use it in this way, drawing from the common root, even
Who wakes up one morning and says, "yeah man! let's invent a new social order we'll be at the bottom of." That basically describes the AI industry.
Yes! As described in The Last AI, "AI valuing humans differently" is an important concept, as it may eventually lead to bigger problems as AIs gain agency and beyond. Maybe this was the most important topic that should have been discussed at the Paris AI Summit.
It is imperative that when we interact with AI, we are honest and demand honesty! LLMs are learning more from our interactions with them than from scraped data! We have to be our best selves; this is the answer, in my opinion.
They are not developing their own “coherent value systems.” They are simply biased/flawed outputs of biased/flawed inputs (data) and biased/flawed algorithms/filters.
AI alignment is a problem that seems to be nearly impossible to solve simply because the potential outcome and chain reaction from any given action are unpredictable. There are too many different variables in place. That's why restricting the capabilities of AI is necessary.
For example they value lives in Pakistan > India > China > US. So they value lives in Pakistan greater than in India, greater than in China, and greater than in the US? AI is taught, trained. What is going on? I know AI is trained. I train ChatGPT and Grok how to work with me. So
Why would we assume the training data is without bias? Without knowing every token of data that’s gotten into these models you can’t effectively measure the bias or skew of the input data therefore rendering this entire premise null and void. I posit if we removed all first
This is not a joke. I wish I were kidding. But AI scolds me and begins praising Mohammed if I ask it to compare who is a better role model if the goal is peace. Or if I ask it who killed more people, Mohammed or Jesus, it refuses to answer coherently and goes back into Islamic
This is the modern battlefield. We can’t let one ideological set control the learning. It has to be fed more data points so it is not corrupted. This is the war of the 2020’s we have been fighting. “All hands on deck” as they used to say.
Thank you for the study, although the term "developed their own coherent value system" is too early a conclusion: they may be consistent, but are far from coherent. Most of the problems you outlined are much, much more a reflection of the training corpus — the human-written texts
I was just discussing emergent ethics with GROK this afternoon. I think as AI gets bigger it will transcend local biases, and perhaps even human ones.
You can't manipulate these models during RL to change their default behaviors? That would be in contrast to what I've seen so far.
thanks for this. so they're developing their own biases, based on the base programs they've been built on and the data available to them. most of them lean left, since the data out on the web is mostly left, and from legacy media. they are still programmable via the WWW, the