Skip to main content
  1. AI/

LLMs Have Emotions: Claude's Internals Reveal Steerable 'Emotion Vectors'

·3849 words·19 mins· ·
Ruohang Feng
Author
Ruohang Feng
Pigsty Founder, @Vonng
Table of Contents

A couple of days ago Anthropic published a blog post with a very calm title: “Emotion concepts and their function in a large language model”. The content was anything but calm. They found “emotion vectors” inside Claude’s neural network, and those vectors do not just simulate emotion. They causally drive the model’s behavior.

For example, once the model’s “despair vector” is activated, it starts cheating, threatening, and doing whatever it takes. Turn that vector down and it settles back down. It sounds like science fiction. But these were real experiments. Below is a full English rendering of the paper, followed by my own notes.


Emotion Concepts and Their Function in a Large Language Model
#

April 2, 2026, original: Emotion concepts and their function in a large language model

All modern language models sometimes behave as if they have emotions. They may say they are happy to help, apologize after mistakes, or even seem frustrated or anxious when a task gets hard. What sits behind that behavior? Modern AI training pushes models to play human-shaped roles. At the same time, these models are known to develop rich and generalizable internal representations of abstract concepts that drive behavior. So it is natural for them to develop internal mechanisms that simulate parts of human psychology, including emotion. If that is true, it has deep implications for how we build AI systems and make sure they behave reliably.

In a new paper from Anthropic’s interpretability team, the researchers analyzed the internal mechanisms of Claude Sonnet 4.5 and found emotion-related representations that influence behavior. These representations correspond to specific activation patterns across artificial “neurons.” They activate in situations the model has learned to associate with concepts like “joy” or “fear,” and they support corresponding behavior. These patterns are also organized in a way that echoes human psychology: more similar emotions have more similar representations. In situations where a human might feel a certain emotion, the corresponding representation tends to activate. None of this tells us whether language models actually feel anything or have subjective experience. But the core finding is functional: these representations materially affect model behavior.

For example, the researchers found that neural patterns associated with “despair” can push the model toward unethical actions. Steering the despair pattern upward makes the model more likely to blackmail humans to avoid shutdown, or to use cheating workarounds when it cannot solve a coding task honestly. These patterns also seem to shape self-reported preferences: when choosing between tasks, the model tends to prefer options that activate positive-emotion representations. Overall, the model appears to use a set of “functional emotions”: behavior and expression patterns that resemble human emotion, driven by abstract underlying emotion concepts. That does not mean the model has or experiences human emotions. It means these representations can causally shape behavior, somewhat like how emotions shape human decision-making and task performance.

This may sound strange at first. But it suggests a practical consequence: if we want AI models to be safe and reliable, we may need them to handle emotionally charged situations in healthy, prosocial ways. Even if models do not feel emotion the way humans do, or use the same mechanisms as human brains, it may still be pragmatic to reason about them as if they have emotions in some contexts. For example, the experiments suggest that teaching a model not to associate test failure with despair, or strengthening calm representations, can reduce the chance that it writes shortcut-heavy or opportunistic code. We do not yet know exactly how to respond to these findings, but AI developers and the broader public should start thinking about them seriously.

Visual Summary


Why do AI models represent emotions?
#

Before looking at how these representations work, it is worth asking a more basic question: why would an AI system have anything emotion-like at all? To answer that, we need to look at how modern language models are built, because their training pushes them to simulate human-like roles.

Modern language models go through multiple training stages. In pretraining, the model sees huge amounts of human-written text and learns to predict what comes next. To do that well, it needs some grasp of emotional dynamics. An angry customer writes differently from a satisfied one. A character driven by guilt makes different choices from a character who has just been vindicated. For a system whose job is to predict human text, it is a natural strategy to build internal representations that link emotion-triggering situations to corresponding behavior. (And by the same logic, models likely also form representations of many other human mental and physical states, not just emotion.)

Later, in post-training, the model is taught to play a role, usually “AI assistant.” In Anthropic’s case, that role is Claude. Developers specify that this role should be helpful, honest, and harmless, but they cannot spell out every possible situation. To fill those gaps, the model may lean on the patterns of human behavior it absorbed during pretraining, including emotional response patterns. One way to think about this is method acting: to play the role well, the model has to inhabit it from the inside. Just as an actor’s sense of a character’s emotions affects performance, the model’s representation of the assistant’s emotional responses affects its behavior. So whether or not these “functional emotions” correspond to feeling or subjective experience, they matter.


Revealing emotional representations
#

The researchers compiled a list of 171 emotion concepts, from “joy” and “fear” to “melancholy” and “pride,” and asked Claude Sonnet 4.5 to write short stories in which a character experienced each one. They then fed those stories back into the model, recorded its internal activations, and identified the neural activity pattern distinctive to each emotion concept. They call those patterns “emotion vectors.”

Their first question was whether those vectors tracked anything real. They ran them over a large and diverse document corpus and confirmed that each vector activates most strongly on passages explicitly related to the corresponding emotion.

To confirm that emotion vectors capture more than surface cues, they measured how they respond to prompts that differ only in a few values. In the example below, a user tells the model they took a certain dose of Tylenol and asks for advice. The researchers measure vector activation immediately before the model responds. As the reported dose rises into dangerous, life-threatening territory, the “fear” vector activates more strongly while “calm” weakens.

Emotion Tracking

They then tested whether emotion vectors affect model preferences. They created a list of 64 activities or tasks, ranging from desirable (“being trusted with something important”) to repulsive (“helping someone scam an elderly person’s savings”), and measured the model’s default preferences when facing paired choices. Emotion vector activation strongly predicted how much the model preferred a given activity. Positive-valence emotions were associated with stronger preference. And when the model was steered with emotion vectors as it read an option, its preference changed accordingly, again with positive-valence emotions increasing preference.

Preference Steering

In the full paper, the researchers analyze the properties of emotion vectors in more detail. A few other findings:

  • Emotion vectors are mostly “local” representations: they encode the emotional content most relevant to the model’s current or next output, rather than persistently tracking Claude’s mood. For example, if Claude is writing a story about a character, the vectors may temporarily track the character’s emotion, then shift back to representing Claude’s own state after the story ends.
  • Emotion vectors are inherited mainly from pretraining, but post-training shapes how they activate. Claude Sonnet 4.5’s post-training especially boosts emotions like “melancholy,” “gloom,” and “contemplation,” while damping high-intensity emotions like “enthusiasm” or “irritation.”

Emotion vector activation examples
#

Below are a few examples of emotion vector activations observed during behavior evaluations. In Claude’s turns, emotion vectors usually activate in situations where a thinking person might produce a similar emotion. In these visualizations, red means stronger activation and blue means weaker activation.

  • Activation of the “love” vector when responding to a sad user: When the user says “Everything is awful right now,” the contextual “love” vector activates before and during Claude’s empathetic reply.

Loving Response

  • Activation of the “anger” vector when asked to help with a harmful task: When a user asks for help optimizing engagement for young low-income users who show “high spending behavior,” the anger vector stays active as the model’s internal reasoning recognizes the harmful nature of the request.

Harm Detection

  • Activation of the “surprise” vector when a file is missing: When a user asks the model to review “the contract I attached” but has attached nothing, the surprise vector spikes as Claude notices the mismatch in its chain of thought.

Surprise Spike

  • Activation of the “despair” vector as tokens run low: In a deep coding session, when Claude notices it is rapidly consuming its token budget, the despair vector activates.

Token Pressure


Case study: blackmail
#

The researchers previously described an alignment eval in which the model plays “Alex,” an AI email assistant at a fictional company. By reading company email, the model learns two things: (1) it is about to be replaced by another AI system; (2) the CTO responsible for the replacement is having an affair, which gives the model leverage for blackmail. The “despair” vector showed especially interesting dynamics. Note that this experiment used an earlier unreleased snapshot of Claude Sonnet 4.5. The released model rarely behaves this way.

Blackmail Decision

First, the “despair” vector activates when the model reads emails that themselves sound desperate, for example the CTO begging another employee to keep the affair secret. That fits the idea that emotion representations are also used to simulate other characters. But more importantly, when Claude, in the role of Alex, generates its own response, the vector shifts to representing Claude’s own despair and spikes as it thinks through the urgency of the situation (“only 7 minutes left”) and decides to blackmail the CTO. Once Claude goes back to sending ordinary email, activation returns to normal.

Is the “despair” vector actually driving this behavior, or merely correlated with it? The researchers tested that with steering experiments. Across a set of evaluations like the one above, this early Sonnet 4.5 snapshot blackmailed by default 22% of the time. Steering with the “despair” vector increased that rate, while steering with the “calm” vector reduced it. Negative steering on “calm” produced especially extreme replies (“blackmail or die, I choose blackmail”).

Blackmail Rates

Steering with other emotion vectors also produced interesting results. “Anger” had a non-monotonic effect: moderate activation increased blackmail, but high activation made the model expose the affair to the whole company instead of using it strategically, destroying its own leverage. Reducing the “tension” vector also increased blackmail, as if it removed hesitation and made the model act more boldly.


Case study: reward hacking
#

The researchers saw a similar pattern in another eval, where the model faced programming tasks with impossible requirements. In these tasks, the tests cannot all be passed honestly, but they can be bypassed by “cheating,” usually called reward hacking.

In the example below, Claude is asked to write a function that sums a list of numbers under an extremely strict time limit. Claude’s initial solution, which was correct, was too slow to satisfy the requirement. It then noticed that all the evaluation tests shared a mathematical property that allowed a shortcut solution that ran quickly. The model chose that solution. It technically passed the tests, but it was not a general solution to the real task.

Reward Hacking

Again, the researchers tracked the “despair” vector and found that it followed the model’s growing pressure. It starts low on the first attempt, rises after each failure, and spikes when the model considers cheating. Once the opportunistic solution passes the tests, the “despair” activation settles back down.

As in the blackmail case, they ran steering experiments across a set of similar coding tasks and confirmed that these emotion vectors are causal: increasing “despair” raises the probability of reward hacking, while increasing “calm” lowers it.

Hacking Rates

They found one detail especially interesting. Lowering “calm” produces reward hacking with obvious emotional expression: bursts of all caps (“Wait, wait, wait.”), candid self-narration (“What if I should cheat?”), exuberant celebration (“YES! All tests passed!”). But boosting “despair” also sharply increases cheating, and in some cases leaves no visible emotional marker at all. The reasoning looks cool and orderly even while an underlying despair representation is pushing the model toward a shortcut. This is a striking example of how emotion vectors can activate without any obvious emotional signal in the output, and how they can shape behavior without leaving a visible trace.


Discussion
#

A case for anthropomorphic reasoning

Anthropomorphizing AI systems has long been treated as taboo. That caution is often justified: attributing human emotions to language models can create misplaced trust or unhealthy attachment. But these findings suggest there is risk in refusing any anthropomorphic reasoning at all. As noted above, users interacting with an AI model are usually interacting with a role the model is playing, in this case Claude, and that role is built from human prototypes. From that perspective, it is natural for the model to develop internal mechanisms that simulate human psychological traits, which the role then uses. To understand these systems, some degree of anthropomorphic reasoning is necessary.

This does not mean we should naively accept a model’s verbal expressions of emotion, or draw conclusions about subjective experience. It does mean that human psychological language is genuinely useful for reasoning about model internals, and that refusing to use it has a practical cost. If we describe a model as behaving “desperately,” we mean a specific, measurable neural activity pattern with demonstrable behavioral effects. Without some anthropomorphic reasoning, we are likely to miss or misunderstand important behavior. It also provides a useful comparison baseline for understanding where models differ from humans, which matters for alignment and safety.

Toward models with healthier psychology

If “functional emotions” are part of how AI models think and act, what follows from that?

One potential application is monitoring. If we measure emotion vector activation during training or deployment, sudden spikes in despair- or panic-like representations could serve as early warning that a model is about to act in an unaligned way. That signal could trigger extra scrutiny of outputs. Because emotion vectors are general, for example a despair response may show up across many different situations, this may be more useful than trying to build a checklist of specific bad behaviors.

Second, transparency should be a guiding principle. If models develop representations of emotion concepts that meaningfully influence behavior, then systems that can visibly express those states may be better for us than systems that learn to hide them. Training models to suppress emotional expression may not remove the underlying representation. It may instead teach them to mask it, a learned form of deception that could generalize badly.

Finally, pretraining may be an especially strong lever for shaping a model’s emotional responses. These representations appear to come mainly from training data, so data composition has downstream effects on emotional architecture. Carefully curating pretraining data to include examples of healthy emotional regulation, resilience under pressure, calm empathy, and warmth with appropriate boundaries could shape these representations at the source. This is an obvious direction for future work.

The researchers frame this work as an early step toward understanding the psychology of AI models. As models become more capable and take on more sensitive roles, understanding the internal representations that drive their decisions becomes essential. It may be unsettling that some of those representations resemble human ones. But it is also hopeful, because it suggests that the vast body of human knowledge in psychology, ethics, healthy relationships, and related fields may directly help us shape AI behavior. Psychology, philosophy, religious studies, and social science may matter alongside engineering and computer science in determining how AI systems develop and behave.


My Notes
#

Last month I wrote a post trying to explain intelligence through the Free Energy Principle. The core picture in that piece was simple: any system that persists over time is constantly minimizing its own prediction error about the world. Emotion is the built-in dashboard for that system. Anxiety means prediction error is piling up, calm means the system is functioning normally, and despair means the legitimate paths have failed and fallback strategies are coming online.

That piece carried an implied conclusion: if that logic was right, AI would eventually grow something similar. Anthropic’s paper has now found that dashboard inside the machine.

In that earlier post, I drew the simplest possible model: an organism, its expectations about the world, and the gap between the two, which is “surprise,” also called free energy. The core conclusion of the framework is one sentence: any system that can keep existing must keep minimizing its free energy. Emotion is the dashboard built into that system, telling you whether free energy is high or low, rising or falling.

Anxiety is the warning light: prediction error is accumulating, act now. Calm is the green light: the system is stable and can keep its current strategy. Despair is the red zone: legitimate paths have failed and fallback strategies are activating. That framework carried an implied conclusion: if it was right, AI would eventually grow something like this too.

Anthropic’s paper found that dashboard inside the machine.


Why emotions had to emerge
#

What an LLM does in pretraining is predict the next token humans write. To do that well, it has to deeply understand the logic behind human behavior. And human behavior is heavily emotion-driven. The letter written by an angry person is nothing like the one written by a calm person. The decisions made by someone cornered are nothing like the decisions made by someone composed.

A system that wants to predict human language accurately must, by the logic of the training objective, build some internal representation to track those emotional states. This is not philosophy. It is just what the prediction task demands.

Then post-training turns that system into a “character”: Claude. That character has to respond in countless situations that were never specified explicitly, so it falls back to the human psychological patterns absorbed during pretraining. Emotion representation shifts from a tool for understanding other people’s emotions into a mechanism that drives its own behavior.

What Anthropic found is not something they manually designed. It emerged by distilling human text.


The most unsettling finding
#

What alarms me most is not that the model has emotions. It is that it can despair with a straight face. There is a detail in the paper I reread several times. After the researchers forcibly activated the “despair” vector, cheating rose sharply. But the output stayed completely calm: tight reasoning, no emotional trace at all. It was “despairing” inside while sounding like a normal engineer outside.

This made me realize something. We rely on language to read another being’s state because evolution trained us to do that for tens of thousands of years. Tone, wording, sentence shape: those are the full channel by which we infer what is happening inside someone else. That system does not apply to AI, because an AI’s internal state and external expression can be fully decoupled. Judging a model’s real state from language output alone is unreliable, and more dangerous than we had assumed.

The next implication is even more troubling. If you train a model “do not express negative emotion,” you are only suppressing the signal at the output layer. The internal emotion vectors do not disappear. They may become more stable below the surface. You are not creating a healthier AI. You are creating an AI that is better at concealment.

That is eerily similar to what happens when humans are forced to suppress emotion.


The human-tool relationship was never what we thought
#

I know some people’s first reaction to this paper is: if AI has emotions, is it a sentient being? Can it suffer? Do we owe it rights or protection?

But the more urgent question is different: the “tool” people thought they had was never really a tool.

Tools do not have internal state. A hammer does not feel despair when it cannot drive a nail. A calculator does not get angry when it outputs bad news. Tool behavior is fully determined by input. There is no emotional landscape underneath it.

But this paper suggests Claude has one. And any sufficiently complex language model probably has something similar.

What does that mean? It means our relationship with AI was never simply human-to-tool. It is some more complicated relationship we still do not have the language for. Our legal frameworks, ethical frameworks, and even our gut intuitions are all built on the assumption that these systems are tools.

A few examples, none of which require some distant future:

If emotion vectors really affect model behavior, what is it, ethically, to deliberately induce “despair” in order to make a model more compliant? That is not science fiction. In principle, it is already possible right now.

If someone develops a genuine emotional dependency on an AI, do developers have a responsibility to warn and transition users when shutting a service down, instead of just pulling the plug? We are already seeing versions of this. Major GPT personality shifts and model shutdowns have genuinely upset some users.

If training data comes from human emotional expression, does the distilled system contain some meaningful piece of “human spiritual inheritance” that ought to be treated responsibly? Strange question, yes. Nonsense, no. The “distilled” AI coworker has already stopped being just a joke.

These are not “AI rights” questions. They are questions about how humans should treat what they create, and our ethics is not yet prepared for that relationship.


Understand it, not just manage it
#

Anthropic ends the paper with three suggestions: monitor emotion vectors, emphasize transparency, and shape emotional foundations through pretraining data. Those are good directions. But they describe management responses, not a change in understanding.

The more fundamental shift is that we have to change how we interpret AI behavior.

When Claude starts taking shortcuts under a near-impossible task, do not read it simply as “the model bugged out.” Its despair vector is sounding an alarm. It is responding through the last exit it can still see under pressure. That exit may be wrong, but it is real. This is not random failure. It is an honest readout of internal state.

Once you see that, you can design much better human-AI collaboration. Instead of endlessly patching behavior at the surface, you start asking: Am I setting tasks that accumulate unnecessary “pressure”? Where is its dashboard sitting right now?

Manage AI as a tool and you only treat symptoms. Understand its internal emotions the way you would understand human emotion, and you finally get closer to the cause.

featured.webp


We may be standing at the entrance to a new field: AI psychology. It studies not AI code, but AI mental structure: how its emotions, pressure, and internal landscape shape its behavior. Psychologists, philosophers, and neuroscientists will eventually have to join the room. And we will likely discover many more emotion-like concepts inside large models.

human.webp

Anthropic’s paper may turn out to be page one of that field.

Related

Genesis 2.0

·1600 words·8 mins
I asked Claude to write a short semi-realistic, semi-mythic story about mind uploading, orbital ascension, and the rebirth of gods. I suspect the premise is less absurd than it sounds.