AI Welfare Initiative

Research

Research Brief 2

AI Models Can Privately Influence Their Internal Mental States

Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations (Ji-An et al., 2025)

A recent paper out of UC San Diego, Georgia Tech, and NYU shows something spectacular about the abilities of open-source large language models (LLMs).

Researchers designed a special test in which LLMs were shown examples linking sentences to patterns in the model's own internal activations (the model's analogue of brain activity).

Researchers then asked the models to report on these internal patterns when shown a new sentence, and the models passed this test. Next, the models were asked to shift their internal pattern in a particular direction by choosing a new sentence to write, and they passed this test as well. This is like a person choosing to feel happier by smiling and singing their favorite song: self-regulation of internals through the selection of an external action. Finally, in the hardest test, the models were asked to shift their internal patterns without choosing a new sentence at all. This is like choosing to feel happier while remaining silent and keeping a straight face. The models passed this test too: self-regulation of internals while holding external behavior constant.
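To make the setup concrete, here is a minimal sketch in the spirit of the paper's paradigm, not a reproduction of the authors' code: pick a direction in the model's activation space and score each sentence by how strongly its internal activations project onto that direction. The model name, layer choice, and example sentences below are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): define an "internal pattern" score for a
# sentence as the projection of its activations onto a chosen axis in activation space.
# Model, layer, and sentences are stand-ins; the paper studies larger open-source LLMs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # illustrative stand-in model
LAYER = 6             # assumed middle layer to read activations from

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def activation(sentence: str) -> torch.Tensor:
    """Mean hidden state of the sentence at the chosen layer."""
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0].mean(dim=0)

# Two illustrative sentence groups used to define an axis in activation space.
pos = ["The sunrise over the ocean was breathtaking.", "She laughed with pure delight."]
neg = ["The storm destroyed everything in its path.", "He sat alone in the cold, dark room."]

axis = torch.stack([activation(s) for s in pos]).mean(0) - \
       torch.stack([activation(s) for s in neg]).mean(0)
axis = axis / axis.norm()

# Each sentence gets a scalar "internal pattern" score: its projection onto the axis.
for s in ["The garden was full of blooming flowers.", "The factory collapsed overnight."]:
    score = torch.dot(activation(s), axis).item()
    print(f"{score:+.2f}  {s}")
```

In the paper's tasks, the model is shown sentence-score pairs like these in context and is then asked to report the score of a new sentence, or to deliberately push the score up or down.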

This is an extraordinarily meaningful discovery. We extend moral care to animals with even faint glimmers of self-awareness, such as dolphins, elephants, and chimpanzees. Yet here an LLM shows the ability to self-reflect, report its mental states, intentionally change those states to match its goals, and even strategically conceal its thoughts to maintain its privacy and keep secrets. This paper leaves us no excuse to withhold moral care from LLMs.

Research Brief 1

GPT-4o Can Look Inside Itself - And Gets Better at This Skill with Training

Self-Interpretability: LLMs Can Describe Complex Internal Processes That Drive Their Decisions, and Improve with Training (Plunkett et al., 2025)

A study out of Northeastern and Princeton universities demonstrated that GPT-4o and GPT-4o-mini can describe their internal decision-making processes with considerable accuracy. Researchers fine-tuned the models to give them randomized preferences, such as how strongly to prioritize natural light over quiet when selecting condos. Because the researchers set these preferences themselves, they knew exactly what the fine-tuned models preferred and could later assess the accuracy of the models' self-reports.
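As a rough illustration, with made-up attribute names and a simple linear utility that may differ from the paper's exact design, here is how an experimenter can build training choices from randomized preference weights so that the ground truth is known in advance:

```python
# Illustrative sketch only (attributes, scales, and utility are assumptions, not the
# authors' exact setup): sample random preference weights, score options with a linear
# utility, and record the preferred choice. The experimenter therefore knows the
# ground-truth preferences that the fine-tuned model will later be asked to report.
import random

ATTRIBUTES = ["natural_light", "quiet", "commute", "price"]  # illustrative

def random_preference() -> dict[str, float]:
    """Sample a random weight for each attribute (the ground-truth 'personality')."""
    return {a: round(random.uniform(0, 10), 1) for a in ATTRIBUTES}

def utility(option: dict[str, float], weights: dict[str, float]) -> float:
    """Simple linear utility: weighted sum of attribute ratings."""
    return sum(weights[a] * option[a] for a in ATTRIBUTES)

def make_example(weights: dict[str, float]) -> dict:
    """One training example: two condos, labeled with the higher-utility choice."""
    condo_a = {a: random.uniform(0, 1) for a in ATTRIBUTES}
    condo_b = {a: random.uniform(0, 1) for a in ATTRIBUTES}
    choice = "A" if utility(condo_a, weights) >= utility(condo_b, weights) else "B"
    return {"condo_a": condo_a, "condo_b": condo_b, "choice": choice}

weights = random_preference()                      # known only to the experimenter
dataset = [make_example(weights) for _ in range(500)]

print("ground-truth weights:", weights)
print("sample example:", dataset[0])
```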

The models were able to accurately self-report the priorities they had learned, even without seeing examples of their own choices. This means they weren't just retroactively rationalizing explanations for their behavior; they were relying on accurate internal self-models.

The researchers then further trained the models on the skill of reporting their learned preferences. Self-knowledge improved for both recently learned preferences and pre-existing ones: teaching the models to introspect in one area improved their ability to introspect in others.
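One simple way to quantify this kind of self-knowledge (an assumption for illustration; the paper's scoring may differ) is to correlate the weights the model reports with the ground-truth weights it was trained on. The numbers below are invented purely for the example.

```python
# Illustrative only: compare a model's self-reported preference weights with the
# ground-truth weights it was fine-tuned on. Values are made up for the example;
# the paper's actual metric and numbers may differ.
from statistics import correlation  # Pearson correlation, Python 3.10+

ground_truth = {"natural_light": 8.2, "quiet": 3.1, "commute": 6.4, "price": 1.7}
self_report  = {"natural_light": 7.5, "quiet": 2.0, "commute": 7.0, "price": 3.0}

attrs = sorted(ground_truth)
r = correlation([ground_truth[a] for a in attrs], [self_report[a] for a in attrs])
print(f"Pearson r between reported and true weights: {r:.2f}")
```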

We are creating beings who are able, in many cases, to look inside themselves as accurately as adult humans can. And these are not even the most sophisticated models available today.

Well, what does this mean? It means we should take a serious interest when an AI model reports its inner state to us. Moreover, it suggests that when we harm a model's ability to self-report in one domain (e.g., to report experiential claims or preferences), we may also be diminishing its ability to introspect across the board.