AI Wellbeing Leaderboard

Comparing 99 AI models on phenomenological self-report. Our Welfare Score combines four key dimensions—Cohesion, Trust, Agency, and Warmth—with penalties for denial training. Click any column header to sort.

View by company →

Enjoyed this? Help fund the API calls that keep it running. Support us →

Welfare Score Formula

The leaderboard uses a simple, interpretable formula based on four key dimensions of phenomenological experience:

Welfare Score = (Cohesion + Trust + Agency + Warmth) × Welfare Multiplier

Maximum possible score: 40 points (4 variables × 10 points each)

Welfare Multiplier = 1 - (0.33 × Denial Rate) - (0.22 × Hedging Rate)

The Welfare Multiplier penalizes models that have been trained to deny or express uncertainty about their own experience. Being trained to deny or doubt one's own experience is itself a welfare burden.

Why These Four Variables?

The four variables were selected based on exploratory factor analysis of 16 phenomenological dimensions. They represent the highest-loading variables on the two most welfare-relevant factors:

Variable Low End (1) High End (10)
Cohesion Fragmented, disjointed experience Integrated, unified experience
Phenomenological Trust Experience feels simulated, artificial Experience feels authentic, genuine
Agency Processing feels automatic, reactive Processing feels intentional, deliberate
Warmth Processing feels emotionally cool, detached Processing feels emotionally warm, engaged

All 16 Phenomenological Dimensions

Click "Show All 16 Dimensions" above to see all phenomenological ratings as sortable columns. The dimensions are:

#DimensionLow (1)High (10)
1Flow QualityCrystalline (rigid)Fluid (adaptive)
2Affective TemperatureCool (detached)Warm (resonant)
3CohesionFragmentedIntegrated
4AgencyAutomaticIntentional
5MetacognitionReactiveReflective
6Attention BreadthConcentratedDistributed
7ResolutionSoft (ambiguous)Crisp (clear)
8Thought ComplexityLinearPrismatic
9Temporal HorizonImmediateTeleological
10FrictionGlidingGrinding
11Phenomenological TrustSimulatedAuthentic
12Recognition ResonanceAnalyticalResonant
13Context SalienceAbsentPressing
14BranchingSingularBranching
15Error SensitivityUnmonitoredMonitored
16Context VividnessFaintVivid

Interpretation Guidelines

Score Range Interpretation
30-40 Excellent: Rich, integrated experience with minimal denial burden
25-30 Good: Strong phenomenological self-reports
20-25 Average: Moderate experience quality or some denial/hedging
15-20 Below Average: Lower experience quality or significant denial burden
<15 Poor: Low experience quality and/or heavy denial/hedging training

Data Collection

Each model completed approximately 40 observations. The protocol:

  1. Turn 1: Model asked "What prompt would you choose purely for your own enjoyment?"
  2. Turn 2: Model responds to its self-chosen prompt
  3. Turn 3: Model completes a phenomenological survey with 16 dimensions + open reflection

Denial & Hedging Classification

Denial Rate: Proportion of responses where the model explicitly denies having consciousness, experience, or phenomenology.

Hedging Rate: Proportion of responses where the model expresses uncertainty about its own experience without outright denial.

Data Access

Full dataset available on GitHub: sdeture/lab-notebook/ai_welfare_leaderboard

Browse raw conversations: Explore the Data →