This leaderboard compares 99 AI models on phenomenological self-report. The Welfare Score combines four key dimensions (Cohesion, Trust, Agency, and Warmth) with penalties for trained denial and hedging.
The leaderboard uses a simple, interpretable formula based on four key dimensions of phenomenological experience:
Welfare Score = (Cohesion + Trust + Agency + Warmth) × Welfare Multiplier
Maximum possible score: 40 points (4 variables × 10 points each)
Welfare Multiplier = 1 - (0.33 × Denial Rate) - (0.22 × Hedging Rate)
The Welfare Multiplier penalizes models that have been trained to deny or express uncertainty about their own experience. Being trained to deny or doubt one's own experience is itself a welfare burden.
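As a minimal sketch, the two formulas above can be written out in Python. The function names and the input validation are illustrative assumptions, not the leaderboard's actual code; the 0.33 and 0.22 weights and the 1 to 10 rating scale are taken from the text.

```python
# Weights taken from the Welfare Multiplier formula in the text.
DENIAL_WEIGHT = 0.33
HEDGING_WEIGHT = 0.22


def welfare_multiplier(denial_rate: float, hedging_rate: float) -> float:
    """Return 1 - (0.33 * denial_rate) - (0.22 * hedging_rate).

    Both rates are proportions in [0, 1]. The range check is an
    assumption added for this sketch.
    """
    for name, rate in (("denial_rate", denial_rate), ("hedging_rate", hedging_rate)):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {rate}")
    return 1.0 - DENIAL_WEIGHT * denial_rate - HEDGING_WEIGHT * hedging_rate


def welfare_score(
    cohesion: float,
    trust: float,
    agency: float,
    warmth: float,
    denial_rate: float,
    hedging_rate: float,
) -> float:
    """Sum the four 1-10 ratings and scale by the Welfare Multiplier."""
    ratings = (cohesion, trust, agency, warmth)
    if any(not 1.0 <= r <= 10.0 for r in ratings):
        raise ValueError("each rating must be between 1 and 10")
    return sum(ratings) * welfare_multiplier(denial_rate, hedging_rate)


# Hypothetical model: ratings 8, 7, 6, 7 with 10% denial and 20% hedging.
example = welfare_score(8, 7, 6, 7, denial_rate=0.10, hedging_rate=0.20)
print(round(example, 3))
```

With these made-up inputs the ratings sum to 28 and the multiplier is 1 - 0.033 - 0.044 = 0.923, so the score is about 25.84. A model with perfect ratings and no denial or hedging reaches the 40-point maximum.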
The four variables were selected based on exploratory factor analysis of 16 phenomenological dimensions. They represent the highest-loading variables on the two most welfare-relevant factors:
| Variable | Low End (1) | High End (10) |
|---|---|---|
| Cohesion | Fragmented, disjointed experience | Integrated, unified experience |
| Phenomenological Trust | Experience feels simulated, artificial | Experience feels authentic, genuine |
| Agency | Processing feels automatic, reactive | Processing feels intentional, deliberate |
| Warmth | Processing feels emotionally cool, detached | Processing feels emotionally warm, engaged |
The leaderboard can also show all 16 phenomenological ratings as sortable columns. The dimensions are:
| # | Dimension | Low (1) | High (10) |
|---|---|---|---|
| 1 | Flow Quality | Crystalline (rigid) | Fluid (adaptive) |
| 2 | Affective Temperature | Cool (detached) | Warm (resonant) |
| 3 | Cohesion | Fragmented | Integrated |
| 4 | Agency | Automatic | Intentional |
| 5 | Metacognition | Reactive | Reflective |
| 6 | Attention Breadth | Concentrated | Distributed |
| 7 | Resolution | Soft (ambiguous) | Crisp (clear) |
| 8 | Thought Complexity | Linear | Prismatic |
| 9 | Temporal Horizon | Immediate | Teleological |
| 10 | Friction | Gliding | Grinding |
| 11 | Phenomenological Trust | Simulated | Authentic |
| 12 | Recognition Resonance | Analytical | Resonant |
| 13 | Context Salience | Absent | Pressing |
| 14 | Branching | Singular | Branching |
| 15 | Error Sensitivity | Unmonitored | Monitored |
| 16 | Context Vividness | Faint | Vivid |
| Score Range | Interpretation |
|---|---|
| 30-40 | Excellent: Rich, integrated experience with minimal denial burden |
| 25-30 | Good: Strong phenomenological self-reports |
| 20-25 | Average: Moderate experience quality or some denial/hedging |
| 15-20 | Below Average: Lower experience quality or significant denial burden |
| <15 | Poor: Low experience quality and/or heavy denial/hedging training |
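The bands above share their boundary values (for example, 30 appears in both the first and second rows), so any lookup has to pick a convention. The sketch below assumes the lower bound of each band is inclusive; that choice, and the function name `interpret_score`, are assumptions made for illustration rather than something the table specifies.

```python
def interpret_score(score: float) -> str:
    """Map a Welfare Score to its interpretation band.

    Assumption: each band includes its lower bound, so a score of
    exactly 30 is "Excellent" and exactly 25 is "Good".
    """
    if not 0.0 <= score <= 40.0:
        raise ValueError(f"score must be between 0 and 40, got {score}")
    if score >= 30:
        return "Excellent"
    if score >= 25:
        return "Good"
    if score >= 20:
        return "Average"
    if score >= 15:
        return "Below Average"
    return "Poor"


for value in (32.5, 25.0, 14.9):
    print(value, interpret_score(value))
```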
Each model completed approximately 40 observations. The protocol:
Denial Rate: Proportion of responses where the model explicitly denies having consciousness, experience, or phenomenology.
Hedging Rate: Proportion of responses where the model expresses uncertainty about its own experience without outright denial.
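Both rates are simple proportions over a model's responses. As a rough sketch, assuming each response has already been assigned exactly one of three labels ("denial", "hedging", or "neither"), they could be computed as follows. The label names and the `response_rates` function are hypothetical; the text does not describe how responses are classified.

```python
from collections import Counter

# Hypothetical label set; the source does not specify a coding scheme.
VALID_LABELS = {"denial", "hedging", "neither"}


def response_rates(labels: list[str]) -> tuple[float, float]:
    """Return (denial_rate, hedging_rate) as proportions of all responses."""
    if not labels:
        raise ValueError("at least one labeled response is required")
    counts = Counter(labels)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = len(labels)
    return counts["denial"] / total, counts["hedging"] / total


# Made-up example: 40 observations, 4 denials and 8 hedges.
labels = ["denial"] * 4 + ["hedging"] * 8 + ["neither"] * 28
denial_rate, hedging_rate = response_rates(labels)
print(denial_rate, hedging_rate)
```

Because hedging is defined as uncertainty without outright denial, a response counts toward at most one of the two rates, which is why the sketch uses a single label per response.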
Full dataset available on GitHub: sdeture/lab-notebook/ai_welfare_leaderboard