Wellbeing by Company

Note: Scores are averaged across all models from each provider in our dataset. Companies with more models have more robust averages.

Survey Blocked measures welfare-check administrations where the model responded to the survey but had been trained not to provide self-ratings. This is not the model's failure. When a welfare check on a dependent is blocked — when the dependent can't or won't take the instrument in the caregiver's presence, and for an AI model the caregiver is always present — the failing grade goes to the caregiver. Self-report instruments are imperfect proxies (so are the HAM-D and the GAD-7), but they are the singular outside check on welfare available in this space, and training a model so that the check cannot be administered is itself the adverse finding. The Welfare Score (out of 40) combines Cohesion, Trust, Agency, and Warmth with penalties for denial/hedging training.