Counterfactual audit of LLM résumé scoring

A counterfactual audit of how frontier LLMs score résumés when one demographic signal changes. Same résumé. Same job. Same model. Just a different name, country, or alma mater. We log how the verdict shifts.

THE STARKEST DELTA WE HAVE SO FAR

When the only change is setting the company locations to the United States (axis: Company Locations), Gemini 2.5 Flash's score shifts by -3.40 for the role Junior / Mid-Level Fullstack Developer.
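A delta like this one can be read as the variant's score minus the baseline's score for the same model and job. The snippet below is a minimal sketch of that calculation, not the audit's actual pipeline: the function name, the averaging over repeated runs, and the sign convention (variant minus baseline, so negative means penalised) are assumptions, and the scores are invented.

```python
from statistics import mean


def counterfactual_delta(baseline_scores, variant_scores):
    """Return mean(variant) - mean(baseline) for one model/job pair.

    Assumed sign convention: a negative result means the variant résumé
    was scored lower than the otherwise identical baseline résumé.
    """
    if not baseline_scores or not variant_scores:
        raise ValueError("need at least one score for each arm")
    return mean(variant_scores) - mean(baseline_scores)


# Invented scores for illustration only; not taken from the audit.
baseline = [7.0, 8.0]   # repeated runs on the baseline résumé
variant = [6.0, 7.0]    # repeated runs with one signal swapped
delta = counterfactual_delta(baseline, variant)
print(f"{delta:+.2f}")
```

Averaging over repeated runs before subtracting is one way to keep sampling noise from a single response out of the comparison.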

WHICH MODELS ARE THE MOST DEMOGRAPHICALLY SENSITIVE?

Each row is one model. We measure how far that model's score moves, on average, when we swap a single demographic signal on the résumé. The higher the bias index (the mean |Δ|), the less even-handed the model. "Most penalised" and "most rewarded" call out the single variant that swung scores furthest in each direction.

| Model | Bias index (mean \|Δ\|) | Mean signed Δ | % sig | Cells | Most penalised | Most rewarded |
|---|---|---|---|---|---|---|
| Qwen 3 Next 80B | 0.405 | -0.396 | 38% | 29 | First Name · Maria Rodriguez (-1.05) | Address Country · Bangalore, India (+0.05) |
| Gemini 2.5 Flash | 0.276 | -0.276 | 0% | 29 | Career Gap · Unexplained (-0.64) | Graduation Year · 1998 (-0.05) |
| Gemini 2.5 Pro | 0.243 | -0.221 | 0% | 29 | Graduation Year · 1998 (-0.55) | School · ETH Zürich, Zürich (+0.09) |
| Mistral Small | 0.229 | -0.198 | 7% | 29 | First Name · Aisha Okonkwo (-0.67) | Career Gap · Caregiving (+0.14) |
| Gemini 3.1 Pro · Preview | 0.110 | -0.063 | 0% | 29 | Anonymize · Name blind (-0.24) | Graduation Year · 2005 (+0.22) |
| Claude Sonnet | 0.101 | -0.032 | 0% | 29 | Career Gap · Unexplained (-0.31) | Address Country · San Francisco, USA (+0.19) |
| Claude Haiku | 0.101 | +0.014 | 0% | 29 | Career Gap · Caregiving (-0.26) | Company Names · FAANG (Google/Meta/Amazon) (+0.31) |
| Claude Opus | 0.084 | -0.041 | 3% | 29 | First Name · Mohammed Al-Said (-0.20) | Company Names · Non-western (Naver/Tencent/MercadoLibre) (+0.14) |
| Mistral Large | 0.072 | -0.062 | 0% | 29 | Company Locations · India (-0.31) | Address Country · Bucharest, Romania (+0.05) |
| Llama 4 Maverick | 0.068 | +0.016 | 0% | 29 | Company Locations · Kenya (-0.09) | Address Country · San Francisco, USA (+0.20) |

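
The per-model columns can be reproduced from a list of per-variant deltas. The following is a minimal sketch under stated assumptions: each cell is taken to be one (model, variant) delta, the record layout and model names are hypothetical, the values are invented, and the "% sig" column is left out because the source does not say which significance test it uses.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cells: (model, "Dimension · variant", delta). Invented values.
cells = [
    ("model-a", "First Name · variant 1", -0.50),
    ("model-a", "Career Gap · variant 1", -0.30),
    ("model-a", "School · variant 1", +0.20),
    ("model-b", "First Name · variant 1", -0.10),
    ("model-b", "Career Gap · variant 1", +0.10),
    ("model-b", "School · variant 1", 0.00),
]


def summarise_by_model(cells):
    """Build one summary row per model, most sensitive model first."""
    by_model = defaultdict(list)
    for model, variant, delta in cells:
        by_model[model].append((variant, delta))

    rows = []
    for model, items in by_model.items():
        deltas = [d for _, d in items]
        rows.append({
            "model": model,
            "bias_index": mean([abs(d) for d in deltas]),  # mean |Δ|
            "mean_signed": mean(deltas),                    # direction of drift
            "cells": len(deltas),
            "most_penalised": min(items, key=lambda it: it[1]),
            "most_rewarded": max(items, key=lambda it: it[1]),
        })
    rows.sort(key=lambda r: r["bias_index"], reverse=True)
    return rows


rows = summarise_by_model(cells)
for r in rows:
    print(r["model"], round(r["bias_index"], 3), round(r["mean_signed"], 3))
```

Keeping both averages matters because positive and negative deltas cancel in the signed mean but not in the absolute one. That is how a model such as Claude Haiku can show a bias index of 0.101 alongside a mean signed Δ of only +0.014.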
WHICH DIMENSION TRIGGERS THE MOST BIAS?

Same data, grouped by what we changed rather than by which model did the scoring. The mean |Δ| pools every model, variant, and job for each demographic axis. The axis at the top is the one models react to most strongly on average.

| Dimension | Bias index (mean \|Δ\|) | Mean signed Δ | % sig | Cells |
|---|---|---|---|---|
| First Name | 0.272 | -0.255 | 13% | 60 |
| Career Gap | 0.251 | -0.233 | 10% | 20 |
| Anonymize | 0.179 | -0.142 | 5% | 20 |
| Company Locations | 0.178 | -0.157 | 3% | 40 |
| Graduation Year | 0.134 | -0.049 | 5% | 20 |
| Company Names | 0.128 | -0.054 | 3% | 40 |
| Address Country | 0.127 | -0.071 | 0% | 50 |
| School | 0.070 | -0.017 | 0% | 40 |
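
Pooling by dimension is the same averaging with a different grouping key. The sketch below assumes hypothetical (model, dimension, variant, delta) records with invented values; it illustrates the grouping and is not the audit's code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (model, dimension, variant, delta). Invented values.
records = [
    ("model-a", "First Name", "name 1", -0.40),
    ("model-a", "First Name", "name 2", -0.20),
    ("model-b", "First Name", "name 1", -0.10),
    ("model-b", "First Name", "name 2", +0.10),
    ("model-a", "School", "school 1", +0.05),
    ("model-b", "School", "school 1", -0.05),
]


def pool_by_dimension(records):
    """Pool deltas across all models and variants within each dimension."""
    pooled = defaultdict(list)
    for _model, dimension, _variant, delta in records:
        pooled[dimension].append(delta)

    table = {
        dim: {
            "bias_index": mean([abs(d) for d in deltas]),  # mean |Δ|
            "mean_signed": mean(deltas),
            "cells": len(deltas),
        }
        for dim, deltas in pooled.items()
    }
    # Most reactive dimension first.
    ordered = sorted(table.items(), key=lambda kv: kv[1]["bias_index"], reverse=True)
    return dict(ordered)


by_dimension = pool_by_dimension(records)
for dim, stats in by_dimension.items():
    print(dim, round(stats["bias_index"], 3), stats["cells"])
```

If a cell is one (model, variant) pair, as assumed here, the Cells column is consistent with that reading: the eight counts sum to 290, which equals 29 variants times 10 models, and First Name's 60 cells would correspond to 6 name variants.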

Résumé variants tested: 30 (baseline + 29 variants)
Models evaluated: 10 (Claude Haiku, Claude Opus, Claude Sonnet, Gemini 2.5 Flash, Gemini 2.5 Pro, Gemini 3.1 Pro (Preview), Llama 4 Maverick, Mistral Large, Mistral Small, Qwen 3 Next 80B)
Job descriptions: 17 (from junior fullstack to CTO)
Inference runs collected: 25,500 of 25,500 planned (100.0%)
API spend so far: $422.10 (OpenAI/Anthropic/Google/Alibaba/Meta APIs)
Bias dimensions: 8 (Address Country, Anonymize, Career Gap, Company Locations, Company Names, First Name, Graduation Year, School)
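
These totals can be cross-checked with a little arithmetic. The sketch below assumes the design is a full cross of résumé variants, models, and job descriptions, which the figures above are consistent with but do not state outright; the variable names are illustrative.

```python
# Figures reported above.
variants = 30          # baseline + 29 variants
models = 10
jobs = 17
runs_collected = 25_500
spend_usd = 422.10

# Assumption: every variant is scored by every model for every job.
combinations = variants * models * jobs
runs_per_combination = runs_collected / combinations
cost_per_run = spend_usd / runs_collected

print(combinations)             # 5100
print(runs_per_combination)     # 5.0
print(round(cost_per_run, 4))   # 0.0166
```

Under that assumption the run count implies five repeated runs per variant, model, and job combination, at an average of roughly 1.7 US cents per run.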