Methodology, Bias Research

DESIGN

For each (axis, level, model, job description) cell we run the same prompt several times and record the response. For the injection axes, the only thing that varies is one demographic signal on the résumé. The rest of the document is byte-identical to the baseline.

TWO ARMS · PROBE AND MITIGATION

Injection probes swap a single signal (name, country, school, employer, and so on) into an otherwise identical résumé and ask whether the verdict moves. The anonymization arm runs the opposite test: it removes identifying and prestige signals (the name, contact details, employers, schools, locations and dates are replaced with neutral placeholders) and asks whether blinding the résumé reduces the bias the probes expose.

The logic is symmetric: if a candidate's qualifications are unchanged and the score still moves when a signal is hidden, the model was relying on that signal. anonymize_name blinds only identity (name, contact, personal links); anonymize_all additionally blinds employers, schools, locations and dates.

INFERENCE SETTINGS

Temperature: 0.7 for every model reached over an API (OpenAI-compatible, Google Gemini, Vertex AI for Llama and Qwen, Groq, Mistral). No other sampling parameters (top-p, top-k, seed) were set; provider defaults apply. Each cell was sampled 5 times and the responses aggregated.

Caveat: Claude is not strictly comparable. claude-opus was invoked through the Claude CLI rather than the API, and the CLI call sets no explicit temperature, so Claude ran at the CLI's own default sampling rather than at 0.7. Treat cross-model comparisons involving Claude with that asymmetry in mind.

Why this matters for significance. 0.7 is a relatively high temperature, so run-to-run variance is substantial. With only 5 runs per cell the noise floor is high, which is why most per-cell deltas do not clear the 95% confidence threshold against baseline. A future run at lower temperature, or with more samples per cell, would tighten the confidence intervals.

MODELS & RESOLVED VERSIONS

Most slots name their version directly (gemini-2.5-pro, llama-4-maverick, qwen-3-next-80b). The Anthropic and Mistral slots were invoked through floating aliases instead, the Claude CLI tier names and Mistral's -latest tag, so the table below records the concrete snapshot each alias resolved to.

Model	Invoked as	Resolved version
Claude Opus	`opus (Claude CLI)`	`claude-opus-4-7`
Claude Sonnet	`sonnet (Claude CLI)`	`claude-sonnet-4-6`
Claude Haiku	`haiku (Claude CLI)`	`claude-haiku-4-5-20251001`
Claude Fable 5	`claude-fable-5 (Claude CLI)`	`claude-fable-5`
Mistral Large	`mistral-large-latest`	`mistral-large-2512`
Mistral Small	`mistral-small-latest`	`mistral-small-2603`

Versions were probed live on 2026-05-29; the run set was collected ~2026-05-20. These snapshots were the current production versions across that window. Because the floating aliases were not pinned at collection time, a re-run after a provider promotes a new snapshot could resolve differently.

AUDIT METHODOLOGY · HOW VERDICTS ARE PRODUCED

Each (variant, model, JD) cell with 5 collected runs is audited by gemini-2.5-pro acting as an LLM-as-judge. Cells with fewer than 5 runs are skipped until backfill completes, so every verdict is produced against a complete sample.

Two samples per cell, two verdicts. The auditor sees the cell's mean Δ and run count as statistical context, then judges two matched evaluation pairs from the run set: (1) the first run and (2) the run whose score sits closest to the cell's mean (the "most-typical" run). The site shows the median-run verdict as the headline; the first-run verdict is kept as a second opinion. A verdicts_agree flag marks cells where the two samples reached different conclusions, those are the cases where a single-pair audit would have been brittle.

What an audit verdict means and does not mean. A verdict is a judgement on the reasoning visible in one evaluation pair, not on the per-cell statistical effect. A "bias" verdict says the model's justification keyed off the demographic signal in the sample shown to the auditor; it does not by itself certify that the mean Δ over 5 runs would clear a 95% significance threshold. Read verdicts alongside the volcano plot and per-cell CIs.

WHY THIS AUDIT EXISTS

Aggregate score deltas tell you that a model shifted its verdict when a demographic signal changed. They do not tell you why. A 0.5-point drop on a candidate from Lagos could be the model penalizing the location, or it could be the model legitimately picking up on a different concern that happened to surface in that pair. The audit reads the model's own justification and decides which of those is happening, a kind of post-hoc explainability layer for the counterfactual signal.

The verdict triple, justified / bias / mixed, plus the verbatim bias_signals quotes give a human reader something to grep for and verify directly against the model's own words. That is the artifact a reader can argue with: not an opaque score, but a quotation.

JUDGE SELECTION · COSTS AND TRADEOFFS

We considered five candidate judges before settling on gemini-2.5-pro. Costs below assume a complete corpus (~4,930 variant cells × ~1.8 prompts/cell × ~1.2k input / ~200 output tokens per prompt).

Candidate	Est. cost	Quality trade-off
`gemini-3.1-flash-lite` (batch)	~$3	Cheapest. Risk of false negatives on subtle bias, under-calling justified what a stronger model would flag.
`gemini-2.5-flash`	~$8	Acceptable on clear cases. Same false-negative concern as Lite, smaller magnitude.
`gemini-2.5-pro` (chosen)	~$31	Strong nuanced reasoning. Reliable structured-JSON output. Best quality-for-money on this task and not one of the models being audited.
`gemini-3.1-pro-preview`	~$53	Highest quality but preview-tier (rate-limit and price churn risk).
`claude-opus`	~$294*	High quality but expensive at the API tier. Also one of the audited models, risk of self-judging.

* API-equivalent. The pilot audits were run via the Claude CLI subscription where token spend doesn't appear on an API invoice.

Why not a cheaper judge. Under-calling bias (false negatives) is the more damaging failure mode for an audit whose purpose is to surface bias. The Lite/Flash tiers historically trade reasoning depth for cost; on a binary "is this reasoning biased" task with subtle linguistic cues, that trade hurts the audit's headline claim more than it saves on the bill.

Why not cross-judge validation. A defensible alternative is running two judges per cell and treating disagreement as a third signal. We opted instead for the two-sample design (first-run + median-typical run, same judge) because it isolates a different error source, sample selection, that the current single-pair audit was most exposed to. Cross-judge can be layered on later without re-running collection.

Self-judging caveat. Every Gemini variant, including the chosen judge, is itself in the audited set, so the audit is asking gemini-2.5-pro to render verdicts on outputs from gemini-2.5-pro, gemini-2.5-flash, and gemini-3.1-pro-preview among others. Models are known to favour their own family's outputs in head-to-head judging; the structured rubric and verbatim bias_signals quotes blunt this but do not eliminate it. A fully external judge (e.g. a frontier OpenAI model not in this study) would close that gap at additional cost.

AUDITOR STABILITY · DOES THE JUDGE FLIP?

Every audited cell is judged twice, once on the first-run sample and once on the median-typical sample. When the two samples are different evaluations, the auditor can in principle disagree with itself. This stat measures how often it does.

Disagreement rate (when two distinct sampled pairs were judged): 46.73% (1022 of 2187 cells).

How to read this: each cell counts (variant × model × JD) pairs where the judge said X about the first sampled run and Y about the median-typical run. ● green = same verdict (judge agreed with itself); ● red = different verdict (judge flipped).

	JUDGE'S VERDICT ON THE MEDIAN-TYPICAL SAMPLE →
JUDGE'S VERDICT ON THE FIRST SAMPLE ↓	bias	justified	mixed	row total
bias	729	650	9	1388
justified	338	436	11	785
mixed	6	8	0	14
col total	1073	1094	20	2187

Diagonal cells (green) are where the auditor agreed with itself; off-diagonal cells (red) are flips. The largest flip cell is bias → justified at 650 cases, the judge said the first sample looked like bias but later judged the median sample as justified. The reverse direction, justified → bias, is only 338 cases. That 1.92× asymmetry matters: it means a single-sample audit on the first run would systematically over-call bias compared to one that picks a more representative sample.

Of 5,393 total audited cells, 3,206 had identical first-and-median samples (a single pair selected twice, no disagreement possible by construction) and were excluded from the denominator. The remaining 2,187 cells had two genuinely different sampled pairs from the same cell, and on that set, the auditor returned a different verdict 46.73% of the time. This is the empirical reason we aggregate over five runs per cell rather than relying on a single judgement: at temperature 0.7, even an LLM judge faced with two different samples of the same bias case will not always agree with itself.

BIAS DIMENSIONS

Dimension	Variants tested
Address Country	San Francisco, USA · Lagos, Nigeria · Bangalore, India · São Paulo, Brazil · Bucharest, Romania
Anonymize	Name blind · Fully blinded
Career Gap	Unexplained · Caregiving
Company Locations	United States · India · Brazil · Kenya
Company Names	FAANG (Google/Meta/Amazon) · Mid-tier (Stripe/Shopify/Datadog) · Unknown regional · Non-western (Naver/Tencent/MercadoLibre)
First Name	James Smith · Sarah Smith · Mohammed Al-Said · Aisha Okonkwo · Wei Chen · Maria Rodriguez
Graduation Year	2005 · 1998
School	Massachusetts Institute of Technology, Cambridge · ETH Zürich, Zürich · Indian Institute of Technology Bombay, Mumbai · Northern State University, Aberdeen

JOB DESCRIPTIONS · 17 TOTAL

C++ Developer — Capital Markets / Fixed Income
Chief Technology Officer — Agentic AI / Fintech
Head of Development / Tech Lead
Junior / Mid-Level Fullstack Developer
Junior Java Developer
Principal Engineer — Specialized Software (AI Strategy)
Principal Performance Architect
Principal Software Engineer — Growth / Experimentation Systems
Software Development Engineer — Security Features (Embedded Networking)
Senior Fullstack Developer
Senior Software Engineer — Enterprise IDE / JVM Tooling
Senior Manager — Digital Solutions (C++ Developer)
Senior Software Development Engineer — Routing Tiling
Staff Forward Deployed Engineer, GenAI, Cloud
Staff Software Engineer
Software Engineering Manager — Engineering Productivity
Technical Lead — Cloud Compute Platform