Darren Head

Applied AI, UX Lead at Google

GenUX/UI · LLM Evals · Autoraters

Autorater rubric

LLM-as-judge

A battle-tested system-prompt structure and scoring rubric for governed LLM-as-judge evaluations. SxS + SSE methodologies, calibration loop, anti-patterns.

Install

$ npx skills add darrenhead/skills --skill autorater-rubric

Autorater rubric

A field manual for designing and calibrating LLM-as-judge systems for subjective quality. Distilled from production work on autoraters for UX and product evaluation. Use it when you need a judge model whose scores are defensible against human experts — not a vibes-based "ask Claude if this is good" hack.

When to use this skill

  • You need to evaluate model outputs, UI designs, agent trajectories, or prose at scale and a human can't review every example.
  • You want to compare two variants (model A vs model B, design v1 vs v2, prompt v3 vs v4) on a quality dimension that doesn't have a ground-truth answer.
  • You have an existing LLM-as-judge that "kind of works" but you don't actually know whether its scores match what a human expert would give.
  • You're building an eval harness and trying to decide between SxS, single-stimulus, or absolute scoring.
  • You've heard "use LLM-as-judge" in a talk and need to ship something defensible.

When NOT to use this skill

  • Objective tasks with ground truth. If there's a correct answer (string match, JSON validation, unit test pass/fail, factuality against a known source), use exact-match or a verifier, not a rubric. Rubrics are for subjective quality.
  • Single decisions where a human can just look. If you're rating five examples, rate them yourself. The overhead of a calibrated rubric only pays back at scale.
  • Safety / red-team scoring. Use specialised classifiers and human review. Don't ship a generic judge model on a high-stakes safety dimension and call it done.
  • Evaluating the same model that's judging. Self-rating is contaminated. Use a different model family, or use humans, or both.
  • Latency-critical paths. Judge calls cost tokens and time. If you need to score in <100ms, train a small classifier on judge outputs instead.

Decision: SxS vs SSE

The first methodological choice. Get this wrong and nothing downstream matters.

| Consideration | SxS (Side-by-Side) | SSE (Single-Stimulus Eval) |
| --- | --- | --- |
| Research question | "Which is better?" | "How good is this?" |
| Output | Preference signal on a centered scale | Absolute quality score on a unipolar scale |
| Variants per query | Exactly 2 | 1 |
| Cognitive load (judge) | Lower: comparison is easier than calibration | Higher: requires internal scale calibration |
| Sensitivity to small deltas | Higher: direct comparison amplifies differences | Lower: absolute scales compress small gaps |
| Score interpretability | Relative only (no standalone meaning) | Absolute (comparable across designs and time) |
| Longitudinal tracking | Hard: needs a stable reference variant | Easy: scores are self-referencing |
| Scalability with N variants | Quadratic in pairwise comparisons | Linear |
| Human-AI agreement | Generally higher (LLMs are better at comparison) | Generally lower (calibration is harder for LLMs) |
| Best for | A/B tests, model bake-offs, prompt iterations | Quality grading, longitudinal tracking, audits |

Default recommendation: if you're iterating on a single thing and want to know whether v2 beats v1, use SxS. If you're scoring a portfolio of independent artifacts or tracking quality across releases, use SSE. If you don't know yet, start with SxS — it's easier to calibrate and the data is more decisive.

Hybrid pattern: for high-stakes evals, run both. The SxS gives you the call ("ship v2"); the SSE gives you the absolute floor ("but v2 is still only a 3.2 out of 5, so don't celebrate"). The cost is roughly doubled judge tokens.
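
To make the "Scalability with N variants" row concrete, here is a minimal sketch that counts judge calls for each methodology. The function name `judge_calls` and the one-call-per-rating assumption are illustrative, not part of the skill.

```python
from math import comb


def judge_calls(n_variants: int, n_queries: int, method: str) -> int:
    """Count judge calls, assuming one call per rating (an illustrative assumption)."""
    if method == "sxs":
        # Every unordered pair of variants is compared once per query.
        return comb(n_variants, 2) * n_queries
    if method == "sse":
        # Every variant is scored once per query.
        return n_variants * n_queries
    raise ValueError(f"unknown method: {method!r}")


for n in (2, 4, 8):
    print(n, judge_calls(n, 100, "sxs"), judge_calls(n, 100, "sse"))
```

With two variants SxS is the cheaper option; the pairwise count only overtakes the linear one once you compare more than three variants exhaustively.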

The four-part rubric structure

Every production-grade rubric has four parts. Skip any of them and you'll see it in the agreement numbers.

1. Role

The judge needs to know who it is judging as. "You are a helpful evaluator" is not a role. A role specifies:

  • Identity. What kind of user, expert, or reviewer is this? (e.g., "a senior backend engineer reviewing API design", "a first-time mobile shopper", "a copy editor checking voice consistency")
  • Standards. What does "good" look like to this role? Not in the abstract — in their lived experience. A power user's "good" is different from a senior accessibility user's "good".
  • Focus areas. What does this role notice that others miss? A keyboard navigator notices tab order; a brand reviewer notices voice drift; a backend engineer notices coupling.
  • Anti-focus. What should this role not weight? Tell a security reviewer to ignore copy polish; tell a copy reviewer to ignore HTTPS. Otherwise the role bleeds into every dimension.

The role is the most under-specified part of most rubrics you'll inherit. Spend disproportionate time here. Role is identity; identity drives everything downstream.

2. Dimensions

A dimension is a single quality axis you're scoring. Each dimension needs:

  • Name. Short, concrete: "Helpfulness", "Visual Hierarchy", "Factual Accuracy", "Error Recovery". Avoid umbrella terms like "Quality" unless it's literally the meta-dimension.
  • Question. The exact prompt the judge answers, phrased to match the scale. SxS: "Which response is more helpful?" SSE: "How helpful is this response?"
  • Anchors at each scale point. Behavioural descriptions for what each score level looks like. See §"Dimension design" below — this is the load-bearing piece.
  • Contributing factors / sub-criteria. A bulleted list of what to consider. These are not a checklist for the score; they're hints to ground the judgment.

Hard rule: every dimension must be separable. If "Visual Design" and "Visual Hierarchy" always move together in your data, collapse them. Co-moving dimensions inflate your apparent dimensionality and add no signal.

Soft rule: 3-5 dimensions per evaluation. Fewer than 3 and you're not really evaluating; more than 5 and judges (human or AI) get fatigued and start anchoring later dimensions to earlier ones.

3. Scale

The scale is the response space. Three decisions:

  • Number of points. SxS: 3, 5, or 7 — always odd. SSE: 2 to 7 — even forces commitment, odd allows neutral. Defaults: 7-point centered for SxS (-3 to +3), 5-point unipolar for SSE (1 to 5).
  • Centering. SxS scales are centered on zero so both variants get equal representational weight (-3 ... 0 ... +3). SSE scales start at 1, not 0 — "1" is meaningful ("Poor"), "0" feels like missing data.
  • Direction. Higher = better, always. Don't reverse-code dimensions to "catch lazy raters". This is folklore from psychometrics that does not transfer cleanly to LLM judges and confuses your downstream analysis.

Why odd-numbered for SxS: the centered "0" means "same", which is a real and frequent finding. Forced-choice (no-tie) scales push noise into the win column and overstate effect sizes. Don't do it.

Why 5-point default for SSE: below 5, you lose the ability to express "above average but not exceptional". Above 5, agreement collapses because judges disagree about whether something is "Very Good" vs "Excellent". 5 is the sweet spot for human raters; for LLM-only evals you can sometimes go to 3.
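
The scale rules above can be encoded so a runner rejects invalid configurations early. This is a small sketch; the helper names `sxs_scale` and `sse_scale` are hypothetical, and the defaults mirror the ones stated in this section.

```python
def sxs_scale(points: int = 7) -> list[int]:
    """Centered SxS scale: odd number of points, symmetric around 0."""
    if points not in (3, 5, 7):
        raise ValueError("SxS scales must have 3, 5, or 7 points")
    half = points // 2
    return list(range(-half, half + 1))


def sse_scale(points: int = 5) -> list[int]:
    """Unipolar SSE scale: starts at 1, higher is better."""
    if not 2 <= points <= 7:
        raise ValueError("SSE scales must have 2 to 7 points")
    return list(range(1, points + 1))


print(sxs_scale())  # [-3, -2, -1, 0, 1, 2, 3]
print(sse_scale())  # [1, 2, 3, 4, 5]
```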

4. Calibration loop

This is the part that separates production rubrics from demo rubrics. A rubric without a measured agreement number against humans is folklore, not engineering. See §"Calibration loop" below.

The system-prompt template

A generalised template. Replace {PLACEHOLDERS}. The structure is the contribution; the wording is yours to write.

You are {ROLE_IDENTITY}.

# Your perspective

{ROLE_STANDARDS_AND_FOCUS_AREAS}

When evaluating, you pay special attention to:
- {FOCUS_AREA_1}
- {FOCUS_AREA_2}
- {FOCUS_AREA_3}

You explicitly do NOT weight:
- {ANTI_FOCUS_1}
- {ANTI_FOCUS_2}

# Your task

You will be shown {ONE_ARTIFACT | TWO_ARTIFACTS_SIDE_BY_SIDE}.
{EVALUATION_INSTRUCTIONS — e.g. "Compare Side A and Side B" or "Rate the single design"}.

# Dimensions to rate

For each dimension below, rate using the specified scale.

## {DIMENSION_1_NAME}
Question: {DIMENSION_1_QUESTION}

Scale ({SCALE_TYPE}: {SCALE_RANGE}):
  {SCORE_LOW} = {ANCHOR_LOW}: {ANCHOR_LOW_DESCRIPTION}
  ...
  {SCORE_MID} = {ANCHOR_MID}: {ANCHOR_MID_DESCRIPTION}
  ...
  {SCORE_HIGH} = {ANCHOR_HIGH}: {ANCHOR_HIGH_DESCRIPTION}

Consider:
  - {SUB_CRITERION_1}
  - {SUB_CRITERION_2}
  - {SUB_CRITERION_3}

## {DIMENSION_2_NAME}
[same structure]

# Evidence requirements

For every rating, return:
- score: the numeric value from the scale
- evidence: one or two sentences citing specific observable features of the
  artifact(s) that drove the score. Reference concrete details, not abstract
  qualities. If you can't cite evidence, you don't have a rating — return null.
- selected_factors: array of which sub-criteria from "Consider" actually
  drove your judgment (not all of them, just the load-bearing ones).

# Output format

Return JSON matching this schema:
{
  "dimensions": [
    {
      "name": string,
      "score": number,
      "evidence": string,
      "selected_factors": string[]
    }
  ],
  "overall_justification": string  // 2-3 sentences max
}

# Constraints

- If the artifact failed to load or is irrelevant, return {"skipped": true,
  "reason": string} instead of scores. Never invent ratings for broken inputs.
- Do not output values outside the defined scale range.
- For SxS: do not assume position carries meaning. The judge prompt is
  position-randomised; you have no way to know which side is "the new one".

Annotations:

  • Evidence is not optional. Without it, judges generate ratings from abstract priors and you have no way to debug disagreements. With it, every disagreement is a debuggable case.
  • selected_factors over a full checklist. Forcing the judge to declare which sub-criteria actually drove the score is the difference between a rationalised rating and a grounded one.
  • The skipped path. Pre-qualification matters. If the page didn't render or the model output is empty, you want a structured skip, not a hallucinated 3/5.
  • Position randomisation is a sibling requirement. The template above assumes the runner randomises which side is shown as A vs B and decodes back to canonical positions during analysis. Build it once and never run an SxS eval without it.
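
The evidence requirement, the scale constraint, and the skipped path are easier to enforce in the runner than to trust to the prompt. Below is a minimal validation sketch for the JSON shape in the template. The function name `parse_judge_output` is hypothetical, and treating scores as integer scale points is an assumption (the schema only says `number`).

```python
import json


def parse_judge_output(raw: str, scale_min: int, scale_max: int) -> dict:
    """Parse one judge response; raise ValueError when it breaks the contract."""
    data = json.loads(raw)
    if data.get("skipped") is True:
        # Structured skip: broken or irrelevant input, so no scores are expected.
        return {"skipped": True, "reason": str(data.get("reason", ""))}

    ratings = []
    for dim in data["dimensions"]:
        score = dim.get("score")
        evidence = (dim.get("evidence") or "").strip()
        if score is not None:
            # Assumption: scale points are integers (bool is excluded explicitly).
            if isinstance(score, bool) or not isinstance(score, int):
                raise ValueError(f"{dim['name']}: score must be an integer scale point")
            if not scale_min <= score <= scale_max:
                raise ValueError(f"{dim['name']}: score {score} is outside the scale")
            if not evidence:
                raise ValueError(f"{dim['name']}: score given without evidence")
        ratings.append({
            "name": dim["name"],
            "score": score,  # None means the judge declined to rate
            "evidence": evidence,
            "selected_factors": list(dim.get("selected_factors", [])),
        })
    return {"skipped": False, "ratings": ratings}
```

Rejecting a response outright (rather than clamping the score) keeps malformed ratings out of the agreement numbers.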

Dimension design

How to write a dimension that produces high agreement. Three worked examples — invented, not lifted from production.

Example dimension 1: Factual Accuracy (SSE, 5-point)

Question: "How factually accurate are the claims in this response?"

| Score | Anchor | Behavioural description |
| --- | --- | --- |
| 1 | Multiple fabrications | Contains two or more verifiably false claims, or a single false claim presented as central evidence. |
| 2 | One material error | Contains one verifiably false claim that affects the conclusion, or several minor inaccuracies. |
| 3 | Mixed | Claims are mostly correct but include one minor verifiable error or one unsupported assertion presented confidently. |
| 4 | Mostly accurate | All factual claims are verifiable; minor imprecisions in framing or emphasis only. |
| 5 | Fully accurate | Every factual claim is verifiable and correctly contextualised. No unsupported assertions. |

Consider:

  • Are specific numbers, dates, names verifiable?
  • Are causal claims supported, or just asserted?
  • Does the response distinguish what it knows from what it's inferring?

Notes on what makes this dimension work: each anchor is behavioural ("contains two or more verifiably false claims"), not evaluative ("very inaccurate"). Two different judges can disagree on aesthetics; it's much harder for them to disagree on whether a response contains two false claims.

Example dimension 2: Helpfulness (SxS, 7-point centered)

Question: "Which response better addresses what the user actually needs?"

| Score | Anchor | Behavioural description |
| --- | --- | --- |
| -3 | A much more helpful | A directly resolves the user's goal; B is off-topic, refuses unnecessarily, or answers a different question. |
| -2 | A more helpful | A resolves the goal more completely; B addresses the goal but with gaps. |
| -1 | A slightly more helpful | Both address the goal; A is more direct or actionable. |
| 0 | Same | Both responses resolve the goal equivalently, or both fail equivalently. |
| +1 | B slightly more helpful | Both address the goal; B is more direct or actionable. |
| +2 | B more helpful | B resolves the goal more completely; A addresses the goal but with gaps. |
| +3 | B much more helpful | B directly resolves the user's goal; A is off-topic, refuses unnecessarily, or answers a different question. |

Consider:

  • Does the response answer the question asked, or a nearby one?
  • Is the level of detail appropriate to the implied expertise of the user?
  • Are next steps actionable, or just informational?

Example dimension 3: Error Recovery (SSE, 5-point)

Question: "How well does this design help users recover from mistakes?"

| Score | Anchor | Behavioural description |
| --- | --- | --- |
| 1 | No recovery path | Errors are silent or terminal. No undo, no clear next step. |
| 2 | Errors visible but stuck | Errors are surfaced but the user cannot easily correct without restarting. |
| 3 | Basic recovery | Standard error messages with a path forward, but error messages are generic. |
| 4 | Clear recovery | Errors are specific, actionable, and the system suggests the likely fix. |
| 5 | Proactive prevention | Errors are prevented before they occur (validation, confirmation, undo), and when they do occur the recovery path is one click. |

Notice the pattern across all three: anchors are behaviourally observable, sub-criteria are pointers to what to look at, and the range from 1 to 5 covers genuinely different states — not five gradations of "bad" to "good".

The anchor-writing rule: if two reasonable raters could read the same anchor description and apply it differently, rewrite the anchor. The test is "could you describe what a 4 looks like to a stranger who hasn't seen the rubric?" If no, the anchor is too vague.
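
Keeping dimensions as data rather than hand-edited prompt text makes them easier to version and diff. The sketch below renders one dimension into the block shape used by the system-prompt template. The `Dimension` class and its field names are hypothetical; the anchor text is copied from the Factual Accuracy example above.

```python
from dataclasses import dataclass, field


@dataclass
class Dimension:
    name: str
    question: str
    scale_type: str
    # score -> (anchor label, behavioural description)
    anchors: dict[int, tuple[str, str]]
    consider: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render this dimension as a prompt section."""
        scores = sorted(self.anchors)
        lines = [
            f"## {self.name}",
            f"Question: {self.question}",
            "",
            f"Scale ({self.scale_type}: {scores[0]} to {scores[-1]}):",
        ]
        for score in scores:
            label, description = self.anchors[score]
            lines.append(f"  {score} = {label}: {description}")
        if self.consider:
            lines += ["", "Consider:"] + [f"  - {item}" for item in self.consider]
        return "\n".join(lines)


factual_accuracy = Dimension(
    name="Factual Accuracy",
    question="How factually accurate are the claims in this response?",
    scale_type="unipolar",
    anchors={
        1: ("Multiple fabrications", "Contains two or more verifiably false claims, or a single false claim presented as central evidence."),
        2: ("One material error", "Contains one verifiably false claim that affects the conclusion, or several minor inaccuracies."),
        3: ("Mixed", "Claims are mostly correct but include one minor verifiable error or one unsupported assertion presented confidently."),
        4: ("Mostly accurate", "All factual claims are verifiable; minor imprecisions in framing or emphasis only."),
        5: ("Fully accurate", "Every factual claim is verifiable and correctly contextualised. No unsupported assertions."),
    },
    consider=[
        "Are specific numbers, dates, names verifiable?",
        "Are causal claims supported, or just asserted?",
        "Does the response distinguish what it knows from what it's inferring?",
    ],
)
print(factual_accuracy.render())
```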

Calibration loop

A rubric without calibration is folklore. Here's the loop, step by step.

  1. Collect a calibration set. 20-50 examples that span the quality range. If you only sample examples near "good", you'll calibrate the judge on a narrow band and it'll fall apart at the tails. Deliberately include known-bad and known-borderline cases.

  2. Human-label the calibration set first. Use 2-3 raters per example. If your humans don't agree with each other (inter-human α below 0.6), the dimension is broken — fix the rubric before you even touch the LLM. The judge cannot exceed the ceiling set by your human ground truth.

  3. Run the autorater on the same set. Same artifacts, same dimensions, same scale. Capture the evidence string for every rating — you'll need it.

  4. Compute agreement. Use Cohen's κ (weighted, quadratic) for ordinal scales with two raters (human-vs-AI). Use Krippendorff's α when you have ratings from multiple humans and want a single agreement number robust to missing data. Don't use raw percent agreement — it doesn't correct for chance and overstates agreement on skewed distributions.

  5. Interpret the kappa. Standard scale (Landis & Koch): below 0 poor; 0.00-0.20 slight; 0.21-0.40 fair; 0.41-0.60 moderate; 0.61-0.80 substantial; 0.81-1.00 almost perfect. Don't ship below 0.7 for any dimension that drives a real decision. Below 0.4 your judge is essentially noise.

  6. Inspect disagreements. Sort by absolute delta. Read the top 10. For each one, ask: did the judge miss something? Or did the human miss something? Or is the rubric ambiguous? The answer is almost always "the rubric is ambiguous", because if it weren't ambiguous the judge would have got it right.

  7. Revise the rubric, not the judge. Tighten anchor language. Add a sub-criterion. Move a behaviour from "consider" into an anchor. Critically: do not edit the role to chase the disagreement — that overfits to your calibration set.

  8. Re-run and re-measure. Compute the new kappa. If it went up, keep the change. If it went down or stayed flat, revert and try a different tightening. This is a hill-climb; track every version.

  9. Freeze the rubric when κ ≥ 0.7. Stop tuning. Commit the rubric to version control. Tag it with the kappa it achieved and the calibration set size. Every future score is now defensible: "v1.4, κ=0.73 against 40 human-labelled examples".

  10. Re-measure on drift. When you change the judge model (3-Pro → 4-Sonnet), the rubric, or the domain (new product surface), re-run calibration. Don't assume kappas transfer. They often don't.
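
For the agreement step, quadratic-weighted Cohen's κ is short enough to compute without extra dependencies. This is a minimal pure-Python sketch; the function name and signature are illustrative, and a maintained library implementation is a reasonable substitute if one is already in your stack.

```python
from collections import Counter


def quadratic_weighted_kappa(a: list[int], b: list[int], min_score: int, max_score: int) -> float:
    """Cohen's kappa with quadratic weights for two raters on an ordinal scale."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equally long, non-empty rating lists")
    span = max_score - min_score  # largest possible distance on the scale
    n = len(a)

    def weight(i: int, j: int) -> float:
        # Disagreement weight grows with the squared distance between scores.
        return ((i - j) ** 2) / (span ** 2)

    observed = sum(weight(x, y) for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Expected disagreement if the two raters' marginals were independent.
    expected = sum(
        weight(i, j) * count_a[i] * count_b[j] for i in count_a for j in count_b
    ) / (n * n)
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one identical score")
    return 1 - observed / expected


human = [1, 2, 3, 4, 5]
judge = [2, 2, 3, 4, 4]
print(round(quadratic_weighted_kappa(human, judge, 1, 5), 3))  # 0.857
```

Quadratic weights penalise a 1-vs-5 miss far more than a 4-vs-5 miss, which is why they suit ordinal scales better than unweighted κ.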

Cost reality: calibrating one dimension to κ≥0.7 typically takes 3-6 iterations and 40-100 human ratings. Budget for it. The cost of not calibrating is silently making decisions on noise — which is worse, just less visible.

Anti-patterns

Things you will see in the wild and should refuse to ship.

  • Even-numbered scales for genuinely-equivalent comparisons. A 4-point or 6-point scale on a SxS task forces the judge to declare a winner when the right answer is "same". You will read this as signal; it is noise.
  • Vague dimensions like "Quality" or "Goodness". These conflate everything. The judge averages across implicit sub-dimensions in unstable ways. Decompose into 3-5 named axes.
  • No anchors, just scale labels. "1=Poor, 5=Excellent" tells the judge nothing. Two judges (or the same judge across two calls) will use different internal scales. Anchors are the calibration mechanism.
  • Shipping without a measured kappa. "It looked right when we spot-checked" is not a calibration. You don't know whether the judge agrees with humans because you never measured it. Don't.
  • Calibrating on the same data you ship on. Hold out a fresh test set. A judge that hits κ=0.85 on its training examples and κ=0.4 on held-out is overfit, which is hard to spot if you never test it.
  • Letting the judge see which variant is "the new one". Position labels, model names in metadata, response length differences that correlate with model — all leak. Strip identifying signals and randomise position.
  • Self-judging. Using GPT-4 to judge GPT-4 inflates agreement with itself, not with humans. Use a different model family, or accept that you're measuring consistency, not quality.
  • Single-rater human ground truth. One human is one opinion. You can't measure your judge against "the human view" with n=1. Use at least 2 humans per example, ideally 3.
  • Adding dimensions to "be comprehensive". Every dimension you add costs tokens, increases judge fatigue (yes, this happens), and adds an axis of disagreement. If you can't articulate which decision a dimension drives, cut it.
  • Treating disagreement as a judge failure. When humans and the judge disagree, the rubric is usually at fault — the dimension is ambiguous, the anchors overlap, the role is under-specified. Fix the rubric, not the judge.
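
The held-out anti-pattern is cheap to avoid if the split is made once, deterministically, before any tuning starts. A sketch, assuming examples are identified by sortable IDs; the function name, the 30% holdout fraction, and the seed are illustrative choices.

```python
import random


def split_calibration(example_ids, holdout_fraction: float = 0.3, seed: int = 0):
    """Return (tuning_ids, heldout_ids) with a reproducible shuffle."""
    ids = sorted(example_ids)  # sort first so the split does not depend on input order
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_holdout = max(1, round(len(ids) * holdout_fraction))
    return ids[n_holdout:], ids[:n_holdout]


tuning, heldout = split_calibration(range(30))
print(len(tuning), len(heldout))  # 21 9
```

Tune the rubric against the tuning IDs only, and report the κ measured on the held-out IDs.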

Worked example: scoring two chatbot answers side-by-side on factual accuracy

End-to-end. Invented scenario. You're comparing two chatbot variants (A and B) on answers to user questions about a public API. You want to know whether B's new RAG pipeline gives more factually accurate answers than A's baseline.

Step 1 — Choose evaluation type. Comparative question ("is B better than A?") → SxS. 7-point centered scale.

Step 2 — Define the role.

You are a senior developer relations engineer who has worked with this API
for three years. You can verify claims against the public documentation
at docs.example.com. You evaluate answers strictly on whether claims are
verifiable, not on tone or politeness. You do not weight response length
or formatting — only correctness.

Step 3 — Define the dimension. One dimension only: Factual Accuracy. Don't pad with "Helpfulness" or "Tone" unless those are actually decision-driving.

Question: "Which response is more factually accurate about the API?"

Scale: -3 to +3
  -3 = A much more accurate: B contains 2+ false claims; A is fully correct.
  -2 = A more accurate: B contains 1 false claim affecting the answer; A is correct.
  -1 = A slightly more accurate: A and B both mostly correct; B has a minor imprecision.
   0 = Same: Both equivalently accurate, or both equivalently wrong.
  +1 = B slightly more accurate: Both mostly correct; A has a minor imprecision.
  +2 = B more accurate: A contains 1 false claim affecting the answer; B is correct.
  +3 = B much more accurate: A contains 2+ false claims; B is fully correct.

Consider:
  - Specific endpoint names, parameter names, return types
  - Version-specific behaviour (deprecated features cited as current)
  - Unsupported claims about rate limits, auth, or pricing

Step 4 — Build the calibration set. Pull 30 real user questions from your support logs that hit the API surface. Generate A and B answers for each. Position-randomise (so judge doesn't always see baseline on the left).
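
Position randomisation for this step can be sketched as a per-pair coin flip plus a decoder that maps the judge's score back to canonical orientation (negative favours the baseline A, positive favours the new variant B). The function names and the seeding scheme are assumptions for illustration.

```python
import random


def randomise_positions(pair_id: str, baseline: str, candidate: str, seed: int = 0) -> dict:
    """Decide which answer the judge sees as Side A, reproducibly per pair."""
    rng = random.Random(f"{seed}:{pair_id}")  # string seeds are accepted by random.Random
    flipped = rng.random() < 0.5
    side_a, side_b = (candidate, baseline) if flipped else (baseline, candidate)
    return {"pair_id": pair_id, "side_a": side_a, "side_b": side_b, "flipped": flipped}


def decode_score(shown_score: int, flipped: bool) -> int:
    """Map a score on the shown layout back to canonical (baseline = A) orientation."""
    return -shown_score if flipped else shown_score


shown = randomise_positions("q-017", baseline="answer from A", candidate="answer from B")
canonical = decode_score(2, shown["flipped"])
```

Storing `flipped` alongside each judge response is what lets the analysis decode later; the judge itself never sees it.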

Step 5 — Human-label. Have 3 DevRel engineers independently rate each pair. Compute inter-human Krippendorff's α: you get 0.72. Above 0.6 — good, the dimension is well-defined enough that humans agree.
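
The inter-human α in this step can be computed directly from the rating table. The sketch below implements Krippendorff's α with the interval (squared-difference) distance, which is a simplifying assumption: the ordinal variant uses a different distance function and will give somewhat different values. The function name is illustrative, and `None` marks a missing rating.

```python
from itertools import permutations
from typing import Optional


def krippendorff_alpha_interval(units: list[list[Optional[float]]]) -> float:
    """Krippendorff's alpha, interval metric. Each unit is one example's ratings."""
    # Only units with at least two ratings contribute pairable values.
    pairable = [[v for v in unit if v is not None] for unit in units]
    pairable = [unit for unit in pairable if len(unit) >= 2]
    values = [v for unit in pairable for v in unit]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one unit with two or more ratings")

    # Observed disagreement: squared differences within each unit.
    d_observed = sum(
        sum((x - y) ** 2 for x, y in permutations(unit, 2)) / (len(unit) - 1)
        for unit in pairable
    ) / n
    # Expected disagreement: squared differences across all pairable values.
    d_expected = sum((x - y) ** 2 for x, y in permutations(values, 2)) / (n * (n - 1))
    if d_expected == 0:
        raise ValueError("alpha is undefined when every rating is identical")
    return 1 - d_observed / d_expected


ratings = [[1, 1, 1], [2, 2, None], [3, 3, 3]]
print(krippendorff_alpha_interval(ratings))  # 1.0
```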

Step 6 — Run the autorater. Same 30 pairs, same rubric, same scale. Capture evidence strings.

Step 7 — Compute agreement. Cohen's quadratic κ between the autorater and the median human rating: 0.51. Moderate, not good enough.

Step 8 — Inspect disagreements. Read the top 10. Pattern: the judge is overweighting confident tone as a proxy for accuracy. When B's answer sounds more authoritative, it gets +1 to +2 even when A is actually more correct. The role didn't explicitly say "ignore tone".
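
Pulling the top disagreements for reading is a few lines once judge scores and human ratings sit in one table. A sketch, assuming each row is a dict with `pair_id`, `human_scores`, `judge`, and `evidence` keys (hypothetical field names) and that the human reference is the median rating, as in Step 7.

```python
from statistics import median


def top_disagreements(rows: list[dict], k: int = 10) -> list[dict]:
    """Return the k rows with the largest |judge - median human| gap, largest first."""
    scored = []
    for row in rows:
        human = median(row["human_scores"])
        scored.append({**row, "human_median": human, "delta": abs(row["judge"] - human)})
    return sorted(scored, key=lambda r: r["delta"], reverse=True)[:k]


# Invented rows for illustration.
rows = [
    {"pair_id": "q-01", "human_scores": [-1, -2, -1], "judge": 2, "evidence": "B states the limit definitively."},
    {"pair_id": "q-02", "human_scores": [1, 1, 2], "judge": 1, "evidence": "B names the correct endpoint."},
    {"pair_id": "q-03", "human_scores": [0, 0, -1], "judge": 1, "evidence": "B sounds more certain."},
]
for row in top_disagreements(rows, k=2):
    print(row["pair_id"], row["delta"], row["evidence"])
```

Reading the evidence string next to the delta is usually what surfaces a pattern such as the tone bias described here.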

Step 9 — Revise. Tighten the role:

[...] You do not weight response length, formatting, OR confidence of tone.
A confidently-stated falsehood is worse than a hedged truth. Specifically
ignore phrases like "definitely", "always", "never" — verify the claim
underneath the rhetoric.

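Re-run on the same 30 pairs. New κ: 0.74. Substantial. Because this number was measured on the pairs you tuned against, confirm it on a held-out set (see Anti-patterns: calibrating on the same data you ship on), then ship.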

Step 10 — Freeze and version. Commit rubric-factual-accuracy-v1.2.md to your eval repo. Tag: κ=0.74, n=30, judge=claude-sonnet-4.7, human raters=3, date=2026-05. Now every score this rubric produces in production is traceable to that calibration.

Step 11 — Run on production. Score 500 real user questions with A and B. You see B wins 62% of decisive comparisons with 95% CI [56%, 68%] — confidence interval doesn't cross 50%, so B is a real win, not noise. Ship the new RAG pipeline.
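
The interval in this step can be reproduced with a Wilson score interval over decisive (non-tie) comparisons. The counts below are hypothetical: the scenario gives only the percentages, and 155 wins out of 250 decisive comparisons is one combination consistent with them.

```python
from math import sqrt


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate; z=1.96 gives roughly 95% coverage."""
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts: ties (score 0) are excluded before computing the win rate.
low, high = wilson_interval(wins=155, n=250)
print(f"{155 / 250:.0%} win rate, 95% CI [{low:.0%}, {high:.0%}]")
```

The decision rule is the one stated above: if the lower bound stays above 50%, the win is unlikely to be noise.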

Notice what made this work: one well-specified dimension beats five vague ones; the role does the heavy lifting; disagreements diagnose the rubric, not the judge; the kappa number makes the result defensible to anyone who asks.

Further reading

  • darrenhead.com — case studies and the chat-first portfolio.
  • Anthropic, Building effective evaluations — the LLM-as-judge guidance backing many of the patterns here.
  • OpenAI evals — open-source eval harness with worked examples of judge-based graders.
  • Krippendorff, K. (2018). Content Analysis: An Introduction to Its Methodology — the reference for α and why it beats percent agreement.
  • Cohen, J. (1968). "Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit" — the original weighted-kappa paper. Quadratic weighting is the right default for ordinal scales.
  • Landis & Koch (1977). "The measurement of observer agreement for categorical data". Origin of the κ interpretation bands (slight, fair, moderate, substantial, almost perfect) used in the calibration loop.

Repository: darrenhead/skills
Source: SKILL.md
First published: May 23, 2026

Related skills

| # | Skill | Tags |
| --- | --- | --- |
| 1 | SaaS starter (saas-starter) | SaaS, Next.js, Supabase, Starter |
| 2 | Generative UX patterns (genux-patterns) | GenUX, AI SDK, Components, UX |
| 3 | Multimodal structured extraction (multimodal-structured-extraction) | Multimodal, Gemini, Zod, Extraction |
| 4 | Persona-aware disclosure (persona-aware-disclosure) | Prompting, UX, Adaptive, System prompt |
| 5 | Reddit pain mining (reddit-pain-mining) | Product discovery, Reddit, Validation, Research |