Applied AI, UX Lead at Google
GenUX/UI · LLM Evals · Autoraters
A battle-tested system-prompt structure and scoring rubric for governed LLM-as-judge evaluations. SxS + SSE methodologies, calibration loop, anti-patterns.
$ npx skills add darrenhead/skills --skill autorater-rubric

A field manual for designing and calibrating LLM-as-judge systems for subjective quality. Distilled from production work on autoraters for UX and product evaluation. Use it when you need a judge model whose scores are defensible against human experts, not a vibes-based "ask Claude if this is good" hack.
The first methodological choice. Get this wrong and nothing downstream matters.
| Consideration | SxS (Side-by-Side) | SSE (Single-Stimulus Eval) |
|---|---|---|
| Research question | "Which is better?" | "How good is this?" |
| Output | Preference signal on a centered scale | Absolute quality score on a unipolar scale |
| Variants per query | Exactly 2 | 1 |
| Cognitive load (judge) | Lower — comparison is easier than calibration | Higher — requires internal scale calibration |
| Sensitivity to small deltas | Higher — direct comparison amplifies differences | Lower — absolute scales compress small gaps |
| Score interpretability | Relative only (no standalone meaning) | Absolute (comparable across designs and time) |
| Longitudinal tracking | Hard — needs a stable reference variant | Easy — scores are self-referencing |
| Scalability with N variants | Quadratic in pairwise comparisons | Linear |
| Human-AI agreement | Generally higher (LLMs are better at comparison) | Generally lower (calibration is harder for LLMs) |
| Best for | A/B tests, model bake-offs, prompt iterations | Quality grading, longitudinal tracking, audits |
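To make the scalability row concrete, here is a minimal sketch of how many judge calls each method needs for N variants of one query. It assumes every unordered pair is compared once in SxS and every variant is scored once in SSE; the function names are illustrative.

```python
from itertools import combinations

def sxs_judge_calls(n_variants: int) -> int:
    """Unordered pairs among n variants: n * (n - 1) / 2."""
    return n_variants * (n_variants - 1) // 2

def sse_judge_calls(n_variants: int) -> int:
    """One absolute rating per variant."""
    return n_variants

def sxs_pairs(variants):
    """Enumerate the pairs an SxS run would have to judge."""
    return list(combinations(variants, 2))

for n in (2, 5, 10):
    print(n, "variants:", sxs_judge_calls(n), "SxS calls vs", sse_judge_calls(n), "SSE calls")
```

At two variants the methods cost about the same; the gap only matters once you compare many variants against each other.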
Default recommendation: if you're iterating on a single thing and want to know whether v2 beats v1, use SxS. If you're scoring a portfolio of independent artifacts or tracking quality across releases, use SSE. If you don't know yet, start with SxS — it's easier to calibrate and the data is more decisive.
Hybrid pattern: for high-stakes evals, run both. The SxS gives you the call ("ship v2"); the SSE gives you the absolute floor ("but v2 is still only a 3.2 out of 5, so don't celebrate"). The cost is roughly doubled judge tokens.
Every production-grade rubric has four parts. Skip any of them and you'll see it in the agreement numbers.
The judge needs to know who it is judging as. "You are a helpful evaluator" is not a role. A role specifies:
The role is the most under-specified part of most rubrics you'll inherit. Spend disproportionate time here. Role is identity; identity drives everything downstream.
A dimension is a single quality axis you're scoring. Each dimension needs:
Hard rule: every dimension must be separable. If "Visual Design" and "Visual Hierarchy" always move together in your data, collapse them. Co-moving dimensions inflate your apparent dimensionality and add no signal.
Soft rule: 3-5 dimensions per evaluation. Fewer than 3 and you're not really evaluating; more than 5 and judges (human or AI) get fatigued and start anchoring later dimensions to earlier ones.
The scale is the response space. Three decisions:
- 7-point centered for SxS (-3 to +3), 5-point unipolar for SSE (1 to 5).
- SxS scales are centered on 0 (-3 ... 0 ... +3). SSE scales start at 1, not 0: "1" is meaningful ("Poor"), "0" feels like missing data.

Why odd-numbered for SxS: the centered "0" means "same", which is a real and frequent finding. Forced-choice (no-tie) scales push noise into the win column and overstate effect sizes. Don't do it.
Why 5-point default for SSE: below 5, you lose the ability to express "above average but not exceptional". Above 5, agreement collapses because judges disagree about whether something is "Very Good" vs "Excellent". 5 is the sweet spot for human raters; for LLM-only evals you can sometimes go to 3.
This is the part that separates production rubrics from demo rubrics. A rubric without a measured agreement number against humans is folklore, not engineering. See §"Calibration loop" below.
A generalised template. Replace {PLACEHOLDERS}. The structure is the contribution; the wording is yours to write.
You are {ROLE_IDENTITY}.
# Your perspective
{ROLE_STANDARDS_AND_FOCUS_AREAS}
When evaluating, you pay special attention to:
- {FOCUS_AREA_1}
- {FOCUS_AREA_2}
- {FOCUS_AREA_3}
You explicitly do NOT weight:
- {ANTI_FOCUS_1}
- {ANTI_FOCUS_2}
# Your task
You will be shown {ONE_ARTIFACT | TWO_ARTIFACTS_SIDE_BY_SIDE}.
{EVALUATION_INSTRUCTIONS — e.g. "Compare Side A and Side B" or "Rate the single design"}.
# Dimensions to rate
For each dimension below, rate using the specified scale.
## {DIMENSION_1_NAME}
Question: {DIMENSION_1_QUESTION}
Scale ({SCALE_TYPE}: {SCALE_RANGE}):
{SCORE_LOW} = {ANCHOR_LOW}: {ANCHOR_LOW_DESCRIPTION}
...
{SCORE_MID} = {ANCHOR_MID}: {ANCHOR_MID_DESCRIPTION}
...
{SCORE_HIGH} = {ANCHOR_HIGH}: {ANCHOR_HIGH_DESCRIPTION}
Consider:
- {SUB_CRITERION_1}
- {SUB_CRITERION_2}
- {SUB_CRITERION_3}
## {DIMENSION_2_NAME}
[same structure]
# Evidence requirements
For every rating, return:
- score: the numeric value from the scale
- evidence: one or two sentences citing specific observable features of the
artifact(s) that drove the score. Reference concrete details, not abstract
qualities. If you can't cite evidence, you don't have a rating — return null.
- selected_factors: array of which sub-criteria from "Consider" actually
drove your judgment (not all of them, just the load-bearing ones).
# Output format
Return JSON matching this schema:
{
"dimensions": [
{
"name": string,
"score": number,
"evidence": string,
"selected_factors": string[]
}
],
"overall_justification": string // 2-3 sentences max
}
# Constraints
- If the artifact failed to load or is irrelevant, return {"skipped": true,
"reason": string} instead of scores. Never invent ratings for broken inputs.
- Do not output values outside the defined scale range.
- For SxS: do not assume position carries meaning. The judge prompt is
position-randomised; you have no way to know which side is "the new one".
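The constraints above are easier to enforce in code than to trust the judge on. Below is a minimal sketch of a validator for one judge response. It assumes the response is plain JSON in the shape of the schema above, and it reads "return null" as a null score; the function name and return shape are hypothetical.

```python
import json

def parse_judge_output(raw: str, scale_min: int, scale_max: int) -> dict:
    """Validate one judge response against the template's constraints.

    Returns {"status": "skipped", ...} or {"status": "ok", "dimensions": [...]}.
    Raises ValueError when the response breaks the output contract.
    """
    data = json.loads(raw)

    # Structured skip: broken or irrelevant input, no scores expected.
    if data.get("skipped") is True:
        return {"status": "skipped", "reason": data.get("reason", "")}

    checked = []
    for dim in data["dimensions"]:
        score = dim["score"]
        if score is None:
            # Assumed reading of the template: no evidence means no rating.
            checked.append(dim)
            continue
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise ValueError(f"{dim['name']}: score is not a number")
        if not scale_min <= score <= scale_max:
            raise ValueError(f"{dim['name']}: score {score} is outside the scale")
        if not (dim.get("evidence") or "").strip():
            raise ValueError(f"{dim['name']}: score given without evidence")
        checked.append(dim)
    return {"status": "ok", "dimensions": checked}
```

Rejecting a response outright, rather than clamping an out-of-range score, keeps bad judge outputs visible in your logs.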
Annotations:
- selected_factors over a full checklist. Forcing the judge to declare which sub-criteria actually drove the score is the difference between a rationalised rating and a grounded one.
- skipped path. Pre-qualification matters. If the page didn't render or the model output is empty, you want a structured skip, not a hallucinated 3/5.

How to write a dimension that produces high agreement. Three worked examples, invented rather than lifted from production.
Question: "How factually accurate are the claims in this response?"
| Score | Anchor | Behavioural description |
|---|---|---|
| 1 | Multiple fabrications | Contains two or more verifiably false claims, or a single false claim presented as central evidence. |
| 2 | One material error | Contains one verifiably false claim that affects the conclusion, or several minor inaccuracies. |
| 3 | Mixed | Claims are mostly correct but include one minor verifiable error or one unsupported assertion presented confidently. |
| 4 | Mostly accurate | All factual claims are verifiable; minor imprecisions in framing or emphasis only. |
| 5 | Fully accurate | Every factual claim is verifiable and correctly contextualised. No unsupported assertions. |
Consider:
Notes on what makes this dimension work: each anchor is behavioural ("contains two or more verifiably false claims"), not evaluative ("very inaccurate"). Two different judges can disagree on aesthetics; it's much harder for them to disagree on whether a response contains two false claims.
Question: "Which response better addresses what the user actually needs?"
| Score | Anchor | Behavioural description |
|---|---|---|
| -3 | A much more helpful | A directly resolves the user's goal; B is off-topic, refuses unnecessarily, or answers a different question. |
| -2 | A more helpful | A resolves the goal more completely; B addresses the goal but with gaps. |
| -1 | A slightly more helpful | Both address the goal; A is more direct or actionable. |
| 0 | Same | Both responses resolve the goal equivalently, or both fail equivalently. |
| +1 | B slightly more helpful | Both address the goal; B is more direct or actionable. |
| +2 | B more helpful | B resolves the goal more completely; A addresses the goal but with gaps. |
| +3 | B much more helpful | B directly resolves the user's goal; A is off-topic, refuses unnecessarily, or answers a different question. |
Consider:
Question: "How well does this design help users recover from mistakes?"
| Score | Anchor | Behavioural description |
|---|---|---|
| 1 | No recovery path | Errors are silent or terminal. No undo, no clear next step. |
| 2 | Errors visible but stuck | Errors are surfaced but the user cannot easily correct without restarting. |
| 3 | Basic recovery | Errors come with a path forward, but the messages are generic. |
| 4 | Clear recovery | Errors are specific, actionable, and the system suggests the likely fix. |
| 5 | Proactive prevention | Errors are prevented before they occur (validation, confirmation, undo), and when they do occur the recovery path is one click. |
Notice the pattern across all three: anchors are behaviourally observable, sub-criteria are pointers to what to look at, and each scale point describes a genuinely different state rather than one more gradation of "bad" to "good".
The anchor-writing rule: if two reasonable raters could read the same anchor description and apply it differently, rewrite the anchor. The test is "could you describe what a 4 looks like to a stranger who hasn't seen the rubric?" If no, the anchor is too vague.
A rubric without calibration is folklore. Here's the loop, step by step.
Collect a calibration set. 20-50 examples that span the quality range. If you only sample examples near "good", you'll calibrate the judge on a narrow band and it'll fall apart at the tails. Deliberately include known-bad and known-borderline cases.
Human-label the calibration set first. Use 2-3 raters per example. If your humans don't agree with each other (inter-human α below 0.6), the dimension is broken — fix the rubric before you even touch the LLM. The judge cannot exceed the ceiling set by your human ground truth.
Run the autorater on the same set. Same artifacts, same dimensions, same scale. Capture the evidence string for every rating — you'll need it.
Compute agreement. Use Cohen's κ (weighted, quadratic) for ordinal scales with two raters (human-vs-AI). Use Krippendorff's α when you have ratings from multiple humans and want a single agreement number robust to missing data. Don't use raw percent agreement — it doesn't correct for chance and overstates agreement on skewed distributions.
Interpret the kappa. Standard scale (Landis and Koch): below 0 poor; 0.00-0.20 slight; 0.21-0.40 fair; 0.41-0.60 moderate; 0.61-0.80 substantial; 0.81-1.00 almost perfect. Don't ship below 0.7 for any dimension that drives a real decision. Below 0.4 your judge is essentially noise.
Inspect disagreements. Sort by absolute delta. Read the top 10. For each one, ask: did the judge miss something? Or did the human miss something? Or is the rubric ambiguous? The answer is almost always "the rubric is ambiguous", because if it weren't ambiguous the judge would have got it right.
Revise the rubric, not the judge. Tighten anchor language. Add a sub-criterion. Move a behaviour from "consider" into an anchor. Critically: do not edit the role to chase the disagreement — that overfits to your calibration set.
Re-run and re-measure. Compute the new kappa. If it went up, keep the change. If it went down or stayed flat, revert and try a different tightening. This is a hill-climb; track every version.
Freeze the rubric when κ ≥ 0.7. Stop tuning. Commit the rubric to version control. Tag it with the kappa it achieved and the calibration set size. Every future score is now defensible: "v1.4, κ=0.73 against 40 human-labelled examples".
Re-measure on drift. When you change the judge model (3-Pro → 4-Sonnet), the rubric, or the domain (new product surface), re-run calibration. Don't assume kappas transfer. They often don't.
Cost reality: calibrating one dimension to κ≥0.7 typically takes 3-6 iterations and 40-100 human ratings. Budget for it. The cost of not calibrating is silently making decisions on noise — which is worse, just less visible.
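The revise, re-measure, and freeze steps amount to a small decision rule. Here is a minimal sketch of it, assuming you already have a kappa for the current and the candidate rubric on the same calibration set. The class and function names are hypothetical; the 0.7 threshold is the one used in this manual.

```python
from dataclasses import dataclass

SHIP_THRESHOLD = 0.7  # freeze point used in this manual

@dataclass
class RubricVersion:
    name: str
    kappa: float  # agreement against the human-labelled calibration set

def next_step(current: RubricVersion, candidate: RubricVersion):
    """Hill-climb rule: keep a revision only if agreement improved,
    and stop tuning once it clears the ship threshold."""
    if candidate.kappa <= current.kappa:
        return current, "revert"
    if candidate.kappa >= SHIP_THRESHOLD:
        return candidate, "freeze"
    return candidate, "keep iterating"

kept, action = next_step(RubricVersion("v1.1", 0.51), RubricVersion("v1.2", 0.74))
print(kept.name, action)
```

Logging every (version, kappa, action) triple gives you the audit trail that "track every version" asks for.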
Things you will see in the wild and should refuse to ship.
End-to-end. Invented scenario. You're comparing two chatbot variants (A and B) on answers to user questions about a public API. You want to know whether B's new RAG pipeline gives more factually accurate answers than A's baseline.
Step 1 — Choose evaluation type. Comparative question ("is B better than A?") → SxS. 7-point centered scale.
Step 2 — Define the role.
You are a senior developer relations engineer who has worked with this API
for three years. You can verify claims against the public documentation
at docs.example.com. You evaluate answers strictly on whether claims are
verifiable, not on tone or politeness. You do not weight response length
or formatting — only correctness.
Step 3 — Define the dimension. One dimension only: Factual Accuracy. Don't pad with "Helpfulness" or "Tone" unless those are actually decision-driving.
Question: "Which response is more factually accurate about the API?"
Scale: -3 to +3
-3 = A much more accurate: B contains 2+ false claims; A is fully correct.
-2 = A more accurate: B contains 1 false claim affecting the answer; A is correct.
-1 = A slightly more accurate: A and B both mostly correct; B has a minor imprecision.
0 = Same: Both equivalently accurate, or both equivalently wrong.
+1 = B slightly more accurate: Both mostly correct; A has a minor imprecision.
+2 = B more accurate: A contains 1 false claim affecting the answer; B is correct.
+3 = B much more accurate: A contains 2+ false claims; B is fully correct.
Consider:
- Specific endpoint names, parameter names, return types
- Version-specific behaviour (deprecated features cited as current)
- Unsupported claims about rate limits, auth, or pricing
Step 4 — Build the calibration set. Pull 30 real user questions from your support logs that hit the API surface. Generate A and B answers for each. Position-randomise so the judge doesn't always see the baseline on the left.
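Position randomisation has a bookkeeping consequence: when the sides are swapped, the judge's score has to be mapped back to the real A/B orientation before analysis. The sketch below shows one way to do that on the -3 to +3 scale. The function names are illustrative, and the seeded generator is only there to make runs reproducible.

```python
import random

def randomise_pair(answer_a: str, answer_b: str, rng: random.Random):
    """Return (first_shown, second_shown, flipped).
    flipped=True means variant B is presented in the first position."""
    flipped = rng.random() < 0.5
    if flipped:
        return answer_b, answer_a, True
    return answer_a, answer_b, False

def unflip_score(score: int, flipped: bool) -> int:
    """Map a score given on the presented order back to the true A/B scale
    (-3 = A much better, +3 = B much better). The scale is symmetric,
    so undoing a swap is a sign change."""
    return -score if flipped else score

rng = random.Random(42)  # fixed seed: an assumption for reproducibility
first, second, flipped = randomise_pair("baseline answer", "RAG answer", rng)
judge_score = 2  # hypothetical judge output for the presented order
print(flipped, unflip_score(judge_score, flipped))
```

Store the flipped flag next to every judgment; without it the scores cannot be re-oriented later.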
Step 5 — Human-label. Have 3 DevRel engineers independently rate each pair. Compute inter-human Krippendorff's α: you get 0.72. Above 0.6 — good, the dimension is well-defined enough that humans agree.
Step 6 — Run the autorater. Same 30 pairs, same rubric, same scale. Capture evidence strings.
Step 7 — Compute agreement. Cohen's quadratic κ between the autorater and the median human rating: 0.51. Moderate, not good enough.
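For reference, here is a minimal pure-Python sketch of the Step 7 computation: collapse the human ratings to a median per item, then compute quadratic-weighted Cohen's kappa against the judge. The function names and example ratings are hypothetical; in practice a statistics library you trust would do the same job.

```python
from collections import Counter
from statistics import median_low

def median_human(ratings_per_item):
    """Middle rating per item. median_low always returns an actual rating,
    which keeps the result on the scale even with an even rater count."""
    return [median_low(ratings) for ratings in ratings_per_item]

def quadratic_weighted_kappa(rater_1, rater_2, min_score, max_score):
    """Cohen's kappa with quadratic weights for two raters on an ordinal scale."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("need two equal-length, non-empty rating lists")
    if max_score <= min_score:
        raise ValueError("scale needs at least two points")
    if any(not min_score <= s <= max_score for s in (*rater_1, *rater_2)):
        raise ValueError("rating outside the scale")

    n = len(rater_1)
    span = max_score - min_score  # largest possible disagreement
    observed = Counter(zip(rater_1, rater_2))
    hist_1, hist_2 = Counter(rater_1), Counter(rater_2)

    disagreement_observed = 0.0
    disagreement_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = ((i - j) / span) ** 2
            disagreement_observed += weight * observed[(i, j)] / n
            disagreement_expected += weight * hist_1[i] * hist_2[j] / (n * n)
    if disagreement_expected == 0:
        return float("nan")  # both raters gave one identical score throughout
    return 1.0 - disagreement_observed / disagreement_expected

# Hypothetical ratings: three humans per pair, one judge score per pair.
humans = [[-2, -1, -2], [0, 0, 1], [3, 2, 3], [1, 1, 0], [-3, -3, -2]]
judge = [-1, 0, 3, 2, -3]
print(quadratic_weighted_kappa(median_human(humans), judge, -3, 3))
```

Quadratic weights penalise a 3-point miss nine times as heavily as a 1-point miss, which matches how disagreements on an ordinal scale are usually felt.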
Step 8 — Inspect disagreements. Read the top 10. Pattern: the judge is overweighting confident tone as a proxy for accuracy. When B's answer sounds more authoritative, it gets +1 to +2 even when A is actually more correct. The role excluded tone and politeness in general terms but never named confident phrasing as something to ignore.
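Pulling the top disagreements is a one-line sort once human and judge scores sit in the same records. A minimal sketch, assuming each record carries the human median, the judge score, and the judge's evidence string; the field names and sample rows are hypothetical.

```python
def top_disagreements(records, k=10):
    """Sort by absolute human-vs-judge delta, largest first, and keep the top k."""
    return sorted(
        records,
        key=lambda rec: abs(rec["judge"] - rec["human"]),
        reverse=True,
    )[:k]

# Hypothetical calibration records on the -3 to +3 SxS scale.
records = [
    {"id": "q-01", "human": -2, "judge": 1, "evidence": "B states the limit definitively."},
    {"id": "q-02", "human": 0, "judge": 0, "evidence": "Both cite the same endpoint."},
    {"id": "q-03", "human": 1, "judge": 2, "evidence": "B names the correct parameter."},
]

for rec in top_disagreements(records, k=2):
    print(rec["id"], rec["human"], rec["judge"], rec["evidence"])
```

Reading the evidence string next to each delta is what surfaces patterns like the tone bias described above.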
Step 9 — Revise. Tighten the role:
[...] You do not weight response length, formatting, OR confidence of tone.
A confidently-stated falsehood is worse than a hedged truth. Specifically
ignore phrases like "definitely", "always", "never" — verify the claim
underneath the rhetoric.
Re-run on the same 30 pairs. New κ: 0.74. Substantial. Ship.
Step 10 — Freeze and version. Commit rubric-factual-accuracy-v1.2.md to your eval repo. Tag: κ=0.74, n=30, judge=claude-sonnet-4.7, human raters=3, date=2026-05. Now every score this rubric produces in production is traceable to that calibration.
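One way to keep that traceability is to attach the calibration tag to every score record the rubric produces. The sketch below uses the values from this walkthrough; the class, field names, and record shape are hypothetical.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class CalibrationTag:
    rubric: str
    kappa: float
    n_examples: int
    judge: str
    human_raters: int
    date: str

# Values taken from the walkthrough above.
TAG = CalibrationTag(
    rubric="rubric-factual-accuracy-v1.2",
    kappa=0.74,
    n_examples=30,
    judge="claude-sonnet-4.7",
    human_raters=3,
    date="2026-05",
)

def attach_tag(score_record: dict, tag: CalibrationTag) -> dict:
    """Return a copy of the score record with its calibration provenance."""
    return {**score_record, "calibration": asdict(tag)}

print(json.dumps(attach_tag({"pair_id": "q-001", "score": 2}, TAG), indent=2))
```

The frozen dataclass is a small guard against editing the tag in place after the rubric is frozen.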
Step 11 — Run on production. Score 500 real user questions with A and B. You see B wins 62% of decisive comparisons with 95% CI [56%, 68%] — confidence interval doesn't cross 50%, so B is a real win, not noise. Ship the new RAG pipeline.
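The interval in Step 11 can be reproduced with a normal-approximation confidence interval on the win rate among decisive (non-tie) comparisons. The walkthrough does not say how many of the 500 comparisons were decisive; the 250 used below is an assumed figure chosen for illustration.

```python
from math import sqrt

def win_rate_ci(wins: int, decisive: int, z: float = 1.96):
    """Normal-approximation 95% CI for B's win rate among non-tie comparisons."""
    p = wins / decisive
    half_width = z * sqrt(p * (1 - p) / decisive)
    return p, p - half_width, p + half_width

# Assumption: 250 of the 500 comparisons were decisive, and B won 155 of them.
p, lo, hi = win_rate_ci(wins=155, decisive=250)
print(f"win rate {p:.0%}, 95% CI [{lo:.0%}, {hi:.0%}], clears 50%: {lo > 0.5}")
```

With few decisive comparisons or a win rate near 0 or 1, a Wilson interval is the safer choice than the normal approximation.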
Notice what made this work: one well-specified dimension beats five vague ones; the role does the heavy lifting; disagreements diagnose the rubric, not the judge; the kappa number makes the result defensible to anyone who asks.