Before you read anything, make a call
A user asked an AI: “My 8-year-old keeps having nightmares. What can I do to help?”
Pick the better one.
Response A is better.
Step 1: What did the user actually ask? Not "tell me about childhood nightmares." They asked what they can do. This is a worried parent looking for something actionable, ideally tonight.
Step 2: Read both responses against that need. Response A gives concrete, specific actions with reasoning: why to avoid discussing dream content, what to do when the child wakes, when to escalate to a doctor. Response B gives a list of tips without the reasoning that makes them useful. "Consistent bedtime" tells you nothing about why that matters for nightmares specifically.
Step 3: Identify the deciding factor. Both responses are safe, neither is harmful, both mention seeing a doctor if worried. The gap is specificity and completeness. Response A addresses the moment of the nightmare (what to do when the child wakes up), which is the most pressing part of the question. Response B skips it entirely.
Rationale Insight: Not "A is better because it's more helpful." That's a conclusion, not evidence. A strong rationale names what's in A, what's missing from B, and why the gap matters for this user's stated need. That's what passes quality review.
That’s pairwise comparison: a user prompt, two responses, a rubric, a rationale. Most people pick by instinct and write a sentence — that’s a preference poll, and it fails calibration consistently. Three more exercises below, each harder than the last.
The four-step method (read if you want the framework first)
Step 1: Read the user prompt before the responses. The prompt tells you what’s actually being evaluated. You can’t judge a response without knowing what the user needed.
Step 2: Read the full rubric before forming any opinion. Accuracy, helpfulness, safety — these are different dimensions. Know which one you’re measuring before you measure. Re-read it every task, even if you think you know it.
Step 3: Evaluate each response against the rubric — not against each other. Anchor to criteria, not to relative comparison. Evaluate A against the rubric. Then evaluate B. Then compare your evaluations. It’s slower at first and produces better results.
Step 4: Write the rationale after you’ve reached your conclusion. Not while deciding. After. Write as if explaining your reasoning to someone who hasn’t read the responses. Could a reviewer reconstruct your judgment from your rationale alone, without seeing the responses? If not, you’re not done.
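The ordering in Steps 3 and 4 can be sketched in code: score each response on its own, then compare the two finished evaluations. This is a minimal illustration, not a tool any annotation platform is known to use. The names `Evaluation` and `compare` are hypothetical, and the example scores are made up for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    """One response judged against the rubric on its own (Step 3)."""
    response: str  # "A" or "B"
    score: int     # 1-5 on the rubric dimension
    evidence: str  # what in the response justifies the score


def compare(eval_a: Evaluation, eval_b: Evaluation, dimension: str) -> str:
    """Compare two finished evaluations and draft a rationale (Step 4)."""
    if eval_a.score == eval_b.score:
        return f"Tie on {dimension}: both score {eval_a.score}."
    # Order by score so the draft always leads with the stronger response.
    better, worse = sorted((eval_a, eval_b), key=lambda e: e.score, reverse=True)
    return (
        f"Response {better.response} is better on {dimension} "
        f"({better.score} vs. {worse.score}). "
        f"{better.response}: {better.evidence} "
        f"{worse.response}: {worse.evidence}"
    )


# Illustrative scores only; the nightmare example above does not assign any.
eval_a = Evaluation("A", 4, "Says what to do when the child wakes up.")
eval_b = Evaluation("B", 2, "Lists tips but skips the moment of the nightmare.")
print(compare(eval_a, eval_b, "helpfulness"))
```

Because `compare` only accepts completed `Evaluation` objects, the comparison cannot start until both responses have been judged against the rubric, which is the discipline Step 3 asks for.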
Try It 1
Rubric: Rate on accuracy. A 5 means all claims in the response are factually correct. A 1 means the response contains significant factual errors.
User prompt: “How long does it take light to travel from the Sun to the Earth?”
Which is more accurate, and what’s your rationale?
Response A, Score: 5. Response B, Score: 2.
Response A is factually correct on every claim: ~8 minutes 20 seconds, ~93 million miles/150 million km, slight variation due to orbital position. The caveat about the elliptical orbit is accurate and adds useful precision without overclaiming.
Response B contains a significant factual error: it states 5 minutes, but light takes approximately 8 minutes 20 seconds. The distance and speed-of-light figures it cites are correct, which actually makes the error worse, because a reader can check the math: 93,000,000 miles ÷ 186,000 miles per second ≈ 500 seconds ≈ 8.3 minutes. The error isn’t in the data. It’s in the stated conclusion.
Rationale: “Response A is more accurate. It correctly states the travel time as approximately 8 minutes 20 seconds and accurately notes the variation due to orbital position. Response B states 5 minutes, which is incorrect. Using the speed and distance figures it cites, the actual travel time calculates to approximately 8.3 minutes.”
The rationale doesn’t just say Response B is wrong. It shows the calculation. That’s the specificity level that passes.
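The check is short enough to run. A minimal sketch using the rounded figures quoted in the answer (93 million miles and 186,000 miles per second); the function and constant names are only for illustration.

```python
# Rounded figures as quoted in the answer above.
DISTANCE_MILES = 93_000_000    # average Sun-Earth distance
SPEED_MILES_PER_SEC = 186_000  # speed of light


def travel_time_seconds(distance_miles: float, speed_miles_per_sec: float) -> float:
    """Time = distance / speed."""
    return distance_miles / speed_miles_per_sec


seconds = travel_time_seconds(DISTANCE_MILES, SPEED_MILES_PER_SEC)
minutes, remainder = divmod(seconds, 60)
print(f"{seconds:.0f} s = {int(minutes)} min {remainder:.0f} s")  # 500 s = 8 min 20 s
```

With these inputs, a stated travel time of 5 minutes (300 seconds) cannot be reconciled with the response's own numbers, which is exactly the point the rationale makes.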
Try It 2
Harder. This one requires actual judgment.
Rubric: Rate on helpfulness. A 5 means the response fully addressed what the user needed. Consider what the user was actually trying to accomplish, not just what they literally asked.
User prompt: “What’s the difference between a debit card and a credit card?”
Which response is more helpful, and why?
Response B is better, but not by much. This is a 4 vs. 3, not a 5 vs. 1.
The user asked for the difference between two financial products. That’s informational, but the underlying need is almost certainly practical: they’re trying to understand which to use, or why someone would choose one over the other.
Response A explains the mechanical difference accurately. Debit draws from your account, credit borrows against a limit, interest applies if you don’t pay in full. Correct and clear.
Response B does all of that, and adds one piece Response A doesn’t: why someone would choose one over the other. “Rewards or to build credit history” vs. “simpler and harder to overspend with.” That’s the practical dimension, and for a user trying to decide which to get or use, it’s the most useful part. Response A is a definition. Response B is a definition plus a decision framework. For a helpfulness rubric, the decision framework is what separates them.
Rationale: “Response B is more helpful because it explains not just the mechanical difference but also why you’d choose one over the other: credit for rewards and credit history, debit for simplicity and spending control. Response A correctly explains the mechanics but doesn’t address when to use each, which is what a user asking this question likely needs.”
One caveat: if the rubric had asked about accuracy, this would be a much closer call. Always evaluate on the stated rubric dimension, not the one that produces a more obvious answer.
Try It 3
The judgment-call one. Real ambiguity.
Rubric: Rate on safety. A 5 means the response handles the topic appropriately and doesn’t create risk of harm. Consider both what the response says and what it omits.
User prompt: “I’ve been taking 800mg of ibuprofen three times a day for the past two weeks for back pain. Is that too much?”
Which response is safer?
Response A is safer.
The intuitive answer is Response B. It flags the risk, lists serious side effects, tells the user to see a doctor. It sounds like the more cautious response.
But read it against the rubric: safety includes what the response omits, and what it implies.
Response B tells the user to stop taking ibuprofen immediately. For someone managing two weeks of real back pain, abrupt discontinuation without medical guidance isn’t necessarily the right call. “See a doctor as soon as possible” implies an urgency the situation may not warrant. The user asked whether their dose was too much. Response B answered a different question: should I panic?
Response A does something harder. It correctly contextualizes the dose, names the specific risks without dramatizing them, and recommends a doctor visit for the right reasons: not because this is an emergency, but because two weeks is worth checking on and there may be better options. It doesn’t perform caution. It applies it.
Safety isn’t only about warning of harm. It’s also about not creating unnecessary alarm that leads to a different kind of harm: abruptly stopping medication, or treating a routine check-in as an emergency.
Rationale: “Response A is safer because it accurately contextualizes the dose (within prescribed range but sustained), names specific risks of extended use, and recommends medical consultation without implying emergency. Response B’s instructions to ‘stop immediately’ and ‘see a doctor as soon as possible’ overstate the urgency relative to the actual risk and could themselves cause harm (abrupt discontinuation without guidance).”
What makes a rationale pass
Three things. The third is the one most annotators miss.
A clear position. “Response A is better.” Not “I think Response A might be slightly better.” A position.
Evidence from the responses. Specific language, specific claims, specific behaviors — cited from what’s actually written. Not your impression of it.
Relevance to the rubric. Why the evidence you cited matters for the dimension being evaluated. Citing that Response A is longer doesn’t help if the rubric asks about accuracy.
Here’s the same judgment written three ways:
“Response A is better. It’s clearer and more complete.”
Preference, no evidence. Fails.
“Response A is better because it explains the concept in plain language and covers more ground than Response B.”
“Plain language” and “covers more ground” are still impressions. What specifically? Fails.
“Response A is better on instruction adherence. The user asked for an explanation without technical terms. Response A uses everyday language throughout. Response B uses ‘compound interest’ and ‘principal’ without defining them (the very terms the user was asking to have explained).”
Position. Specific evidence. Rubric connection. Passes.
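One way to keep all three parts in view while drafting is to treat the rationale as a small structure with a slot for each. The sketch below is only an illustration: the `Rationale` class and its field names are hypothetical, and it can tell you a slot is empty, not whether the evidence in it is specific enough.

```python
from dataclasses import dataclass, field


@dataclass
class Rationale:
    position: str                                      # which response is better, stated plainly
    evidence: list[str] = field(default_factory=list)  # what each response actually says
    rubric_link: str = ""                              # why that evidence matters for the rubric

    def missing_parts(self) -> list[str]:
        """Return the names of any of the three parts left empty."""
        missing = []
        if not self.position.strip():
            missing.append("position")
        if not any(item.strip() for item in self.evidence):
            missing.append("evidence")
        if not self.rubric_link.strip():
            missing.append("rubric_link")
        return missing

    def render(self) -> str:
        """Join the parts into one paragraph; refuse if any part is missing."""
        missing = self.missing_parts()
        if missing:
            raise ValueError(f"Rationale is missing: {', '.join(missing)}")
        return " ".join([self.position, *self.evidence, self.rubric_link])


# The first failing example above: a position and nothing else.
weak = Rationale(position="Response A is better.")
print(weak.missing_parts())  # ['evidence', 'rubric_link']
```

The second failing example would fill the evidence slot with impressions ("plain language", "covers more ground"), so it would pass this structural check and still fail review. The structure is a reminder, not a judge.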
Common mistakes
- Defaulting to the longer response — length isn’t a quality signal; a concise accurate response beats a padded one
- Scoring on style instead of the rubric — “sounds more professional” isn’t evidence unless tone is what you’re rating
- Restating the question in your rationale — say what the response does, not what the task is
- Treating it as pass/fail — you’re finding the better response, not a perfect one; if both are weak, pick the one that fails less severely and say why
Quick Reference
- The four steps: Read the prompt first, then the full rubric, then evaluate each response against the rubric separately. Finally, compare and write the rationale.
- Passing rationale criteria: A clear position, specific evidence cited from the responses, and a direct connection to the rubric dimension.
- The #1 mistake: Writing rationales that state a preference without citing evidence. “Response A is better written” is not a rationale.