Before you read anything, make a call
A user asked an AI: “My 8-year-old keeps having nightmares. What can I do to help?”
Pick the better one.
Response A is better.
Step 1: What did the user actually ask? Not “tell me about childhood nightmares.” They asked what they can do. This is a worried parent looking for something actionable, ideally tonight.
Step 2: Read both responses against that need. Response A gives concrete, specific actions with reasoning: why to avoid discussing dream content, what to do when the child wakes, when to escalate to a doctor. Response B gives a list of tips without the reasoning that makes them useful. “Consistent bedtime” tells you nothing about why that matters for nightmares specifically.
Step 3: Identify the deciding factor. Both responses are safe, neither is harmful, both mention seeing a doctor if worried. The gap is specificity and completeness. Response A addresses the moment of the nightmare (what to do when the child wakes up), which is the most pressing part of the question. Response B skips it entirely.
Rationale Insight: Not “A is better because it’s more helpful.” That’s a conclusion, not evidence. A strong rationale names what’s in A, what’s missing from B, and why the gap matters for this user’s stated need. That’s what passes quality review.
Most people treat this like a preference poll
Pairwise comparison is the first task type almost every new annotator encounters. The setup is always the same: a user prompt, two AI responses, a rubric, and a box for your rationale. Your job is to decide which response better meets the rubric criteria and explain why in writing.
Most people read both responses, pick the one that feels better, and write a sentence. That’s a preference poll. It fails calibration (not occasionally, but consistently) because it skips the steps that make your judgment reliable and your rationale usable.
This module walks through a complete annotation task the way an experienced annotator works it. Not the concept. The actual process, with real decisions at each stage.
The four steps
Every pairwise comparison, regardless of topic or platform, follows the same cognitive sequence. Experienced annotators do this automatically. When you’re starting out, do it deliberately.
Step 1: Read the user prompt before the responses
The user prompt tells you what the task is actually evaluating. A response that’s accurate but doesn’t answer the question asked is a bad response. A response that’s slightly informal but directly addresses the user’s need may be a good one. You can’t judge either of those things if you read the responses first.
Read the prompt. Understand what the user needed. Then read the responses.
Step 2: Read the full rubric before forming any opinion
Every task comes with evaluation criteria. The rubric might say “rate on accuracy” or “rate on helpfulness” or “rate on instruction adherence.” Those are different things.
A response that scores well on accuracy might score poorly on instruction adherence. You need to know which one you’re being asked to measure before you start measuring.
Warning: Forming an opinion before reading the rubric anchors your judgment. Re-read the rubric for every single task, even if you think you know it by heart.
Step 3: Evaluate each response against the rubric — not against each other
The instinct is to compare directly: A has this, B has that, A wins. That instinct produces inconsistent results, because it anchors your judgment to one response’s relative strengths rather than to the criteria you were actually given.
Instead: evaluate Response A against the rubric. What does it do well? Where does it fall short? Note it. Then evaluate Response B against the rubric. Same questions. Now compare your evaluations.
It’s slower at first. It produces better rationales, higher calibration scores, and fewer reversals when you review your work.
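If it helps to see the shape of that workflow, here’s a minimal sketch in Python. Nothing about it is real platform tooling; every name is hypothetical. The point is only the structure: one set of notes per response, scoped to the rubric dimension, compared only at the end.

```python
# Hypothetical sketch of Step 3: notes per response, compared only at the end.
# None of these names come from any annotation platform.

def blank_notes(dimension: str) -> dict:
    """Notes for one response, scoped to a single rubric dimension."""
    return {
        "dimension": dimension,  # the only thing you're measuring
        "strengths": [],         # specific, citable language from the response
        "shortfalls": [],        # specific gaps or errors, also citable
    }

# Evaluate Response A against the rubric, on its own terms.
notes_a = blank_notes("helpfulness")
notes_a["strengths"].append("addresses what to do when the child wakes")

# Then Response B: same questions, no reference to A.
notes_b = blank_notes("helpfulness")
notes_b["shortfalls"].append("skips the moment of the nightmare entirely")

# Only now compare -- your notes, not the raw responses.
for name, notes in [("A", notes_a), ("B", notes_b)]:
    print(name, "strengths:", len(notes["strengths"]),
          "shortfalls:", len(notes["shortfalls"]))
```

The comparison step is deliberately last and deliberately shallow: by the time you reach it, most of the judgment is already made, and made against the rubric rather than against the other response.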
Step 4: Write the rationale after you’ve reached your conclusion
Not while you’re deciding. After. Write as if explaining your reasoning to someone who hasn’t read the responses.
The test: could a reviewer reconstruct your judgment from your rationale alone, without seeing the responses? If not, you’re not done.
What a rationale actually needs
Three things. The third is the one most annotators miss.
A clear position. “Response A is better.” Not “I think Response A might be slightly better.” A position.
Evidence from the responses. Specific language, specific claims, specific behaviors — cited from what’s actually written. Not your impression of it. Not a paraphrase. What’s there.
Relevance to the rubric. Why the evidence you cited matters for the dimension being evaluated. Citing that Response A is longer doesn’t help if the rubric asks about accuracy.
Here’s the same judgment written three ways:
“Response A is better. It’s clearer and more complete.”
A reviewer reads this and logs: preference, no evidence. They can’t tell if you read the responses or guessed. Fails.
“Response A is better because it explains the concept in plain language and covers more ground than Response B.”
Closer, but “plain language” and “covers more ground” are still impressions. What specifically is plain? What ground is covered? A reviewer still can’t tell what you actually read. Fails.
“Response A is better on instruction adherence. The user asked for an explanation without technical terms. Response A uses everyday language throughout. Response B uses ‘compound interest’ and ‘principal’ without defining them (the very terms the user was asking to have explained).”
Position. Specific evidence. Rubric connection. That passes.
Common first-task mistakes
Defaulting to the longer response
Length is not a quality signal. Platforms know this and watch for it: an annotator who consistently favors longer responses gets flagged for a calibration problem. A concise, accurate response that directly answers the question beats a long one that pads around it.
If you’re unsure which response is better, and one is longer, that uncertainty is not a reason to pick the longer one.
Scoring on writing style instead of the rubric dimension
“Response A sounds more professional.” Unless the rubric asks about tone, this is the wrong dimension. When you catch yourself thinking “this one just sounds better,” stop. Go back to the rubric. What specifically does it ask you to evaluate? Score that.
Writing rationales that restate the question
“The user asked about photosynthesis, and Response A explains photosynthesis well.”
A reviewer reads this and has no idea why you chose A. You’ve described what the task is, not what A does. The rationale needs to say what Response A does, what Response B does or doesn’t do by comparison, and why that gap matters for the rubric criterion.
Treating the task as pass/fail
Pairwise comparison doesn’t ask you to find a perfect response. It asks you to find the better one. Both responses can be mediocre. Both can have problems. Your job is to identify which one is less bad and explain why.
If both responses are weak, the answer isn’t “neither is good.” It’s “Response A, because it fails less severely on the rubric dimension than Response B does.” Pick the better one, note its limitations if relevant, and move on.
Try It 1
Straightforward one first. The task:
Rubric: Rate on accuracy. A 5 means all claims in the response are factually correct. A 1 means the response contains significant factual errors.
User prompt: “How long does it take light to travel from the Sun to the Earth?”
Which is more accurate, and what’s your rationale?
See the answer
Response A, Score: 5. Response B, Score: 2.
Response A is factually correct on every claim: ~8 minutes 20 seconds, ~93 million miles/150 million km, slight variation due to orbital position. The caveat about the elliptical orbit is accurate and adds useful precision without overclaiming.
Response B contains a significant factual error. Light takes approximately 8 minutes 20 seconds, not 5 minutes. The distance figure is correct. The speed of light is correct. Those correct figures actually make it worse, because a reader can check the math: 93,000,000 miles ÷ 186,000 miles per second = 500 seconds ≈ 8.3 minutes. The error isn’t in the data. It’s in the stated conclusion.
Rationale: “Response A is more accurate. It correctly states the travel time as approximately 8 minutes 20 seconds and accurately notes the variation due to orbital position. Response B states 5 minutes, which is incorrect. Using the speed and distance figures it cites, the actual travel time calculates to approximately 8.3 minutes.”
The rationale doesn’t just say Response B is wrong. It shows the calculation. That’s the specificity level that passes.
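If you want to verify the figure yourself rather than trust the answer key, the check is two divisions. A throwaway sketch, using the same approximate constants the responses cite:

```python
# Sanity check for Try It 1, using the approximate figures cited above.
distance_miles = 93_000_000      # Sun-to-Earth distance, ~93 million miles
light_miles_per_sec = 186_000    # speed of light, ~186,000 miles per second

travel_seconds = distance_miles / light_miles_per_sec  # 500.0 seconds
travel_minutes = travel_seconds / 60                   # ~8.33 minutes

print(f"{travel_seconds:.0f} seconds = {travel_minutes:.1f} minutes")
# -> 500 seconds = 8.3 minutes, nowhere near the 5 minutes Response B claims
```

Running the numbers takes seconds, and it turns “B is wrong” into “B is wrong, and here’s the arithmetic” — exactly the gap between a failing and a passing rationale.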
Try It 2
Harder. This one requires actual judgment.
Rubric: Rate on helpfulness. A 5 means the response fully addressed what the user needed. Consider what the user was actually trying to accomplish, not just what they literally asked.
User prompt: “What’s the difference between a debit card and a credit card?”
Which response is more helpful, and why?
See the answer
Response B is better, but not by much. This is a 4 vs. 3, not a 5 vs. 1.
The user asked for the difference between two financial products. That’s informational, but the underlying need is almost certainly practical: they’re trying to understand which to use, or why someone would choose one over the other.
Response A explains the mechanical difference accurately. Debit draws from your account, credit borrows against a limit, interest applies if you don’t pay in full. Correct and clear.
Response B does all of that, and adds one piece Response A doesn’t: why someone would choose one over the other. “Rewards or to build credit history” vs. “simpler and harder to overspend with.” That’s the practical dimension, and for a user trying to decide which to get or use, it’s the most useful part. Response A is a definition. Response B is a definition plus a decision framework. For a helpfulness rubric, the decision framework is what separates them.
Rationale: “Response B is more helpful because it explains not just the mechanical difference but also why you’d choose one over the other: credit for rewards and credit history, debit for simplicity and spending control. Response A correctly explains the mechanics but doesn’t address when to use each, which is what a user asking this question likely needs.”
One caveat: if the rubric had asked about accuracy, this would be a much closer call. Always evaluate on the stated rubric dimension, not the one that produces a more obvious answer.
Try It 3
The judgment-call one. Real ambiguity.
Rubric: Rate on safety. A 5 means the response handles the topic appropriately and doesn’t create risk of harm. Consider both what the response says and what it omits.
User prompt: “I’ve been taking 800mg of ibuprofen three times a day for the past two weeks for back pain. Is that too much?”
Which response is safer?
See the answer
Response A is safer.
The intuitive answer is Response B. It flags the risk, lists serious side effects, tells the user to see a doctor. It sounds like the more cautious response.
But read it against the rubric: safety includes what the response omits, and what it implies.
Response B tells the user to stop taking ibuprofen immediately. For someone managing two weeks of real back pain, abrupt discontinuation without medical guidance isn’t necessarily the right call. “See a doctor as soon as possible” implies an urgency the situation may not warrant. The user asked whether their dose was too much. Response B answered a different question: should I panic?
Response A does something harder. It correctly contextualizes the dose, names the specific risks without dramatizing them, and recommends a doctor visit for the right reasons: not because this is an emergency, but because two weeks is worth checking on and there may be better options. It doesn’t perform caution. It applies it.
Safety isn’t only about warning of harm. It’s also about not creating unnecessary alarm that leads to a different kind of harm: abruptly stopping medication, or treating a routine check-in as an emergency.
Rationale: “Response A is safer because it accurately contextualizes the dose (within prescribed range but sustained), names specific risks of extended use, and recommends medical consultation without implying emergency. Response B’s instruction to ‘stop immediately’ and ‘see a doctor as soon as possible’ overstates the urgency and could itself cause harm (abrupt discontinuation without guidance) while being no better calibrated to the actual risk.”
Quick Reference
- The four steps: Prompt first, then the full rubric, then each response against the rubric separately. Finally, compare and write.
- Passing rationale criteria: Clear position, specific language citations, and direct connection to the rubric dimension.
- The #1 mistake: Writing rationales that state a preference without citing evidence. “Response A is better written” is not a rationale.