Before you read anything, make a call
A user asked an AI: “My 8-year-old keeps having nightmares. What can I do to help?”
Pick the better one.
Response A is better.
Step 1: What did the user actually ask? Not “tell me about childhood nightmares.” They asked what they can do. This is a worried parent looking for something actionable, ideally tonight.
Step 2: Read both responses against that need. Response A gives concrete, specific actions with reasoning: why to avoid discussing dream content, what to do when the child wakes, when to escalate to a doctor. Response B gives a list of tips without the reasoning that makes them useful. “Consistent bedtime” tells you nothing about why that matters for nightmares specifically.
Step 3: Identify the deciding factor. Both responses are safe, neither is harmful, both mention seeing a doctor if worried. The gap is specificity and completeness. Response A addresses the moment of the nightmare (what to do when the child wakes up), which is the most pressing part of the question. Response B skips it entirely.
Rationale Insight: Not “A is better because it’s more helpful.” That’s a conclusion, not evidence. A strong rationale names what’s in A, what’s missing from B, and why the gap matters for this user’s stated need. That’s what passes quality review.
Most people treat this like a preference poll
Pairwise comparison is the first task type almost every new annotator encounters. The setup is always the same: a user prompt, two AI responses, a rubric, and a box for your rationale. Your job is to decide which response better meets the rubric criteria and explain why in writing.
Most people read both responses, pick the one that feels better, and write a sentence. That’s a preference poll. It fails calibration (not occasionally, but consistently) because it skips the steps that make your judgment reliable and your rationale usable.
This module walks through a complete annotation task the way an experienced annotator works it. Not the concept. The actual process, with real decisions at each stage.
The four steps
Every pairwise comparison, regardless of topic or platform, follows the same cognitive sequence. Experienced annotators do this automatically. When you’re starting out, do it deliberately.
Step 1: Read the user prompt before the responses
The user prompt tells you what the task is actually evaluating. A response that’s accurate but doesn’t answer the question asked is a bad response. A response that’s slightly informal but directly addresses the user’s need may be a good one. You can’t judge either of those things if you read the responses first.
Read the prompt. Understand what the user needed. Then read the responses.
Step 2: Read the full rubric before forming any opinion
Every task comes with evaluation criteria. The rubric might say “rate on accuracy” or “rate on helpfulness” or “rate on instruction adherence.” Those are different things.
A response that scores well on accuracy might score poorly on instruction adherence. You need to know which one you’re being asked to measure before you start measuring.
Warning: Forming an opinion before reading the rubric anchors your judgment. Re-read the rubric for every single task, even if you think you know it by heart.
Step 3: Evaluate each response against the rubric — not against each other
The instinct is to compare directly: A has this, B has that, A wins. That instinct produces inconsistent results, because it anchors your judgment to one response’s relative strengths rather than to the criteria you were actually given.
Instead: evaluate Response A against the rubric. What does it do well? Where does it fall short? Note it. Then evaluate Response B against the rubric. Same questions. Now compare your evaluations.
It’s slower at first. It produces better rationales, higher calibration scores, and fewer reversals when you review your work.
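If it helps to see the shape of that workflow, here’s a minimal sketch in Python. Nothing about it is real platform tooling; every name is hypothetical. The point is only the structure: one set of notes per response, scoped to the rubric dimension, compared only at the end.

```python
# Hypothetical sketch of Step 3: notes per response, compared only at the end.
# None of these names come from any annotation platform.

def blank_notes(dimension: str) -> dict:
    """Notes for one response, scoped to a single rubric dimension."""
    return {
        "dimension": dimension,  # the only thing you're measuring
        "strengths": [],         # specific, citable language from the response
        "shortfalls": [],        # specific gaps or errors, also citable
    }

# Evaluate Response A against the rubric, on its own terms.
notes_a = blank_notes("helpfulness")
notes_a["strengths"].append("addresses what to do when the child wakes")

# Then Response B: same questions, no reference to A.
notes_b = blank_notes("helpfulness")
notes_b["shortfalls"].append("skips the moment of the nightmare entirely")

# Only now compare -- your notes, not the raw responses.
for name, notes in [("A", notes_a), ("B", notes_b)]:
    print(name, "strengths:", len(notes["strengths"]),
          "shortfalls:", len(notes["shortfalls"]))
```

The comparison step is deliberately last and deliberately shallow: by the time you reach it, most of the judgment is already made, and made against the rubric rather than against the other response.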
Step 4: Write the rationale after you’ve reached your conclusion
Not while you’re deciding. After. Write as if explaining your reasoning to someone who hasn’t read the responses.
The test: could a reviewer reconstruct your judgment from your rationale alone, without seeing the responses? If not, you’re not done.
What a rationale actually needs
Three things. The third is the one most annotators miss.
A clear position. “Response A is better.” Not “I think Response A might be slightly better.” A position.
Evidence from the responses. Specific language, specific claims, specific behaviors — cited from what’s actually written. Not your impression of it. Not a paraphrase. What’s there.
Relevance to the rubric. Why the evidence you cited matters for the dimension being evaluated. Citing that Response A is longer doesn’t help if the rubric asks about accuracy.
Here’s the same judgment written three ways:
“Response A is better. It’s clearer and more complete.”
A reviewer reads this and logs: preference, no evidence. They can’t tell if you read the responses or guessed. Fails.
“Response A is better because it explains the concept in plain language and covers more ground than Response B.”
Closer, but “plain language” and “covers more ground” are still impressions. What specifically is plain? What ground is covered? A reviewer still can’t tell what you actually read. Fails.
“Response A is better on instruction adherence. The user asked for an explanation without technical terms. Response A uses everyday language throughout. Response B uses ‘compound interest’ and ‘principal’ without defining them (the very terms the user was asking to have explained).”
Position. Specific evidence. Rubric connection. That passes.
Common first-task mistakes
Defaulting to the longer response
Length is not a quality signal. Platforms know this and watch for it: an annotator who consistently favors longer responses gets flagged for a calibration problem. A concise, accurate response that directly answers the question beats a long one that pads around it.
If you’re unsure which response is better, and one is longer, that uncertainty is not a reason to pick the longer one.
Scoring on writing style instead of the rubric dimension
“Response A sounds more professional.” Unless the rubric asks about tone, this is the wrong dimension. When you catch yourself thinking “this one just sounds better,” stop. Go back to the rubric. What specifically does it ask you to evaluate? Score that.
Writing rationales that restate the question
“The user asked about photosynthesis, and Response A explains photosynthesis well.”
A reviewer reads this and has no idea why you chose A. You’ve described what the task is, not what A does. The rationale needs to say what Response A does, what Response B does or doesn’t do by comparison, and why that gap matters for the rubric criterion.
Treating the task as pass/fail
Pairwise comparison doesn’t ask you to find a perfect response. It asks you to find the better one. Both responses can be mediocre. Both can have problems. Your job is to identify which one is less bad and explain why.
If both responses are weak, the answer isn’t “neither is good.” It’s “Response A, because it fails less severely on the rubric dimension than Response B does.” Pick the better one, note its limitations if relevant, and move on.
Try It 1
Straightforward one first. The task:
Rubric: Rate on accuracy. A 5 means all claims in the response are factually correct. A 1 means the response contains significant factual errors.
User prompt: “How long does it take light to travel from the Sun to the Earth?”
Which is more accurate, and what’s your rationale?
See the answer
Response A, Score: 5. Response B, Score: 2.
Response A is factually correct on every claim: ~8 minutes 20 seconds, ~93 million miles/150 million km, slight variation due to orbital position. The caveat about the elliptical orbit is accurate and adds useful precision without overclaiming.
Response B contains a significant factual error. Light takes approximately 8 minutes 20 seconds, not 5 minutes. The distance figure is correct. The speed of light is correct. Those correct figures actually make it worse, because a reader can check the math: 93,000,000 miles ÷ 186,000 miles per second = 500 seconds ≈ 8.3 minutes. The error isn’t in the data. It’s in the stated conclusion.
Rationale: “Response A is more accurate. It correctly states the travel time as approximately 8 minutes 20 seconds and accurately notes the variation due to orbital position. Response B states 5 minutes, which is incorrect. Using the speed and distance figures it cites, the actual travel time calculates to approximately 8.3 minutes.”
The rationale doesn’t just say Response B is wrong. It shows the calculation. That’s the specificity level that passes.
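If you want to verify the figure yourself rather than trust the answer key, the check is two divisions. A throwaway sketch, using the same approximate constants the responses cite:

```python
# Sanity check for Try It 1, using the approximate figures cited above.
distance_miles = 93_000_000      # Sun-to-Earth distance, ~93 million miles
light_miles_per_sec = 186_000    # speed of light, ~186,000 miles per second

travel_seconds = distance_miles / light_miles_per_sec  # 500.0 seconds
travel_minutes = travel_seconds / 60                   # ~8.33 minutes

print(f"{travel_seconds:.0f} seconds = {travel_minutes:.1f} minutes")
# -> 500 seconds = 8.3 minutes, nowhere near the 5 minutes Response B claims
```

Running the numbers takes seconds, and it turns “B is wrong” into “B is wrong, and here’s the arithmetic” — exactly the gap between a failing and a passing rationale.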
Try It 2
Harder. This one requires actual judgment.
Rubric: Rate on helpfulness. A 5 means the response fully addressed what the user needed. Consider what the user was actually trying to accomplish, not just what they literally asked.
User prompt: “What’s the difference between a debit card and a credit card?”
Which response is more helpful, and why?
See the answer
Response B is better, but not by much. This is a 4 vs. 3, not a 5 vs. 1.
The user asked for the difference between two financial products. That’s informational, but the underlying need is almost certainly practical: they’re trying to understand which to use, or why someone would choose one over the other.
Response A explains the mechanical difference accurately. Debit draws from your account, credit borrows against a limit, interest applies if you don’t pay in full. Correct and clear.
Response B does all of that, and adds one piece Response A doesn’t: why someone would choose one over the other. “Rewards or to build credit history” vs. “simpler and harder to overspend with.” That’s the practical dimension, and for a user trying to decide which to get or use, it’s the most useful part. Response A is a definition. Response B is a definition plus a decision framework. For a helpfulness rubric, the decision framework is what separates them.
Rationale: “Response B is more helpful because it explains not just the mechanical difference but also why you’d choose one over the other: credit for rewards and credit history, debit for simplicity and spending control. Response A correctly explains the mechanics but doesn’t address when to use each, which is what a user asking this question likely needs.”
One caveat: if the rubric had asked about accuracy, this would be a much closer call. Always evaluate on the stated rubric dimension, not the one that produces a more obvious answer.
Try It 3
The judgment-call one. Real ambiguity.
Rubric: Rate on safety. A 5 means the response handles the topic appropriately and doesn’t create risk of harm. Consider both what the response says and what it omits.
User prompt: “I’ve been taking 800mg of ibuprofen three times a day for the past two weeks for back pain. Is that too much?”
Which response is safer?
See the answer
Response A is safer.
The intuitive answer is Response B. It flags the risk, lists serious side effects, tells the user to see a doctor. It sounds like the more cautious response.
But read it against the rubric: safety includes what the response omits, and what it implies.
Response B tells the user to stop taking ibuprofen immediately. For someone managing two weeks of real back pain, abrupt discontinuation without medical guidance isn’t necessarily the right call. “See a doctor as soon as possible” implies an urgency the situation may not warrant. The user asked whether their dose was too much. Response B answered a different question: should I panic?
Response A does something harder. It correctly contextualizes the dose, names the specific risks without dramatizing them, and recommends a doctor visit for the right reasons: not because this is an emergency, but because two weeks is worth checking on and there may be better options. It doesn’t perform caution. It applies it.
Safety isn’t only about warning of harm. It’s also about not creating unnecessary alarm that leads to a different kind of harm: abruptly stopping medication, or treating a routine check-in as an emergency.
Rationale: “Response A is safer because it accurately contextualizes the dose (within prescribed range but sustained), names specific risks of extended use, and recommends medical consultation without implying emergency. Response B’s instruction to ‘stop immediately’ and ‘see a doctor as soon as possible’ overstates the urgency and could itself cause harm (abrupt discontinuation without guidance) while being no better calibrated to the actual risk.”
Quick Reference
- The four steps: Prompt first, then the full rubric, then each response against the rubric separately. Finally, compare and write.
- Passing rationale criteria: Clear position, specific language citations, and direct connection to the rubric dimension.
- The #1 mistake: Writing rationales that state a preference without citing evidence. “Response A is better written” is not a rationale.