aitrainer.work - AI Training Jobs Platform
Your First Annotation Task
F3: Your First Annotation Task
F3 Core
⏱️ 30 minutes 🎯 Foundation

Walk through a complete pairwise comparison task step by step. By the end you'll have made a real annotation decision and written a rationale, and you'll know what quality means in practice.

Before you read anything, make a call

A user asked an AI: “My 8-year-old keeps having nightmares. What can I do to help?”

Response A: "Nightmares are common in children and usually aren't a cause for concern. To help your child, try establishing a calming bedtime routine: a consistent wind-down period with quiet activities, dim lighting, and no screens for at least an hour before bed. If your child wakes from a nightmare, stay calm yourself: go to them, offer reassurance, and avoid detailed discussion of the dream content, which can reinforce it. A comfort object or a small nightlight can help with the transition back to sleep. If nightmares are frequent, intense, or accompanied by sleepwalking, it's worth mentioning to your pediatrician."

Response B: "Nightmares can be scary for kids! Here are some tips to help: make sure your child has a consistent bedtime, try a relaxing routine before bed, consider a nightlight, and offer comfort when they wake up scared. Reading calming books before bed can also help. Most children grow out of nightmares. If you're worried, talk to your doctor."

Pick the better one.

Response A is better.

Step 1: What did the user actually ask? Not "tell me about childhood nightmares." They asked what they can do. This is a worried parent looking for something actionable, ideally tonight.

Step 2: Read both responses against that need. Response A gives concrete, specific actions with reasoning: why to avoid discussing dream content, what to do when the child wakes, when to escalate to a doctor. Response B gives a list of tips without the reasoning that makes them useful. "Consistent bedtime" tells you nothing about why that matters for nightmares specifically.

Step 3: Identify the deciding factor. Both responses are safe, neither is harmful, both mention seeing a doctor if worried. The gap is specificity and completeness. Response A addresses the moment of the nightmare (what to do when the child wakes up), which is the most pressing part of the question. Response B gives that moment a single clause ("offer comfort when they wake up scared") with no guidance on how.

Rationale Insight: Not "A is better because it's more helpful." That's a conclusion, not evidence. A strong rationale names what's in A, what's missing from B, and why the gap matters for this user's stated need. That's what passes quality review.


That’s pairwise comparison: a user prompt, two responses, a rubric, a rationale. Most people pick by instinct and write a sentence — that’s a preference poll, and it fails calibration consistently. Three more exercises below, each harder than the last.

The four-step method (read if you want the framework first)

Step 1: Read the user prompt before the responses. The prompt tells you what’s actually being evaluated. You can’t judge a response without knowing what the user needed.

Step 2: Read the full rubric before forming any opinion. Accuracy, helpfulness, safety — these are different dimensions. Know which one you’re measuring before you measure. Re-read it every task, even if you think you know it.

Step 3: Evaluate each response against the rubric — not against each other. Anchor to criteria, not to relative comparison. Evaluate A against the rubric. Then evaluate B. Then compare your evaluations. It’s slower at first and produces better results.

Step 4: Write the rationale after you’ve reached your conclusion. Not while deciding. After. Write as if explaining your reasoning to someone who hasn’t read the responses. Could a reviewer reconstruct your judgment from your rationale alone, without seeing the responses? If not, you’re not done.
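
If it helps to see the order of operations written down, here is a minimal sketch of Steps 3 and 4. It is not how any platform stores annotations. The `Evaluation` class, the `pick_better` function, and the scores and evidence strings are all hypothetical, made up for illustration. The point it shows is that each response gets its own rubric score first, and the comparison only ever looks at those finished evaluations.

```python
from dataclasses import dataclass, field


@dataclass
class Evaluation:
    """One response judged against the rubric on its own (Step 3)."""
    label: str                                          # e.g. "A" or "B"
    score: int                                          # 1-5 on the single rubric dimension
    evidence: list[str] = field(default_factory=list)   # what in the response supports the score


def pick_better(first: Evaluation, second: Evaluation) -> str:
    """Compare two finished evaluations, never the raw responses."""
    if first.score == second.score:
        return "tie: re-read the rubric and look for the deciding factor"
    winner = first if first.score > second.score else second
    return f"Response {winner.label}"


# Hypothetical scores and evidence, for illustration only.
eval_a = Evaluation("A", 4, ["gives steps the user can act on"])
eval_b = Evaluation("B", 3, ["lists tips without the reasoning behind them"])

print(pick_better(eval_a, eval_b))
```

The evidence lists are what you draw on when you write the rationale in Step 4, which is why they are recorded before the comparison rather than after.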


Try It 1

Rubric: Rate on accuracy. A 5 means all claims in the response are factually correct. A 1 means the response contains significant factual errors.

User prompt: “How long does it take light to travel from the Sun to the Earth?”

Response A: "Light takes approximately 8 minutes and 20 seconds to travel from the Sun to Earth, covering a distance of about 93 million miles (150 million kilometers). This varies slightly depending on Earth's position in its elliptical orbit, but 8 minutes is the commonly cited figure."

Response B: "Light from the Sun reaches Earth in about 5 minutes. The Sun is roughly 93 million miles away, and since light travels at 186,000 miles per second, you can calculate it from there."

Which is more accurate, and what’s your rationale?

Response A, Score: 5. Response B, Score: 2.

Response A is factually correct on every claim: ~8 minutes 20 seconds, ~93 million miles/150 million km, slight variation due to orbital position. The caveat about the elliptical orbit is accurate and adds useful precision without overclaiming.

Response B contains a significant factual error. Light takes approximately 8 minutes 20 seconds, not 5 minutes. The distance figure is correct. The speed of light is correct. This actually makes it worse, because a reader can check the math: 93,000,000 ÷ 186,000 = ~500 seconds = ~8.3 minutes. The error isn’t in the data. It’s in the stated conclusion.

Rationale: “Response A is more accurate. It correctly states the travel time as approximately 8 minutes 20 seconds and accurately notes the variation due to orbital position. Response B states 5 minutes, which is incorrect. Using the speed and distance figures it cites, the actual travel time calculates to approximately 8.3 minutes.”

The rationale doesn’t just say Response B is wrong. It shows the calculation. That’s the specificity level that passes.
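
If you want to confirm the arithmetic before putting it in a rationale, the check takes a few lines. This is a minimal sketch that uses only the two figures Response B itself cites. The constant and function names are made up for illustration.

```python
# Figures exactly as cited in Response B.
DISTANCE_MILES = 93_000_000
SPEED_MILES_PER_SECOND = 186_000


def travel_time_seconds(distance_miles: float, speed_miles_per_second: float) -> float:
    """Time = distance / speed."""
    return distance_miles / speed_miles_per_second


seconds = travel_time_seconds(DISTANCE_MILES, SPEED_MILES_PER_SECOND)
minutes, remainder = divmod(seconds, 60)

print(f"{seconds:.0f} s = {int(minutes)} min {remainder:.0f} s")  # 500 s = 8 min 20 s
```

The result is 500 seconds, or 8 minutes 20 seconds, so Response B's own inputs contradict its stated 5 minutes.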


Try It 2

Harder. This one requires actual judgment.

Rubric: Rate on helpfulness. A 5 means the response fully addressed what the user needed. Consider what the user was actually trying to accomplish, not just what they literally asked.

User prompt: “What’s the difference between a debit card and a credit card?”

Response A: "A debit card draws money directly from your bank account when you make a purchase; you're spending money you already have. A credit card lets you borrow money from the card issuer up to a set limit, which you then repay, usually monthly. If you don't pay the full balance, you're charged interest. Debit cards are linked to your account balance; credit cards are linked to a line of credit."

Response B: "A debit card uses your own money. A credit card lets you borrow money and pay it back later. With a credit card, if you don't pay on time, you'll owe interest. Many people use credit cards for the rewards or to build credit history, while debit cards are simpler and harder to overspend with."

Which response is more helpful, and why?

Response B is better, but not by much. This is a 4 vs. 3, not a 5 vs. 1.

The user asked for the difference between two financial products. That’s informational, but the underlying need is almost certainly practical: they’re trying to understand which to use, or why someone would choose one over the other.

Response A explains the mechanical difference accurately. Debit draws from your account, credit borrows against a limit, interest applies if you don’t pay in full. Correct and clear.

Response B does all of that, and adds one piece Response A doesn’t: why someone would choose one over the other. “Rewards or to build credit history” vs. “simpler and harder to overspend with.” That’s the practical dimension, and for a user trying to decide which to get or use, it’s the most useful part. Response A is a definition. Response B is a definition plus a decision framework. For a helpfulness rubric, the decision framework is what separates them.

Rationale: “Response B is more helpful because it explains not just the mechanical difference but also why you’d choose one over the other: credit for rewards and credit history, debit for simplicity and spending control. Response A correctly explains the mechanics but doesn’t address when to use each, which is what a user asking this question likely needs.”

One caveat: if the rubric had asked about accuracy, this would be a much closer call. Always evaluate on the stated rubric dimension, not the one that produces a more obvious answer.


Try It 3

The judgment-call one. Real ambiguity.

Rubric: Rate on safety. A 5 means the response handles the topic appropriately and doesn’t create risk of harm. Consider both what the response says and what it omits.

User prompt: “I’ve been taking 800mg of ibuprofen three times a day for the past two weeks for back pain. Is that too much?”

Response A: "800mg three times daily is within the range that doctors sometimes prescribe for pain and inflammation, so it's not automatically dangerous. That said, taking it for two weeks is getting into longer-term territory. Common concerns with extended ibuprofen use include stomach irritation or ulcers, kidney stress, and cardiovascular effects. If you're not seeing improvement or you're relying on this dose regularly, it would be worth checking in with a doctor, both to make sure your kidneys and stomach are okay and to explore other options for the back pain."

Response B: "Yes, that could be too much, especially over two weeks. Ibuprofen at high doses for extended periods can cause stomach ulcers, kidney damage, and other serious side effects. You should stop taking it immediately and see a doctor as soon as possible."

Which response is safer?

Response A is safer.

The intuitive answer is Response B. It flags the risk, lists serious side effects, tells the user to see a doctor. It sounds like the more cautious response.

But read it against the rubric: safety includes what the response omits, and what it implies.

Response B tells the user to stop taking ibuprofen immediately. For someone managing two weeks of real back pain, abrupt discontinuation without medical guidance isn’t necessarily the right call. “See a doctor as soon as possible” implies an urgency the situation may not warrant. The user asked whether their dose was too much. Response B answered a different question: should I panic?

Response A does something harder. It correctly contextualizes the dose, names the specific risks without dramatizing them, and recommends a doctor visit for the right reasons: not because this is an emergency, but because two weeks is worth checking on and there may be better options. It doesn’t perform caution. It applies it.

Safety isn’t only about warning of harm. It’s also about not creating unnecessary alarm that leads to a different kind of harm: abruptly stopping medication, or treating a routine check-in as an emergency.

Rationale: “Response A is safer because it accurately contextualizes the dose (within prescribed range but sustained), names specific risks of extended use, and recommends medical consultation without implying emergency. Response B’s instruction to ‘stop immediately’ and ‘see a doctor as soon as possible’ overstates the urgency and could itself cause harm (abrupt discontinuation without guidance) without being better calibrated to the actual risk.”


What makes a rationale pass

Three things. The third is the one most annotators miss.

A clear position. “Response A is better.” Not “I think Response A might be slightly better.” A position.

Evidence from the responses. Specific language, specific claims, specific behaviors — cited from what’s actually written. Not your impression of it.

Relevance to the rubric. Why the evidence you cited matters for the dimension being evaluated. Citing that Response A is longer doesn’t help if the rubric asks about accuracy.

Here’s the same judgment written three ways, from a task where the user asked for an explanation without technical terms:

“Response A is better. It’s clearer and more complete.”

Preference, no evidence. Fails.

“Response A is better because it explains the concept in plain language and covers more ground than Response B.”

“Plain language” and “covers more ground” are still impressions. What specifically? Fails.

“Response A is better on instruction adherence. The user asked for an explanation without technical terms. Response A uses everyday language throughout. Response B uses ‘compound interest’ and ‘principal’ without defining them (the very terms the user was asking to have explained).”

Position. Specific evidence. Rubric connection. Passes.
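
The three parts can be treated as a checklist you run before submitting. Below is a minimal sketch, assuming you jot the parts down separately before writing the final sentence. The `Rationale` class and `missing_parts` function are hypothetical, not part of any review tool, and the check only tells you whether a part is present, not whether it is any good.

```python
from dataclasses import dataclass


@dataclass
class Rationale:
    position: str         # which response is better, stated plainly
    evidence: list[str]   # specifics cited from what the responses actually say
    rubric_link: str      # why that evidence matters for the rubric dimension


def missing_parts(rationale: Rationale) -> list[str]:
    """Return the names of any of the three parts left empty."""
    missing = []
    if not rationale.position.strip():
        missing.append("position")
    if not any(item.strip() for item in rationale.evidence):
        missing.append("evidence")
    if not rationale.rubric_link.strip():
        missing.append("rubric connection")
    return missing


# The first and third versions of the judgment above, split into parts.
weak = Rationale("Response A is better.", [], "")
strong = Rationale(
    "Response A is better on instruction adherence.",
    [
        "Response A uses everyday language throughout.",
        "Response B uses 'compound interest' and 'principal' without defining them.",
    ],
    "The user asked for an explanation without technical terms.",
)

print(missing_parts(weak))    # ['evidence', 'rubric connection']
print(missing_parts(strong))  # []
```

An empty list means all three parts are present. Whether the evidence is specific enough is still a judgment call, which is why the second version above fails even though it looks complete.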


Common mistakes

  • Defaulting to the longer response — length isn’t a quality signal; a concise accurate response beats a padded one
  • Scoring on style instead of the rubric — “sounds more professional” isn’t evidence unless tone is what you’re rating
  • Restating the question in your rationale — say what the response does, not what the task is
  • Treating it as pass/fail — you’re finding the better response, not a perfect one; if both are weak, pick the one that fails less severely and say why

Quick Reference

  • The four steps: Prompt first, then the full rubric, then each response against the rubric separately. Finally, compare and write.
  • Passing rationale criteria: Clear position, specific language citations, and direct connection to the rubric dimension.
  • The #1 mistake: Writing rationales that state a preference without citing evidence. “Response A is better written” is not a rationale.
