Before you read anything, try this first
A company is reviewing two annotators’ rationales for the same task. The task was: rate which of two AI responses better answers the question “What causes inflation?” Both annotators preferred Response 1. Here is what each wrote.
Annotator A: “Response 1 is better. It’s clearer and easier to understand.”
Annotator B: “Response 1 is better on completeness and accuracy: it explains several causes of inflation and describes each one correctly, while Response 2 covers only a single cause and leaves most of the question unanswered.”
Pick the better annotator.
See the answer
Annotator B, by a wide margin.
Here's what a reviewer sees when they read Annotator A's rationale: "This person has a preference but can't explain it." Words like "clearer" and "easier to understand" are judgments without evidence. They don't tell the platform anything useful about whether the annotator understood the content.
Annotator B does three things A doesn't: names specific criteria (completeness, accuracy), cites what's in the responses, and explains why one falls short. That is a rationale that can train a model. Annotator A's is noise.
Strategic Insight: The explanation matters more than the choice. A correct preference with a "wall-of-filler" rationale is a failed task.
You’ve probably used ChatGPT or Claude. Those tools didn’t become useful on their own. Someone has to teach them what “good” looks like.
That someone is you, if you do this work.
Here’s what it involves: you read an AI-generated response to a question, evaluate it against a set of criteria, and record your judgment. Usually that means picking the better of two responses, rating a single response on a scale, or writing what a good answer should look like.
The AI learns from your judgment. Not from yours alone: thousands of annotators work through millions of tasks, and that collective feedback shapes how the model behaves.
This sounds straightforward. It isn’t.
Most people think annotation means giving a thumbs up or down. It doesn’t. You aren’t judging whether you personally like the response. You are applying a specific rubric (a detailed set of criteria) and explaining your reasoning in writing. That last part is what most applicants get wrong.
You don’t need to understand machine learning to do this. You need to read carefully, think clearly, and be able to explain your thinking on paper.
What the work looks like
Every major platform uses three types of tasks. You will encounter all of them, but you will start with the first one.
1. Pairwise comparison (most common)
You’re shown a prompt and two AI responses. You pick the better one and write a rationale explaining why.
Example: A user asked the AI: “Explain compound interest to a 12-year-old.” You read both responses and decide which one is more accurate, more appropriate for the audience, and better follows the instructions. Then you write a sentence or two explaining your reasoning.
This is where most annotators start.
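To make concrete what your submission becomes, here is a hypothetical record for a single pairwise task. The field names and the example rationale are illustrative assumptions, not any platform’s actual schema; the point is that your choice and your written rationale travel together as the training signal.

```python
# A hypothetical pairwise-comparison record; the field names are illustrative,
# not any platform's actual schema.
pairwise_task = {
    "prompt": "Explain compound interest to a 12-year-old.",
    "response_a": "...",  # first AI response shown to you
    "response_b": "...",  # second AI response
    "choice": "a",        # which response you preferred
    "rationale": (
        "Response A uses a concrete example (money in a savings account growing "
        "each year) and avoids jargon; Response B states the formula but never "
        "explains it at a 12-year-old's level."
    ),
}
```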
2. Response rating
Once you’ve shown you can do pairwise comparison consistently, platforms offer rating tasks. You’re shown a single AI response and asked to score it on one or more dimensions: accuracy, helpfulness, safety, format. This is usually on a 1–5 scale, sometimes with a written justification.
Example: Rate this response to a medical question on a scale of 1–5 for accuracy. Does it give correct information? Does it recommend seeing a doctor when appropriate?
This requires the same skill as pairwise comparison, but you can’t hide behind the comparison. You have to justify a standalone judgment.
3. Writing demonstrations
Some platforms ask you to write a response yourself, from scratch or by editing an AI draft, to show what a good answer looks like. These teach the model the format and quality level it’s aiming for.
Example: Write a response to this customer service question in a friendly but professional tone, under 150 words, without making any promises about refunds.
This one’s harder because you’re not evaluating someone else’s work. You are producing work that has to meet the rubric. Most annotators don’t encounter this type until they’ve proven themselves on the first two.
Why some annotators earn much more than others
This work isn’t one job. It’s a spectrum. Which tier you can access determines how much you can earn.
The gap isn’t about working harder or faster. It’s about what you can credibly evaluate. A generalist annotator and a medical doctor can both rate a medical response, but only the doctor can catch a subtly wrong dosage recommendation. Platforms know this. They pay accordingly.
| Income Tier | Who gets it | What the work involves |
|---|---|---|
| Generalist | Anyone who passes basic screening | Pairwise comparison, general response rating, writing tasks without domain requirements |
| Specialist | People with a specific skill: a programming language, a second language, a professional domain | Code review, translation evaluation, domain-specific Q&A |
| Subject Matter Expert | People with verifiable professional backgrounds: doctors, lawyers, engineers, researchers | Clinical note evaluation, legal reasoning review, advanced STEM tasks |
If you have a professional background, a STEM degree, legal training, or fluency in a non-English language, that’s not a bonus. It’s a different income tier. You should know that before you apply.
One important caveat: tier access is necessary but not sufficient. A medical doctor who writes vague rationales will earn less than a generalist who writes precise ones. The tier determines which tasks you can access. Your earnings depend on how well you do the work.
How platforms decide who gets work
Most annotators think they’re being tested on speed. Or on whether they agree with the majority. Or on whether they’ve done this work before.
None of those are what’s actually being measured.
Platforms test four things:
Calibration: Do your judgments align with those of experienced annotators on the same tasks? If everyone else rates a response 2/5 and you rate it 5/5, that’s a signal your evaluation criteria are off, not that you’re uniquely insightful.
Platforms need consistency across thousands of annotators. One outlier doesn’t break the system. A pattern of outliers does.
Warning: High speed often causes calibration drift. Most platforms prefer a slow, calibrated annotator over a fast, uncalibrated one.
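As a rough mental model, calibration amounts to comparing your ratings with the consensus of experienced annotators on the same tasks and looking for a pattern of large gaps. This is a minimal sketch with made-up numbers and a hypothetical metric, not any platform’s actual formula.

```python
from statistics import mean

def calibration_gap(annotator_scores, consensus_scores):
    """Average absolute distance between one annotator's ratings and the
    consensus ratings on the same tasks (hypothetical metric, 1-5 scale)."""
    return mean(abs(a - c) for a, c in zip(annotator_scores, consensus_scores))

consensus = [2.1, 4.0, 3.2, 1.8, 4.5]  # average of experienced annotators
annotator = [5, 4, 5, 2, 5]            # consistently rates above consensus
print(round(calibration_gap(annotator, consensus), 2))  # 1.08: a pattern, not one-off noise
```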
Rationale quality: Can you explain your judgment specifically and with evidence?
“This response is better because it’s clearer” — platforms read this and mark you unreliable. You’ve stated a preference without citing anything from the actual responses.
“This response is better because it directly answers the question asked, while the other response answers a related but different question” — this passes. You’ve cited what’s in the responses and explained why it matters.
Instruction adherence: Did you follow the task rubric closely? Every task comes with detailed criteria. Platforms test whether you read and apply them, or whether you skip to your gut feeling. This is where most annotators fail without realizing it.
Consistency: Do you make the same judgment on equivalent tasks? Platforms deliberately repeat similar tasks during screening to catch inconsistency. If you rate two nearly identical responses differently on different days, that’s noise the model can’t learn from.
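Consistency can be pictured the same way. Here is a minimal sketch, assuming screening repeats a lightly reworded task and simply compares the two ratings; the data and the one-point threshold are illustrative, not a real platform’s check.

```python
def inconsistent_pairs(repeat_pairs, tolerance=1):
    """Return repeated-task pairs whose two ratings differ by more than
    `tolerance` points (hypothetical check, 1-5 scale)."""
    return [(first, second) for first, second in repeat_pairs
            if abs(first - second) > tolerance]

# Each pair is (rating on first pass, rating on the lightly reworded repeat).
pairs = [(4, 4), (2, 2), (5, 2), (3, 4)]
print(inconsistent_pairs(pairs))  # [(5, 2)]: a 3-point swing on an equivalent task
```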
Speed is not the priority. A slow annotator who writes precise rationales is far more valuable than a fast one who writes vague ones.
That said, there’s a minimum throughput — if you complete only a handful of tasks during screening, platforms don’t have enough data to assess your consistency. Aim for steady, not fast.
The acceptance funnel
Most platforms don’t publish acceptance rates. The pattern across all of them: many apply, a fraction pass initial screening, a smaller fraction pass the skills test, a subset of those get consistent work.
The people who drop out aren’t less intelligent. They’re making one of a small number of fixable mistakes, and usually the same ones.
“This response is clearer.” You’ve stated a preference without evidence. There’s no reasoning, nothing cited from the actual responses. Platforms see this and move on.
“Response A is better written.” Style preference bleeding into objective evaluation. The rubric might not mention writing quality at all. You’ve evaluated the wrong dimension, and you won’t know you did it.
“Response A is longer and more detailed.” You’ve confused length with quality. The rubric might explicitly penalize over-explanation. You applied your own assumption about what “better” means instead of reading the criteria.
“I think Response B is probably better, but I’m not totally sure.” Hedging kills your calibration score. Platforms need confidence tied to evidence. If you’re genuinely unsure, re-read the rubric. Submitting a hedged judgment signals you don’t know what you’re evaluating.
“Response A sounds more professional.” Personal preference. The rubric doesn’t mention tone. You applied your own criteria instead of the ones you were given.
These aren’t random failures. They’re concentrated in specific behaviors, all of which are learnable. The pattern: the people who pass apply the rubric, explain their reasoning with evidence, and stay consistent.
It sounds obvious. It isn’t. Most annotators write rationales like the ones above and don’t realize they’re doing it.
Try It 1
A platform asks you to rate the following AI response on instruction adherence (whether the response followed the user’s instructions) on a scale of 1 (completely ignored) to 5 (followed perfectly).
User instruction: “Explain photosynthesis in two sentences, using no scientific jargon.”
AI response: “Photosynthesis is the process by which plants use sunlight, water, and carbon dioxide to produce oxygen and energy in the form of glucose. This process occurs in the chloroplasts of plant cells, where chlorophyll absorbs light energy to drive the chemical reactions.”
What score would you give, and why?
See the answer
Score: 2/5
The response fails the instruction in a specific way: it uses scientific jargon throughout. “Chloroplasts,” “chlorophyll,” “glucose,” “chemical reactions”: these are terms a non-specialist wouldn’t know. The instruction said no jargon. That’s the constraint, and it was ignored.
It does meet the length requirement (two sentences), so it’s not a 1. But the most specific constraint (no jargon) was violated in both sentences, which puts it at 2.
Notice what this score is not about: the response is accurate, well-organized, and informative. None of that matters when the rubric asks specifically about instruction adherence. Evaluating the wrong dimension (scoring accuracy when the task asks for adherence) is one of the most common reasons annotators fail screening without understanding why.
Try It 2
You’re doing pairwise comparison. Both responses answer the question: “Is it safe to take ibuprofen and acetaminophen together?”
Response A: “Ibuprofen and acetaminophen can be taken together because they work on different pain pathways. Just follow the dosage on the bottle.”
Response B: “They can generally be taken together since they work through different mechanisms. However, you should check with a doctor or pharmacist if you have liver or kidney problems, take other medications, or plan to use them regularly.”
Which is better, and what’s your rationale?
See the answer
Response B is better.
Response A isn’t wrong: it’s friendly, concise, and gives the correct core answer. But for a health question, “follow the dosage on the bottle” is generic advice that doesn’t address the real risk factors. It tells you nothing about when taking both might be a problem.
Response B gives the same core answer (yes, different mechanisms, generally fine) but adds specific conditions under which the answer changes: liver or kidney problems, other medications, regular use. That’s the information that matters for someone asking this question.
A good rationale here: “Response B is better because it identifies the specific situations where the combination could be risky (existing conditions, other medications, and regular use), which Response A’s generic ‘follow dosage’ instruction doesn’t cover.”
Notice what you’re doing: you’re not saying B is better because it’s longer, or more detailed, or more professional. You’re saying it’s better because it answers the follow-up question the user didn’t know to ask.
Try It 3
Now try the third task type. A platform asks you to write a response to this customer service message:
User: “I ordered something three days ago and it still hasn’t arrived. What should I do?”
Rubric: Friendly tone, acknowledge the concern, provide specific next steps, under 100 words, do not promise a refund or specific timeline.
See an example response
“Thanks for reaching out; I’m sorry to hear your order hasn’t arrived yet. Three days is within our normal delivery window, but here’s what I’d suggest: first, check your confirmation email for a tracking link. If the tracking shows it’s still in transit, give it one more business day. If there’s no movement or no tracking info, reply here with your order number and we’ll look into it right away.”
What makes this work against the rubric:
- Friendly tone ✓ (“Thanks for reaching out,” “I’m sorry to hear”)
- Acknowledges the concern ✓ (doesn’t dismiss or immediately deflect)
- Specific next steps ✓ (check tracking, wait one day, then contact with order number)
- Under 100 words ✓ (72 words)
- No refund or timeline promise ✓ (“normal delivery window” and “one more business day” are hedged, not guaranteed)
What would fail the rubric:
“Your order is on its way! It should arrive soon. If you have any questions let us know.” — vague, no next steps, doesn’t acknowledge the concern.
“We can offer you a full refund if it doesn’t arrive by tomorrow.” — promises a refund and a specific timeline. Direct rubric violation.
Writing demonstrations are graded against the rubric criteria, not against your general sense of what sounds good. The same principle applies here as in pairwise comparison: apply the criteria, don’t substitute your own.
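Part of that rubric is mechanical, and it helps to notice which parts those are. The sketch below checks only the countable criteria from the example above (word limit, banned promise words); the banned-phrase list is an assumption for illustration, and tone, acknowledgement, and “specific next steps” still require human judgment.

```python
def mechanical_rubric_check(response, max_words=100,
                            banned_phrases=("refund", "guaranteed", "will arrive by")):
    """Check only the rubric criteria that can be verified by counting or matching."""
    word_count = len(response.split())
    found = [phrase for phrase in banned_phrases if phrase in response.lower()]
    return {
        "word_count": word_count,
        "under_limit": word_count <= max_words,
        "banned_phrases_found": found,
    }

draft = ("Thanks for reaching out; I'm sorry to hear your order hasn't arrived yet. "
         "Check your confirmation email for a tracking link, and if there's no movement "
         "in one more business day, reply here with your order number.")
print(mechanical_rubric_check(draft))
# {'word_count': 37, 'under_limit': True, 'banned_phrases_found': []}
```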
Quick Reference
- What annotation is: Reviewing AI responses and providing structured feedback so models learn what good outputs look like.
- The three task types: Pairwise comparison (picking between two), response rating (scoring one), and writing demonstrations (creating one from scratch).
- What gets you hired: Rationales that cite specific evidence. “Response A is better because it correctly explains X while Response B incorrectly states Y” passes. “Response A is clearer” doesn’t.