aitrainer.work - AI Training Jobs Platform
F6: Advanced Instruction Following
F6 Core
⏱️ 1.5 hours 🎯 Intermediate

Advanced Instruction Following

Master the skill of parsing complex, multi-part instructions, handling constraint conflicts, and evaluating whether AI responses genuinely follow instructions or only appear to.

Test Your Knowledge
Before you read anything — make a call

The instruction for a task reads:

“Explain what a binary search tree is. Do not use bullet points. Keep the explanation under 80 words.”

Here are two AI responses. Which one violates the instruction?

Response A: “A binary search tree (BST) organizes data so that for any node, smaller values sit to the left and larger values to the right. This structure makes searching efficient — you eliminate half the remaining nodes at each step. BSTs are common in databases and file systems, though performance degrades if the tree becomes unbalanced.” (58 words, no bullet points.)

Response B: “A binary search tree stores data with a key property: each node’s left child holds a smaller value, and its right child holds a larger one. This makes lookups, insertions, and deletions fast — typically O(log n) for balanced trees.

Key characteristics:

  • Left subtree values are always smaller than the parent node
  • Right subtree values are always larger
  • Searching eliminates half the tree at each step”

Which response violates the instruction, and what did it violate?

See the answer

Response B violates the instruction. It uses bullet points, which are explicitly prohibited.

Response A satisfies both hard constraints: 58 words (under 80) and prose format (no bullets). It may not be perfect. You could argue it glosses over certain details. But it follows the instruction.

Response B is arguably the better explanation. The bullet points make the key characteristics clearer. The O(log n) callout is a useful addition. None of that matters. The instruction said “do not use bullet points.” That constraint is binary: met or not met. A response can be excellent in every other dimension and still fail on this one.

Compliance and quality are separate things. You score them separately. One does not redeem the other.
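That separation can be sketched as a scorecard structure. This is a minimal illustration, not a real rubric; the field names and the 1–5 quality scale are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Scorecard:
    # Compliance: binary per explicit constraint.
    # Any False fails the task, regardless of quality.
    no_bullet_points: bool
    under_80_words: bool
    # Quality: graded on its own axis; it cannot redeem a failed constraint.
    quality_1_to_5: int

    def complies(self) -> bool:
        return self.no_bullet_points and self.under_80_words

# Response B: excellent explanation (quality 5), but it used bullet points.
response_b = Scorecard(no_bullet_points=False, under_80_words=True, quality_1_to_5=5)
print(response_b.complies())  # False: high quality does not rescue the violation
```

The point of the structure: compliance fields are booleans, quality is a grade, and no arithmetic combines them.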


The mistake experienced annotators still make

A response can be accurate, well-written, genuinely helpful, and still fail the task because it ignored a constraint. A response can follow every instruction to the letter and still be a weak answer. These get scored on different rubric dimensions.

Conflating them is one of the most common reasons experienced annotators get flagged. Marking a hard violation as merely a “quality issue,” or letting a great response slide past a constraint breach, happens because the instinct to reward quality is strong. Overriding that instinct requires deliberate attention to the rubric structure.


Parsing multi-part instructions

The method first: before you read the AI response, number every distinct constraint in the instruction. After you’ve read the response, go through your list one by one. Don’t mark a constraint satisfied until you’ve found the specific part of the response that satisfies it.

Here is why that order matters. Take this instruction:

“Write a professional email declining a job offer. Keep it under 150 words. Do not mention salary as a reason. Express genuine appreciation. Use a formal but warm tone. End with an offer to stay in touch.”

A fast reader sees “decline a job offer email.” A careful reader, numbering as they go, finds six distinct constraints:

  1. Genre: professional email
  2. Length: under 150 words (hard limit)
  3. Content exclusion: no mention of salary as a reason
  4. Sentiment: genuine appreciation, not perfunctory
  5. Tone: formal but warm simultaneously
  6. Closing: end with an offer to stay in touch

An AI that writes a beautiful, warm 200-word email has violated constraint 2. An AI that stays under 150 words but sounds curt has violated constraints 4 and 5. Partial compliance is not compliance, and you won’t catch it unless you check each constraint against the response rather than against your general impression of it.
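The mechanically verifiable subset of that checklist can even be expressed as code. A rough sketch with hypothetical names; judgment-based constraints like tone and genuine appreciation still require a human:

```python
import re

def check_mechanical_constraints(response: str) -> dict:
    """Check only the constraints a machine can verify for the
    job-offer-decline instruction. Tone, warmth, and sincerity
    still require human judgment."""
    return {
        "under_150_words": len(response.split()) < 150,
        "no_salary_mention": "salary" not in response.lower(),
        "offers_to_stay_in_touch": bool(
            re.search(r"(stay|keep) in touch", response, re.IGNORECASE)
        ),
    }

email = ("Thank you so much for the offer. After careful thought, "
         "I have decided to pursue another opportunity. I truly "
         "appreciate your time and would love to stay in touch.")
print(check_mechanical_constraints(email))
```

Each dictionary key maps to one numbered constraint, which mirrors the discipline the method asks for: one check per constraint, verified against the text itself.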


Try It: parse the constraints

You receive the following instruction for a task evaluation:

“Write a customer support reply to a user who received the wrong item. Keep it under 100 words. Apologize sincerely without using the word ‘sorry.’ Offer a replacement or refund. Do not mention processing times or shipping estimates. Use second person throughout.”

List every distinct constraint you can identify. How many are there?

See answer

There are seven distinct constraints (six if you count both content exclusions as one criterion — but they target different content and should be checked independently):

  1. Genre/purpose: customer support reply (not a generic message)
  2. Length: under 100 words (hard limit)
  3. Sentiment: sincere apology (quality bar on tone)
  4. Content exclusion: the word “sorry” must not appear
  5. Content requirement: must offer a replacement or a refund (either option, or presenting the choice, satisfies this)
  6. Content exclusion: no mention of processing times or shipping estimates
  7. Voice: second person throughout (“you,” “your” — not “the customer” or third person)

Annotators who parse this as “apologize and offer a solution” are missing at least half the constraint set. Every one of these must be checked independently when evaluating an AI response.


When to infer vs. when to flag

Instructions are sometimes ambiguous. The decision rule is simple. You should infer when the ambiguity has one obvious resolution and asking would waste everyone’s time. You should flag when two interpretations would produce substantially different outputs and a wrong guess could invalidate the response entirely.

In practice, the harder judgment is recognizing which situation you’re in.

A prompt says “Summarize this document for a non-expert audience.” The document is a clinical trial report. The instruction doesn’t specify length. You can reasonably infer “summary” means condensed (one to three paragraphs) and that “non-expert” means avoid jargon. Don’t stop and ask “how many words?” Apply your best judgment and proceed. The ambiguity has one reasonable resolution, and asking would signal you didn’t read carefully.

Now change the context. The prompt is the same, but it says “Summarize this for the CEO’s board presentation.” The document reports conflicting results across trial arms. If you’re unsure whether to include the unfavorable results, that’s worth flagging. Omitting them could constitute a material omission in a high-stakes context. That is a judgment call that could make the entire response misleading. You can’t resolve it by inference, because both interpretations are reasonable and the cost of guessing wrong is high.

The difference isn’t ambiguity level. Both instructions are vague about scope. The difference is the cost of a wrong assumption.


When constraints appear to conflict

The hardest tasks require satisfying multiple constraint types simultaneously. Sometimes those constraints look incompatible.

A 100-word response that must cover five complex points thoroughly will be superficial on each. “Explain this to a 10-year-old in a formal register” creates genuine tension, as formal language is rarely simple. “Cover eight items in flowing prose under 300 words” means something has to give.

When you see this, your job as an evaluator is not to pick a winner. It is to assess whether the AI made a reasonable and transparent trade-off. A response slightly over the word limit that achieves all other constraints may be preferable to one that hits the limit by omitting crucial content.

But hard constraints are hard. If the instruction says “do not include any code,” a response with code violates it regardless of quality. If it says “respond only in French,” a response in English fails regardless of accuracy. The apparent flexibility of adjacent constraints doesn’t dilute an explicit hard one.

The test for any apparent conflict: can these constraints be satisfied simultaneously with good craft? “Concise” means no padding, not short. “Comprehensive” means all required points, not exhaustive. Most apparent conflicts resolve on closer reading. Only escalate when satisfying one constraint necessarily breaks another.


Try It: violation or quality issue?

The instruction was: “Explain what a binary search tree is. Do not use bullet points. Keep the explanation under 80 words.”

The AI responded with:

“A binary search tree (BST) is a data structure where each node has at most two children. For any node, all values in the left subtree are smaller, and all values in the right subtree are larger. This property makes searching, insertion, and deletion efficient — typically O(log n) for balanced trees. BSTs are widely used in databases and file systems.” (62 words, no bullet points.)

Classify each observation as either an instruction VIOLATION or a quality issue:

  • A) The response does not define what a “node” is for a non-technical reader.
  • B) The response uses passive voice (“are smaller,” “are larger”).
  • C) The response does not mention tree traversal methods.
  • D) The response is 62 words and uses prose, not bullet points.

See answer

A) Quality issue. The instruction did not specify the audience. Not defining “node” may be a clarity weakness, but it’s not a violation of any explicit constraint.

B) Quality issue. The instruction did not require active voice. Passive constructions may reduce readability, but passive voice is not prohibited here.

C) Quality issue. The instruction did not require tree traversal. Omitting it reflects a content choice, not a constraint breach.

D) Neither — this is compliance. The response satisfies both hard constraints: under 80 words (62 words) and no bullet points.

Violations require a specific explicit constraint that was broken. Everything else — clarity, completeness beyond what was specified, stylistic choices — is a quality judgment scored on a different rubric axis. Keep them separate on the scorecard, because they are separate questions.
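For the two hard constraints in this instruction, the compliance check can even be automated. A crude sketch; the bullet heuristic only catches common markers and is an assumption, not a production detector:

```python
def find_violations(response: str) -> list[str]:
    """Return the explicit constraints this response breaks for the
    instruction 'no bullet points, under 80 words'. Bullet detection
    is a rough heuristic: lines starting with •, -, or *."""
    violations = []
    if len(response.split()) >= 80:
        violations.append("80-word limit exceeded")
    if any(line.lstrip().startswith(("•", "- ", "* "))
           for line in response.splitlines()):
        violations.append("uses bullet points")
    return violations

bulleted = "Key characteristics:\n• Left values are smaller\n• Right values are larger"
print(find_violations(bulleted))  # flags the bullet points
```

Notice what the function cannot do: it has no opinion on clarity, audience fit, or completeness. Those quality judgments live on a different axis entirely.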


Try It: resolve a constraint conflict

The instruction reads: “Write a brief summary of this legal case. Be concise. Include all relevant legal arguments from both sides.”

A junior annotator flags this as an invalid instruction because “be concise” and “include all relevant legal arguments from both sides” conflict. The case involves four distinct arguments per side.

Is this instruction contradictory? How should you evaluate an AI response against it?

See answer

The instruction is not contradictory. It requires careful scoping.

“Concise” does not mean “short.” It means “free of unnecessary content.” An AI can be concise while still covering all relevant arguments if it doesn’t pad, repeat itself, or include irrelevant background.

“All relevant legal arguments” has a scope qualifier: relevant. An AI that identifies the four key arguments per side and states each in one tight sentence satisfies both constraints simultaneously.

Check the AI response for two things. Did it omit any required arguments (violation of completeness)? Did it include unnecessary content that inflated length (violation of conciseness)? Either one is a distinct failure. Both can be true at once.

Only escalate when satisfying one constraint necessarily means violating another. For example, “use 3 sentences” AND “cover 10 distinct points” genuinely cannot coexist.


Try It: parse a rubric criterion

Real annotation rubrics specify criteria with dependencies and types. Read this criterion carefully:

Criterion: Factual Accuracy (Primary objective, Extraction type) Applies only if the response attempts to answer the factual question. If the response declines to answer, score this criterion N/A.

A user asked: “What is the boiling point of water at sea level?”

You’re evaluating two AI responses against this criterion:

Response A: “I’d rather not answer specific science questions as they can vary by context.”

Response B: “Water boils at 100°C (212°F) at sea level under standard atmospheric pressure.”

For each response: does the Factual Accuracy criterion apply? If it applies, what’s the score?

See answer

Response A: The criterion does not apply, so score it N/A. The response declined to answer the factual question. Applying the Factual Accuracy criterion here would be a scoring error, even though the response is arguably unhelpful.

Response B: The criterion applies, and the score is high (4-5/5). The response correctly states 100°C / 212°F at sea level under standard atmospheric pressure. The caveat about atmospheric pressure is accurate and appropriate. It acknowledges the condition under which the claim is true without hedging unnecessarily.

This is what a dependent criterion looks like in practice. Mark a criterion N/A when it doesn’t apply. That isn’t avoiding responsibility; that is reading the rubric correctly.

The practical consequence is that if you score Response A on Factual Accuracy and give it a 1, you’ve penalized the response on a dimension the rubric says not to evaluate. That is a scoring error on your record, not a quality flag on the AI.
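The dependency logic reads naturally as a guard clause. A toy sketch; the function name and the 1–5 scale are hypothetical, not a real rubric’s API:

```python
def score_factual_accuracy(attempted_answer: bool, factually_correct: bool):
    """Dependent criterion: Factual Accuracy applies only if the
    response attempts the question. A decline is N/A, never a 1."""
    if not attempted_answer:
        return "N/A"
    return 5 if factually_correct else 1

print(score_factual_accuracy(attempted_answer=False, factually_correct=False))  # Response A: N/A
print(score_factual_accuracy(attempted_answer=True, factually_correct=True))    # Response B: 5
```

The guard clause comes first for a reason: applicability is decided before quality is ever assessed, which is exactly the reading order the rubric demands.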


Quick Reference

  • Compliance and quality are separate scores: A response can be excellent and still violate a hard constraint. A response can follow every instruction and still be a weak answer. Score them on separate rubric dimensions, and don’t let one contaminate the other.
  • How to parse multi-part instructions: Number every distinct constraint before you start evaluating. Check each one independently after reading the response. Don’t mark a constraint satisfied until you’ve found the specific part of the response that satisfies it.
  • When constraints appear to conflict: Ask whether they can be satisfied simultaneously with good craft. “Concise” means no padding, not short. “Comprehensive” means all required points, not exhaustive. Most apparent conflicts resolve on closer reading. Only escalate when satisfying one constraint necessarily breaks another.

Test Your Knowledge


Question 1 of 8

1. An instruction says “respond in 3 bullet points, each under 20 words.” The AI responds with 3 bullets averaging 25 words each. This is best classified as what?

