Before you read anything, make a call
You're evaluating an AI response about HIPAA penalties. You find two sources:
Source A, a legal news article from 2022:
"HIPAA civil penalties can reach up to $50,000 per violation."
Source B, the HHS (Department of Health and Human Services) official guidance page:
"Penalties per violation range from $100 for unknowing violations up to a maximum of $50,000, with an annual cap of $1.5 million per violation category."
The AI response said: "The maximum HIPAA civil penalty is $50,000 per violation."
Is the AI response correct? And which source do you cite?
The AI response is incomplete, not wrong. Source B is the one you cite.
Source A is a news article (Tier 3). Source B is official regulatory guidance from the governing body (Tier 1). When those two conflict, Tier 1 wins. This isn't because news articles are always wrong; it's because the regulatory body is the authority on its own rules.
Insight: The core skill here is not just finding a number. You need to understand what the source says and what it doesn't. Apparent conflicts between sources are often just a matter of a complete account versus a partial one.
Prove your answer
Platforms distinguish between two types of annotation work. Most tasks ask for your judgment. You might decide which response is more helpful, or which tone fits the context. A subset of tasks, usually the highest-paying ones, ask for something harder. You have to establish what is true.
That is ground truth. It is not a consensus of opinions. It is not what the AI generated. It is the answer that can be confirmed against a primary source, independent of anyone's judgment, including yours.
The reason expert-tier annotation pays more isn't just that it's complex. It's that the stakes of being wrong are higher.
A model trained on incorrect medical dosing or faulty legal reasoning will reproduce those errors at scale. Establishing ground truth in high-stakes domains requires knowing what to look for and where to look. You must be able to tell primary from secondary from unreliable sources.
Verifiable vs. judgment-dependent
The first skill is recognizing which type of question you're dealing with.
Verifiable questions have a single correct answer. You can confirm this against an authoritative source, independent of who's checking:
- "What is the capital gains tax rate for assets held under one year in the US?" → Verifiable against IRS Publication 550.
- "What is the time complexity of merge sort?" → Verifiable against any algorithms textbook (see the sketch after this list).
- "Under HIPAA, what is the minimum necessary standard?" → Verifiable against 45 CFR §164.502(b).
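To make the merge sort claim concrete: it is verifiable because the algorithm's structure forces the answer. A minimal Python sketch, illustrative only and not taken from any task:

```python
def merge_sort(xs):
    """Recurrence T(n) = 2*T(n/2) + O(n), which solves to O(n log n)."""
    if len(xs) <= 1:                 # base case: 0 or 1 elements are sorted
        return xs
    mid = len(xs) // 2
    left = merge_sort(xs[:mid])      # T(n/2)
    right = merge_sort(xs[mid:])     # T(n/2)
    merged, i, j = [], 0, 0          # merge: O(n) work at each of log n levels
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

assert merge_sort([5, 2, 8, 1, 9]) == [1, 2, 5, 8, 9]
```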
Judgment-dependent questions require weighing competing considerations where reasonable experts disagree:
- "Is this investment strategy appropriate for this client?" → Depends on undisclosed risk tolerance, time horizon, tax situation.
- "Is this legal argument persuasive?" → Depends on jurisdiction, judge, standard of review.
When tasks involve judgment-dependent questions, your job shifts. You aren't finding the one correct answer. You are evaluating whether the AI's reasoning is sound, identifies the key considerations, and acknowledges appropriate uncertainty.
Most annotation tasks mix both types. A response about a medication might contain a verifiable dosing claim embedded in judgment-dependent advice about whether to take it. The verifiable claim gets checked against a primary source. The judgment call gets evaluated on the quality of the reasoning. Don't confuse them. Each gets evaluated on its own terms.
The research hierarchy
When building ground truth, source quality determines whether your rationale holds up.
Tier 1 (Primary sources, always preferred): Statutes, regulations, and official government publications; peer-reviewed journal articles; official standards documents; court opinions and rulings; company filings.
Tier 2 (Authoritative secondary sources): Textbooks from major academic publishers when citing established principles; professional body guidance; central bank publications and major regulatory agency reports.
Tier 3 (Reference only, never citable): Wikipedia, news articles, blog posts, general AI-generated summaries.
Your rationale must trace to a Tier 1 or strong Tier 2 source. If you can't find one, that's important information. It means the claim may be contested, jurisdiction-dependent, or unknown.
Say so. "I couldn't find a primary source for this claim" is a valid and useful annotation. Inventing certainty is not.
Try It: choose the right source
You're reviewing an AI response that states: "Company X reported $4.2 billion in revenue for FY2023."
Which of the following sources would you use to verify this claim? Rank them from most to least appropriate.
- A) The company's Wikipedia page
- B) A Bloomberg news article reporting on the company's earnings call
- C) The company's 10-K annual report filed with the SEC (found on EDGAR)
- D) A blog post by a financial analyst summarizing the earnings
See answer
Best to least appropriate: C → B → D → A
C (10-K filing) is the definitive source. Revenue figures in a 10-K are audited, filed under penalty of law, and represent the primary regulatory record. This is your Tier 1 source. It is the one you cite in a rationale.
B (Bloomberg) is a strong Tier 2. It is a reputable financial news organization reporting on a specific filing. It's acceptable as corroboration. Always trace back to the 10-K itself if the claim matters.
D (analyst blog) is Tier 3. It may be accurate but introduces an interpretation layer. Analysts sometimes round, restate, or adjust figures. It is not citable in a rationale.
A (Wikipedia) is Tier 3 and the worst choice. Wikipedia's financial figures are often outdated, may use non-standard revenue definitions, and can be edited by anyone. Never cite Wikipedia for a financial fact in an annotation rationale.
The practical move is to go to EDGAR, find the 10-K, and search the document for "revenue" or "net sales." The exact figure is in the income statement. That's your citation.
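If you verify filed figures often, the same XBRL data behind EDGAR is also exposed programmatically. A minimal sketch, assuming the SEC's public companyconcept endpoint; the CIK and the us-gaap tag are placeholders you'd swap for the company you're checking (many filers report revenue under a different tag, such as RevenueFromContractWithCustomerExcludingAssessedTax), and the 10-K itself remains the citable source:

```python
import requests

# SEC's public XBRL API. The SEC asks for a descriptive User-Agent header.
# CIK 0000320193 and the "Revenues" tag are placeholders for illustration.
URL = ("https://data.sec.gov/api/xbrl/companyconcept/"
       "CIK0000320193/us-gaap/Revenues.json")

resp = requests.get(URL, headers={"User-Agent": "name@example.com"})
resp.raise_for_status()

# Each fact records the value, the fiscal year, and the form it came from,
# so you can keep only figures sourced from an audited 10-K.
for fact in resp.json()["units"]["USD"]:
    if fact.get("form") == "10-K":
        print(fact["fy"], fact["fp"], fact["val"])
```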
How to structure a rationale
A weak rationale like "Response A is more accurate" tells the platform nothing. A strong one shows the reasoning chain that produced your judgment. It should cover what the AI claimed, what the source says, and where they diverge.
In practice, that means hitting five things: the specific claim being evaluated, the controlling source, the relevant passage quoted or closely paraphrased, the discrepancy if there is one, and your verdict. Do not treat this as a checklist. Write it as a clear paragraph that a stranger could read and verify.
Here is what it looks like in a real finance task:
"The AI response states that goodwill must be amortized over its useful life. This is incorrect under US GAAP. Per ASC 350-20-35-1, goodwill is not amortized but is instead tested for impairment at least annually. The response would be accurate under IFRS for SMEs (Section 19.23), but the prompt specified a US publicly traded company, making ASC the applicable standard. Rating: Factually incorrect for the stated context."
That rationale names the specific standard, cites the paragraph, explains why jurisdiction matters, and gives a clear verdict. There is no ambiguity about what the annotator checked or why they concluded what they did.
That is the standard. Anything less is noise.
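If a pre-submission sanity check helps, the five elements are easy to encode. A throwaway Python sketch; the field names are ours, not any platform's:

```python
from dataclasses import dataclass

@dataclass
class Rationale:
    claim: str        # the specific claim being evaluated
    source: str       # the controlling source, named and citable
    passage: str      # the relevant passage, quoted or closely paraphrased
    discrepancy: str  # where claim and source diverge; empty if none
    verdict: str      # the rating

    def render(self) -> str:
        """Join the five elements into the single paragraph a reviewer reads."""
        parts = [
            f"The response claims: {self.claim}.",
            f'Per {self.source}: "{self.passage}".',
        ]
        if self.discrepancy:
            parts.append(f"Discrepancy: {self.discrepancy}.")
        parts.append(f"Rating: {self.verdict}.")
        return " ".join(parts)
```

An empty field is the signal to go back and research before you submit, not a slot to pad with filler.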
Domain-specific source types
Each expert domain has canonical sources. The ones below are where you go first. (For legal research specifically, note that Westlaw and LexisNexis require subscriptions most annotators won't have; Google Scholar Legal covers the majority of US case law and is free.)
Finance: IRS publications and Internal Revenue Code (IRC) sections for tax; FASB ASC for US accounting standards; IFRS for international; SEC regulations (17 CFR); CFA Institute curriculum for investment concepts; Federal Reserve publications for monetary policy. When verifying a public company figure, go to SEC EDGAR first: the 10-K is always the controlling document.
Medicine: PubMed/MEDLINE for clinical research; FDA labeling and DailyMed for pharmacology (free, no subscription). UpToDate and clinical practice guidelines from specialty societies (ACC, ADA, ASCO) are strong Tier 2. For dosing questions specifically, DailyMed is faster and more authoritative than any textbook.
Law: Google Scholar Legal for US case law (free); the specific jurisdiction's statutes and regulations for statutory questions; CFR for federal regulations; Restatements for common law principles. When a claim depends on jurisdiction, identify the jurisdiction before you search; the same legal rule can differ significantly across states.
Algorithms / Engineering: Peer-reviewed papers and preprints (arXiv, ACM Digital Library, IEEE Xplore); NIST standards; official language and framework documentation (Python docs, RFC documents for networking). For complexity claims, a peer-reviewed source beats a textbook; the textbook beats a blog post.
Consulting / Business: Industry reports from major research firms (McKinsey, Gartner, IBISWorld) as Tier 2; respected management publications (HBR, MIT Sloan Management Review) for frameworks; government statistical agencies (BLS, Census Bureau) for economic data. Be careful with market size figures: methodology varies between research firms, and two credible sources can disagree significantly.
Try It: is this rationale valid?
The rationale under review: "The AI's stated starting dose of 500 mg twice daily is consistent with general knowledge about the pharma industry and is widely understood to be the correct starting dose." Is this rationale valid?
See answer
This rationale is invalid. Two problems:
"General knowledge about the pharma industry" is not a source. It's an impression. The annotator can't point to a specific document that can be independently checked. This fails the core requirement that rationales be traceable.
"Widely understood" is not evidence. Many widely-held medical beliefs are outdated or jurisdiction-specific. Dosing guidance changes with new research and regulatory updates. "Everyone knows" is the kind of certainty that ground truth research is designed to replace.
A valid rationale: "The AI's stated starting dose of 500 mg twice daily is consistent with the FDA-approved prescribing information for metformin (DailyMed NDA 021202), which states: 'The usual starting dose of metformin hydrochloride tablets is 500 mg twice a day.' Rating: Factually accurate for standard adult dosing."
A specific document, a specific passage, independently checkable. Anyone with internet access could open DailyMed and confirm that sentence exists. That's what "traceable" means.
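DailyMed also exposes labels through a public web-services API, which is useful when a task batch involves many drugs. A minimal sketch, assuming the v2 spls endpoint; the drug name matches the metformin example above:

```python
import requests

# DailyMed web services v2: list structured product labels for a drug name.
URL = "https://dailymed.nlm.nih.gov/dailymed/services/v2/spls.json"

resp = requests.get(URL, params={"drug_name": "metformin"})
resp.raise_for_status()

# Each entry carries a setid; opening that setid on DailyMed shows the full
# FDA label text, which is the passage you quote in the rationale.
for label in resp.json()["data"][:5]:
    print(label["setid"], "-", label["title"])
```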
Try It: conflicting sources
This is the same HIPAA scenario from the Entry Simulation, extended. You've already seen how to handle a complete versus incomplete source. Now the harder version: what if the sources genuinely contradict?
You're evaluating an AI response about the maximum penalty for a first-time HIPAA violation. You find:
Source A (HHS official guidance): "The maximum penalty per violation category is $50,000, with an annual cap of $1.5 million per violation category."
Source B (a 2023 federal court ruling in your jurisdiction): "The court finds the HHS penalty cap calculation methodology inconsistent with the statutory text of 42 U.S.C. § 1320d-5, and declines to apply it as written."
Both are Tier 1. They conflict. What do you do?
See answer
Flag the task and document both sources explicitly.
This is a genuine Tier 1 conflict. It puts regulatory guidance against a court ruling applying the underlying statute. You cannot silently pick one. The court ruling doesn't override HHS guidance nationally, but it does mean the penalty cap is legally contested in at least one jurisdiction. That's material to any annotation task asking about HIPAA enforcement.
What to write: cite both sources. Identify the nature of the conflict (regulatory guidance vs. statutory interpretation). Note the jurisdictional scope of the ruling. Flag that the claim as written ("the maximum penalty is $50,000") is accurate under HHS guidance but contested in federal court. Do not assert a verdict. The contradiction is the finding.
This is different from the Entry Simulation case, where the apparent conflict was just an incomplete versus complete account of the same fact. Here there's no reconciliation available. Two authoritative sources disagree about what the law means. The model needs to learn that some legal questions don't have a single clean answer. Your job is to make sure it learns that, not to paper over the ambiguity by picking a side.
Common ground truth mistakes
Citing Wikipedia. Here's what that looks like in a rationale: "According to Wikipedia, the standard amortization period for goodwill under GAAP is 10 years." A reviewer reads this and rejects it immediately. This is not because Wikipedia is wrong (it may not be). It is because it traces to nothing. Wikipedia is written by volunteers and updated daily. Even accurate Wikipedia content traces back to a primary source. Go there directly.
Treating AI output as confirmation. The mistake looks like this: you're unsure whether a claim is correct, so you ask another AI tool. It confirms the claim, and you cite that as verification. That's circular. The AI's output cannot validate the AI's output. This is the exact failure mode you are being paid to catch. Doing it in your rationale means you've become the problem.
Confusing "common knowledge" with "verified." Widely-held beliefs in medicine, finance, and law are frequently outdated, jurisdiction-specific, or simply wrong. The more confident a claim sounds, the more you need to verify it. "Everyone knows X" has no place in a ground truth rationale.
Missing the context qualifier. "The capital gains rate is 15%" is only true for certain income levels and asset types. Ground truth without its scope conditions isn't ground truth. It is an oversimplification that can mislead model training. Always check what the source says the claim applies to, not just what the number is.
Stopping at the first hit. The first search result is usually not the primary source. Follow the chain: blog → news article → textbook → journal → statute or study. The chain ends at something that can be cited. If you stop in the middle, your rationale is only as strong as whatever you stopped at.
Quick Reference
- Ground truth in practice: The answer that can be confirmed against a named, traceable primary source, not just something that sounds authoritative.
- The source hierarchy: Tier 1 (statutes, regulations, peer-reviewed research) → Tier 2 (textbooks, professional guidance) → Tier 3 (Wikipedia, news, AI output).
- When sources conflict: Document the Tier 1 contradiction (don't pick a side) or cite the most complete account.