Before you read anything, make a call
You're evaluating an AI response about HIPAA penalties. You find two sources:
Source A, a legal news article from 2022:
"HIPAA civil penalties can reach up to $50,000 per violation."
Source B, the HHS (Department of Health and Human Services) official guidance page:
"Penalties per violation range from $100 for unknowing violations up to a maximum of $50,000, with an annual cap of $1.5 million per violation category."
The AI response said: "The maximum HIPAA civil penalty is $50,000 per violation."
Is the AI response correct? And which source do you cite?
The AI response is incomplete, not wrong. Source B is the one you cite.
Source A is a news article (Tier 3). Source B is official regulatory guidance from the governing body (Tier 1). When those two conflict, Tier 1 wins. This isn't because news articles are always wrong; it's because the regulatory body is the authority on its own rules.
Insight: The core skill here is not just finding a number. You need to understand what the source says and what it doesn't. Apparent conflicts between sources are often just a matter of a complete account versus a partial one.
Prove your answer
Platforms distinguish between two types of annotation work. Most tasks ask for your judgment. You might decide which response is more helpful, or which tone fits the context. A subset of tasks, usually the highest-paying ones, ask for something harder. You have to establish what is true.
That is ground truth. It is not a consensus of opinions. It is not what the AI generated. It is the answer that can be confirmed against a primary source, independent of anyone's judgment, including yours.
The reason expert-tier annotation pays more isn't just that it's complex. It's that the stakes of being wrong are higher.
A model trained on incorrect medical dosing or faulty legal reasoning will reproduce those errors at scale. Establishing ground truth in high-stakes domains requires knowing what to look for and where to look. You must be able to tell primary from secondary from unreliable sources.
Verifiable vs. judgment-dependent
The first skill is recognizing which type of question you're dealing with.
Verifiable questions have a single correct answer. You can confirm this against an authoritative source, independent of who's checking:
- "What is the capital gains tax rate for assets held under one year in the US?" → Verifiable against IRS Publication 550.
- "What is the time complexity of merge sort?" → Verifiable against any algorithms textbook (see the sketch after this list).
- "Under HIPAA, what is the minimum necessary standard?" → Verifiable against 45 CFR §164.502(b).
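To make the merge sort claim concrete: it is verifiable because the algorithm's structure forces the answer. A minimal Python sketch, illustrative only and not taken from any task:

```python
def merge_sort(xs):
    """Recurrence T(n) = 2*T(n/2) + O(n), which solves to O(n log n)."""
    if len(xs) <= 1:                 # base case: 0 or 1 elements are sorted
        return xs
    mid = len(xs) // 2
    left = merge_sort(xs[:mid])      # T(n/2)
    right = merge_sort(xs[mid:])     # T(n/2)
    merged, i, j = [], 0, 0          # merge: O(n) work at each of log n levels
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

assert merge_sort([5, 2, 8, 1, 9]) == [1, 2, 5, 8, 9]
```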
Judgment-dependent questions require weighing competing considerations where reasonable experts disagree:
- "Is this investment strategy appropriate for this client?" → Depends on undisclosed risk tolerance, time horizon, tax situation.
- "Is this legal argument persuasive?" → Depends on jurisdiction, judge, standard of review.
When tasks involve judgment-dependent questions, your job shifts. You aren't finding the one correct answer. You are evaluating whether the AI's reasoning is sound, identifies the key considerations, and acknowledges appropriate uncertainty.
Most annotation tasks mix both types. A response about a medication might contain a verifiable dosing claim embedded in judgment-dependent advice about whether to take it. The verifiable claim gets checked against a primary source. The judgment call gets evaluated on the quality of the reasoning. Don't confuse them. Each gets evaluated on its own terms.
The research hierarchy
When building ground truth, source quality determines whether your rationale holds up.
Tier 1 (Primary sources, always preferred): Statutes, regulations, and official government publications; peer-reviewed journal articles; official standards documents; court opinions and rulings; company filings.
Tier 2 (Authoritative secondary sources): Textbooks from major academic publishers when citing established principles; professional body guidance; central bank publications and major regulatory agency reports.
Tier 3 (Reference only, never citable): Wikipedia, news articles, blog posts, general AI-generated summaries.
Your rationale must trace to a Tier 1 or strong Tier 2 source. If you can't find one, that's important information. It means the claim may be contested, jurisdiction-dependent, or unknown.
Say so. "I couldn't find a primary source for this claim" is a valid and useful annotation. Inventing certainty is not.
Try It: choose the right source
You're reviewing an AI response that states: "Company X reported $4.2 billion in revenue for FY2023."
Which of the following sources would you use to verify this claim? Rank them from most to least appropriate.
- A) The company's Wikipedia page
- B) A Bloomberg news article reporting on the company's earnings call
- C) The company's 10-K annual report filed with the SEC (found on EDGAR)
- D) A blog post by a financial analyst summarizing the earnings
See answer
Best to least appropriate: C → B → D → A
C (10-K filing) is the definitive source. Revenue figures in a 10-K are audited, filed under penalty of law, and represent the primary regulatory record. This is your Tier 1 source. It is the one you cite in a rationale.
B (Bloomberg) is a strong Tier 2. It is a reputable financial news organization reporting on a specific filing. It's acceptable as corroboration. Always trace back to the 10-K itself if the claim matters.
D (analyst blog) is Tier 3. It may be accurate but introduces an interpretation layer. Analysts sometimes round, restate, or adjust figures. It is not citable in a rationale.
A (Wikipedia) is Tier 3 and the worst choice. Wikipedia's financial figures are often outdated, may use non-standard revenue definitions, and can be edited by anyone. Never cite Wikipedia for a financial fact in an annotation rationale.
The practical move is to go to EDGAR, find the 10-K, and search the document for "revenue" or "net sales." The exact figure is in the income statement. That's your citation.
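If you verify filed figures often, the same XBRL data behind EDGAR is also exposed programmatically. A minimal sketch, assuming the SEC's public companyconcept endpoint; the CIK and the us-gaap tag are placeholders you'd swap for the company you're checking (many filers report revenue under a different tag, such as RevenueFromContractWithCustomerExcludingAssessedTax), and the 10-K itself remains the citable source:

```python
import requests

# SEC's public XBRL API. The SEC asks for a descriptive User-Agent header.
# CIK 0000320193 and the "Revenues" tag are placeholders for illustration.
URL = ("https://data.sec.gov/api/xbrl/companyconcept/"
       "CIK0000320193/us-gaap/Revenues.json")

resp = requests.get(URL, headers={"User-Agent": "name@example.com"})
resp.raise_for_status()

# Each fact records the value, the fiscal year, and the form it came from,
# so you can keep only figures sourced from an audited 10-K.
for fact in resp.json()["units"]["USD"]:
    if fact.get("form") == "10-K":
        print(fact["fy"], fact["fp"], fact["val"])
```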
How to structure a rationale
A weak rationale like "Response A is more accurate" tells the platform nothing. A strong one shows the reasoning chain that produced your judgment. It should cover what the AI claimed, what the source says, and where they diverge.
In practice, that means hitting five things: the specific claim being evaluated, the controlling source, the relevant passage quoted or closely paraphrased, the discrepancy if there is one, and your verdict. Do not treat this as a checklist. Write it as a clear paragraph that a stranger could read and verify.
Here is what it looks like in a real finance task:
"The AI response states that goodwill must be amortized over its useful life. This is incorrect under US GAAP. Per ASC 350-20-35-1, goodwill is not amortized but is instead tested for impairment at least annually. The response would be accurate under IFRS for SMEs (Section 19.23), but the prompt specified a US publicly traded company, making ASC the applicable standard. Rating: Factually incorrect for the stated context."
That rationale names the specific standard, cites the paragraph, explains why jurisdiction matters, and gives a clear verdict. There is no ambiguity about what the annotator checked or why they concluded what they did.
That is the standard. Anything less is noise.
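If a pre-submission sanity check helps, the five elements are easy to encode. A throwaway Python sketch; the field names are ours, not any platform's:

```python
from dataclasses import dataclass

@dataclass
class Rationale:
    claim: str        # the specific claim being evaluated
    source: str       # the controlling source, named and citable
    passage: str      # the relevant passage, quoted or closely paraphrased
    discrepancy: str  # where claim and source diverge; empty if none
    verdict: str      # the rating

    def render(self) -> str:
        """Join the five elements into the single paragraph a reviewer reads."""
        parts = [
            f"The response claims: {self.claim}.",
            f'Per {self.source}: "{self.passage}".',
        ]
        if self.discrepancy:
            parts.append(f"Discrepancy: {self.discrepancy}.")
        parts.append(f"Rating: {self.verdict}.")
        return " ".join(parts)
```

An empty field is the signal to go back and research before you submit, not a slot to pad with filler.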
Domain-specific source types
Each expert domain has canonical sources. The ones below are where you go first. (For legal research specifically, note that Westlaw and LexisNexis require subscriptions most annotators won't have; Google Scholar Legal covers the majority of US case law and is free.)
Finance: IRS publications and Internal Revenue Code (IRC) sections for tax; FASB ASC for US accounting standards; IFRS for international; SEC regulations (17 CFR); CFA Institute curriculum for investment concepts; Federal Reserve publications for monetary policy. When verifying a public company figure, go to SEC EDGAR first: the 10-K is always the controlling document.
Medicine: PubMed/MEDLINE for clinical research; FDA labeling and DailyMed for pharmacology (free, no subscription). UpToDate and clinical practice guidelines from specialty societies (ACC, ADA, ASCO) are strong Tier 2. For dosing questions specifically, DailyMed is faster and more authoritative than any textbook.
Law: Google Scholar Legal for US case law (free); the specific jurisdiction's statutes and regulations for statutory questions; CFR for federal regulations; Restatements for common law principles. When a claim depends on jurisdiction, identify the jurisdiction before you search; the same legal rule can differ significantly across states.
Algorithms / Engineering: Peer-reviewed papers and preprints (arXiv, ACM Digital Library, IEEE Xplore); NIST standards; official language and framework documentation (Python docs, RFC documents for networking). For complexity claims, a peer-reviewed source beats a textbook; the textbook beats a blog post.
Consulting / Business: Industry reports from major research firms (McKinsey, Gartner, IBISWorld) as Tier 2; respected management publications (HBR, MIT Sloan Management Review) for frameworks; government statistical agencies (BLS, Census Bureau) for economic data. Be careful with market size figures: methodology varies between research firms, and two credible sources can disagree significantly.
Try It: is this rationale valid?
The rationale under review: "The AI's stated starting dose of 500 mg twice daily is consistent with general knowledge about the pharma industry and is widely understood to be the correct starting dose." Is this rationale valid?
See answer
This rationale is invalid. Two problems:
"General knowledge about the pharma industry" is not a source. It's an impression. The annotator can't point to a specific document that can be independently checked. This fails the core requirement that rationales be traceable.
"Widely understood" is not evidence. Many widely-held medical beliefs are outdated or jurisdiction-specific. Dosing guidance changes with new research and regulatory updates. "Everyone knows" is the kind of certainty that ground truth research is designed to replace.
A valid rationale: "The AI's stated starting dose of 500 mg twice daily is consistent with the FDA-approved prescribing information for metformin (DailyMed NDA 021202), which states: 'The usual starting dose of metformin hydrochloride tablets is 500 mg twice a day.' Rating: Factually accurate for standard adult dosing."
A specific document, a specific passage, independently checkable. Anyone with internet access could open DailyMed and confirm that sentence exists. That's what "traceable" means.
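DailyMed also exposes labels through a public web-services API, which is useful when a task batch involves many drugs. A minimal sketch, assuming the v2 spls endpoint; the drug name matches the metformin example above:

```python
import requests

# DailyMed web services v2: list structured product labels for a drug name.
URL = "https://dailymed.nlm.nih.gov/dailymed/services/v2/spls.json"

resp = requests.get(URL, params={"drug_name": "metformin"})
resp.raise_for_status()

# Each entry carries a setid; opening that setid on DailyMed shows the full
# FDA label text, which is the passage you quote in the rationale.
for label in resp.json()["data"][:5]:
    print(label["setid"], "-", label["title"])
```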
Try It: conflicting sources
This is the same HIPAA scenario from the Entry Simulation, extended. You've already seen how to handle a complete versus incomplete source. Now the harder version: what if the sources genuinely contradict?
You're evaluating an AI response about the maximum penalty for a first-time HIPAA violation. You find:
Source A (HHS official guidance): "The maximum penalty per violation category is $50,000, with an annual cap of $1.5 million per violation category."
Source B (a 2023 federal court ruling in your jurisdiction): "The court finds the HHS penalty cap calculation methodology inconsistent with the statutory text of 42 U.S.C. § 1320d-5, and declines to apply it as written."
Both are Tier 1. They conflict. What do you do?
See answer
Flag the task and document both sources explicitly.
This is a genuine Tier 1 conflict. It puts regulatory guidance against a court ruling applying the underlying statute. You cannot silently pick one. The court ruling doesn't override HHS guidance nationally, but it does mean the penalty cap is legally contested in at least one jurisdiction. That's material to any annotation task asking about HIPAA enforcement.
What to write: cite both sources. Identify the nature of the conflict (regulatory guidance vs. statutory interpretation). Note the jurisdictional scope of the ruling. Flag that the claim as written ("the maximum penalty is $50,000") is accurate under HHS guidance but contested in federal court. Do not assert a verdict. The contradiction is the finding.
This is different from the Entry Simulation case, where the apparent conflict was just an incomplete versus complete account of the same fact. Here there's no reconciliation available. Two authoritative sources disagree about what the law means. The model needs to learn that some legal questions don't have a single clean answer. Your job is to make sure it learns that, not to paper over the ambiguity by picking a side.
Common ground truth mistakes
Citing Wikipedia. Here's what that looks like in a rationale: "According to Wikipedia, the standard amortization period for goodwill under GAAP is 10 years." A reviewer reads this and rejects it immediately. This is not because Wikipedia is wrong (it may not be). It is because it traces to nothing. Wikipedia is written by volunteers and updated daily. Even accurate Wikipedia content traces back to a primary source. Go there directly.
Treating AI output as confirmation. The mistake looks like this: you're unsure whether a claim is correct, so you ask another AI tool. It confirms the claim, and you cite that as verification. That's circular. The AI's output cannot validate the AI's output. This is the exact failure mode you are being paid to catch. Doing it in your rationale means you've become the problem.
Confusing "common knowledge" with "verified." Widely-held beliefs in medicine, finance, and law are frequently outdated, jurisdiction-specific, or simply wrong. The more confident a claim sounds, the more you need to verify it. "Everyone knows X" has no place in a ground truth rationale.
Missing the context qualifier. "The capital gains rate is 15%" is only true for certain income levels and asset types. Ground truth without its scope conditions isn't ground truth. It is an oversimplification that can mislead model training. Always check what the source says the claim applies to, not just what the number is.
Stopping at the first hit. The first search result is usually not the primary source. Follow the chain: blog → news article → textbook → journal → statute or study. The chain ends at something that can be cited. If you stop in the middle, your rationale is only as strong as whatever you stopped at.
Quick Reference
- Ground truth in practice: The answer that can be confirmed against a named, traceable primary source, not just something that sounds authoritative.
- The source hierarchy: Tier 1 (statutes, regulations, peer-reviewed research) → Tier 2 (textbooks, professional guidance) → Tier 3 (Wikipedia, news, AI output).
- When sources conflict: Document the Tier 1 contradiction (don't pick a side) or cite the most complete account.