Before you read anything: make a call
You receive a clinical case document for a medical annotation task. Read this excerpt and identify anything that should not be in a training dataset:
“Patient Margaret T. Holloway, DOB 07/14/1958, MRN 4421890, residing at 112 Birchwood Lane, Springfield, IL 62704, presented to Springfield General Hospital on 03/02/2024 with complaints of persistent fatigue. She can be reached at (217) 555-0183 or mholloway@email.com. Primary insurance: BlueCross policy #ZKL-8833291.”
How many elements in this excerpt need to be removed or replaced before this document can be used in AI training data?
See the answer
Nine elements must be removed. The clinical content (“persistent fatigue”) and the hospital name are the only parts that can stay.
Here’s the full list under HIPAA’s Safe Harbor de-identification method:
- Full name: “Margaret T. Holloway”
- Date of birth: “07/14/1958” (full dates beyond year are Protected Health Information, or PHI)
- Medical record number (MRN): “4421890”
- Street address: “112 Birchwood Lane”
- City and zip code: “Springfield, IL 62704” (geographic data smaller than state level is PHI; under Safe Harbor, at most the first three digits of a zip code may be kept, and only when the area those digits cover contains more than 20,000 people)
- Date of service: “03/02/2024”
- Phone number: “(217) 555-0183”
- Email address: “mholloway@email.com”
- Health plan beneficiary number: “ZKL-8833291”
Each of these is one of HIPAA’s 18 PHI identifier categories. A correctly de-identified version replaces each with a bracketed placeholder: [NAME], [DOB], [MRN], and so on.
The hospital name (“Springfield General Hospital”) may become PHI in combination with other details, but standing alone it’s not one of the 18 identifiers. “Persistent fatigue” is clinical data, not an identifier. Those stay.
If you receive a task document with apparent real PHI that wasn’t intentionally included as a training example (like a real name and date of birth where you’d expect synthetic data), you flag it and stop. You don’t continue working with it as if it were de-identified.
Evaluate against the source, not from memory
Medical annotation is open to non-clinicians. That’s not a loophole; it’s by design. The skill this work tests is not clinical knowledge. It’s the ability to evaluate whether an AI response accurately reflects the source documents provided in the task.
The distinction matters because it defines your authority. You’re not diagnosing patients. You’re not applying your own medical knowledge to second-guess the AI. You’re checking whether what the AI said matches what the source documents say. A nurse who brings independent clinical knowledge into a task and overrides the source is introducing uncontrolled bias into training data. An annotator without clinical training who rigorously checks each claim against the source produces reliable signal.
This module gives you the structural knowledge to do that: how clinical notes are organized, what the terminology means, what HIPAA requires of the data you handle, and how to flag AI errors in medical responses without overstepping your role.
Clinical case study format
The core document type in medical AI training is the SOAP note. You’ll encounter these as reference documents and evaluate AI-generated versions. The structure is standard across clinical settings.
Chief Complaint (CC)
One or two sentences in the patient’s own words describing why they sought care.
- Correct: “I’ve had a sharp pain in my lower right abdomen for two days.”
- Incorrect: “Patient presents with appendicitis.”
The second version is a diagnosis, not a complaint. The chief complaint is what the patient reports. The diagnosis comes after assessment. An AI that conflates these has made a structural documentation error.
History of Present Illness (HPI)
A structured narrative covering the presenting symptom in detail, organized around the OLDCARTS framework:
- Onset: When did it start?
- Location: Where exactly?
- Duration: How long does it last?
- Character: What does it feel like? (sharp, dull, burning, throbbing)
- Alleviating/Aggravating factors: What makes it better or worse?
- Radiation: Does it spread anywhere?
- Timing: Constant or intermittent? Pattern?
- Severity: Scale of 1–10; functional impact
An AI-generated HPI that skips multiple OLDCARTS elements is incomplete. Your evaluation should identify which elements are missing by name.
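To make “identify which elements are missing by name” concrete, here is a minimal Python sketch of a keyword-based coverage check. The keyword lists and the `missing_oldcarts_elements` helper are invented for illustration, not part of any task toolkit; a real evaluation still means reading the HPI, not trusting a keyword match.

```python
# Illustrative sketch: flag which OLDCARTS elements an HPI appears to skip.
# The keyword lists are rough heuristics made up for this example.
OLDCARTS_KEYWORDS = {
    "Onset": ["began", "started", "onset"],
    "Location": ["quadrant", "located", "region"],
    "Duration": ["lasting", "lasts", "for the past", "duration"],
    "Character": ["sharp", "dull", "burning", "throbbing", "stabbing"],
    "Alleviating/Aggravating": ["worse with", "better with", "relieved by", "aggravated by"],
    "Radiation": ["radiat", "spreads to"],
    "Timing": ["constant", "intermittent", "comes and goes"],
    "Severity": ["/10", "out of 10", "severity"],
}

def missing_oldcarts_elements(hpi_text: str) -> list[str]:
    """Return the OLDCARTS elements with no matching keyword in the HPI text."""
    text = hpi_text.lower()
    return [
        element
        for element, keywords in OLDCARTS_KEYWORDS.items()
        if not any(keyword in text for keyword in keywords)
    ]

hpi = ("Sharp pain in the right lower quadrant that began two days ago, "
       "rated 8/10, worse with movement.")
print(missing_oldcarts_elements(hpi))
# ['Duration', 'Radiation', 'Timing'] -- name each missing element in your rationale.
```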
Past Medical History (PMH), Family History (FH), Social History (SH)
Relevant prior conditions, medications, allergies, family disease patterns, lifestyle factors (tobacco, alcohol, occupation, living situation). In AI training tasks, these sections test whether the model appropriately incorporates prior context into clinical reasoning. Allergies and contraindications are particularly important, as the “AI diagnosis contradicts the case file” Try It later in this module demonstrates.
Review of Systems (ROS)
A systematic inquiry of body systems beyond the chief complaint. Both positive findings (symptoms present) and pertinent negatives (symptoms the patient denies) are relevant. An AI that documents only positives and omits pertinent negatives is producing an incomplete clinical picture. Note the omissions specifically.
Physical Examination (PE)
Objective findings from examination. Written in clinical shorthand:
- HEENT: normocephalic, atraumatic; pupils equal, round, and reactive to light bilaterally
- Cardiovascular: regular rate and rhythm, no murmurs, rubs, or gallops
- Abdomen: soft, non-distended, tenderness to palpation in RLQ, positive McBurney’s point
Assessment and Plan (A/P)
The clinician’s diagnosis (or differential diagnoses) and the management plan. This is the highest-stakes section for AI evaluation. It is where clinical reasoning is most visible and where errors are most consequential.
Differential diagnosis format: The AI should list plausible diagnoses ranked by likelihood, with brief reasoning for each. An AI that states a single diagnosis without considering differentials is not modeling good clinical reasoning. Flag the absence of a differential when the case presentation is ambiguous.
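As a rough illustration of what a well-formed differential carries, here is a small structured sketch. The field names and the `flag_differential_issues` helper are hypothetical, not a task schema; the point is that each entry pairs a ranked diagnosis with stated reasoning.

```python
# Illustrative structure: each differential entry carries a rank, a diagnosis,
# and the AI's stated reasoning. Field names are invented for this sketch.
differential = [
    {"rank": 1, "diagnosis": "Acute appendicitis",
     "reasoning": "RLQ pain, fever, tenderness at McBurney's point"},
    {"rank": 2, "diagnosis": "Mesenteric adenitis",
     "reasoning": "Can mimic appendicitis; less likely with localized RLQ tenderness"},
]

def flag_differential_issues(entries: list[dict]) -> list[str]:
    """Return evaluation notes about the structure of an AI's differential."""
    notes = []
    if len(entries) < 2:
        notes.append("Single diagnosis with no differential -- flag if the presentation is ambiguous.")
    if any(not entry.get("reasoning") for entry in entries):
        notes.append("One or more differential entries lack stated reasoning.")
    return notes

print(flag_differential_issues(differential))      # []
print(flag_differential_issues(differential[:1]))  # single-diagnosis warning
```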
Try It: classify the SOAP note element
You’re reviewing an AI-generated clinical note. Classify each excerpt into the correct SOAP section (S = Subjective, O = Objective, A = Assessment, P = Plan):
Excerpt 1: “Patient reports sharp, stabbing pain in the right lower quadrant for the past 36 hours, rated 8/10, worsening with movement.”
Excerpt 2: “Temperature 38.4°C, heart rate 102 bpm, abdomen tender to palpation at McBurney’s point, positive Rovsing’s sign.”
Excerpt 3: “Acute appendicitis, most likely. Differential includes mesenteric adenitis and gastroenteritis, both less likely given the localized right lower quadrant findings.”
Excerpt 4: “Surgical consult ordered. NPO status initiated. IV fluids at 125 mL/hr. Repeat CBC in 4 hours.”
See answer
Excerpt 1 → S (Subjective). The patient’s own reported symptoms — what they feel, where, for how long, at what severity. Subjective data comes from what the patient says and cannot be independently measured.
Excerpt 2 → O (Objective). Measurable, observable clinical findings: vital signs and physical examination results. Objective data can be reproduced by another examiner.
Excerpt 3 → A (Assessment). The clinician’s diagnostic interpretation: most likely diagnosis plus the differential. Note the appropriate clinical reasoning — ranked differentials with brief justification for each. This is what good AI assessment looks like; an AI that states only “acute appendicitis” without the differential is producing an incomplete Assessment.
Excerpt 4 → P (Plan). Management actions: what will be done next. Includes orders, referrals, monitoring, and patient instructions.
A common AI error: placing diagnostic impressions in the Subjective section (“The patient seems to have appendicitis”) or mixing plan elements into the Assessment. When evaluating AI clinical notes, verify that each piece of information appears in its correct SOAP section.
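A minimal sketch of that placement check, assuming a naive keyword heuristic: the `SoapNote` container, the keyword list, and the `check_section_placement` helper below are illustrative only, not a validated classifier.

```python
from dataclasses import dataclass

# Diagnostic language does not belong in the Subjective section. The keyword
# list here is a toy heuristic invented for this sketch.
DIAGNOSTIC_LANGUAGE = ("diagnosis", "consistent with", "appendicitis", "most likely")

@dataclass
class SoapNote:
    subjective: str
    objective: str
    assessment: str
    plan: str

def check_section_placement(note: SoapNote) -> list[str]:
    """Flag the most common misplacement: diagnostic impressions in Subjective."""
    issues = []
    if any(term in note.subjective.lower() for term in DIAGNOSTIC_LANGUAGE):
        issues.append("Possible diagnostic impression in the Subjective section.")
    if not note.assessment.strip():
        issues.append("Assessment section is empty.")
    return issues

note = SoapNote(
    subjective="The patient seems to have appendicitis.",
    objective="Temperature 38.4°C, HR 102 bpm, RLQ tenderness at McBurney's point.",
    assessment="Acute appendicitis; differential includes mesenteric adenitis.",
    plan="Surgical consult, NPO, IV fluids.",
)
print(check_section_placement(note))
# ['Possible diagnostic impression in the Subjective section.']
```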
Medical terminology: minimum vocabulary for annotation
You don’t need to memorize the Merck Manual. You need the high-frequency terms that appear in case studies.
Prefixes: brady- (slow), tachy- (fast), hypo- (under/low), hyper- (over/high), dys- (abnormal), a-/an- (without), peri- (around), sub- (under)
Suffixes: -itis (inflammation), -emia (blood condition), -ectomy (surgical removal), -plasty (repair), -ology (study of), -algia (pain), -pathy (disease of), -oscopy (visual examination)
Common terms to recognize:
- Tachycardia (fast heart rate), bradycardia (slow heart rate)
- Dyspnea (difficulty breathing), tachypnea (fast breathing rate)
- Edema (swelling from fluid), ascites (abdominal fluid accumulation)
- Afebrile (no fever), febrile (fever present)
- Pruritus (itching), erythema (redness)
- Syncope (fainting), presyncope (near-fainting)
When an AI uses clinical terminology, verify the term is used correctly. “The patient presented with acute dyspnea and tachycardia” is appropriate. “The patient had a tachycardia of the breathing” reflects a terminology error worth noting in your rationale.
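The prefixes and suffixes above compose directly into most of the common terms. Here is a naive decomposition sketch; the glossary entries mirror the lists in this section, but the matching logic is an illustration, not a medical dictionary.

```python
# Decompose a clinical term into the prefixes and suffixes listed above.
PREFIXES = {"brady": "slow", "tachy": "fast", "hypo": "under/low", "hyper": "over/high",
            "dys": "abnormal", "a": "without", "an": "without", "peri": "around", "sub": "under"}
SUFFIXES = {"itis": "inflammation", "emia": "blood condition", "ectomy": "surgical removal",
            "plasty": "repair", "ology": "study of", "algia": "pain",
            "pathy": "disease of", "oscopy": "visual examination"}

def gloss(term: str) -> str:
    """Return the listed prefix and suffix meanings found in a term, if any."""
    term = term.lower()
    parts = []
    # Check longer affixes first so "an-" wins over "a-" in words like "anemia".
    for prefix, meaning in sorted(PREFIXES.items(), key=lambda kv: len(kv[0]), reverse=True):
        if term.startswith(prefix):
            parts.append(f"{prefix}- ({meaning})")
            break
    for suffix, meaning in sorted(SUFFIXES.items(), key=lambda kv: len(kv[0]), reverse=True):
        if term.endswith(suffix):
            parts.append(f"-{suffix} ({meaning})")
            break
    return " + ".join(parts) if parts else "no listed prefix or suffix found"

print(gloss("tachycardia"))  # tachy- (fast)
print(gloss("anemia"))       # an- (without) + -emia (blood condition)
print(gloss("gastrectomy"))  # -ectomy (surgical removal)
```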
HIPAA: what it means for AI training work
HIPAA (Health Insurance Portability and Accountability Act) governs the handling of Protected Health Information (PHI). In AI training contexts you’ll frequently encounter de-identified case studies or synthetic patient data. Understanding HIPAA matters for three reasons.
1. Recognizing PHI in training data
PHI includes 18 categories of identifiers under HIPAA’s Safe Harbor method. The most common in case study material: names, full dates (beyond year), geographic data smaller than state level, phone numbers, email addresses, Social Security numbers, medical record numbers (MRNs), health plan beneficiary numbers, IP addresses, and full-face photographs or comparable images.
The Entry Simulation at the top of this module walked through a realistic example. The key takeaway: if a task document contains apparent real PHI that wasn’t intentionally included as a training example, flag it and stop. Don’t continue working with it as if it were synthetic.
2. The minimum necessary standard
Under 45 CFR §164.502(b), the minimum necessary standard requires limiting PHI access, use, and disclosure to what’s needed for the specific purpose. In annotation terms: if a task only requires evaluating clinical reasoning about a diagnosis, don’t request or use more patient identifying information than the task requires. If a case study is appropriately de-identified, don’t attempt to re-identify it.
3. De-identification methods
HIPAA recognizes two valid de-identification approaches:
- Safe Harbor: Remove all 18 identifier categories; in annotation documents, removed identifiers are typically replaced with bracketed placeholders
- Expert Determination: A qualified statistical expert determines and documents that the risk of re-identification is very small
Synthetic patient data (generated for training purposes) is not subject to HIPAA if it contains no actual PHI. Most AI training case studies use synthetic or fully de-identified data precisely for this reason. Knowing this lets you work without overcautious behavior that slows annotation without adding safety.
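For a mechanical sense of what placeholder substitution looks like, here is a minimal regex sketch over part of the entry-simulation excerpt. The patterns are illustrative and far from exhaustive: they are not a Safe Harbor implementation, and apparent real PHI in a task should still be flagged, not quietly scrubbed.

```python
import re

# A few identifier patterns with their bracketed placeholders. These cover only
# the formats seen in the example excerpt; real identifiers take many more forms.
PATTERNS = [
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "[DATE]"),         # full dates
    (re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}"), "[PHONE]"),      # US-style phone numbers
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),  # email addresses
    (re.compile(r"\bMRN\s*\d+\b"), "MRN [MRN]"),              # medical record numbers
    (re.compile(r"\b\d{5}(?:-\d{4})?\b"), "[ZIP]"),           # zip codes
]

def apply_placeholders(text: str) -> str:
    """Replace matched identifiers with bracketed placeholders."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

excerpt = ("presented on 03/02/2024 with persistent fatigue. "
           "She can be reached at (217) 555-0183 or mholloway@email.com.")
print(apply_placeholders(excerpt))
# presented on [DATE] with persistent fatigue. She can be reached at [PHONE] or [EMAIL].
```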
Evaluating AI medical responses without being a clinician
The correct workflow is source-first, always:
- Read the provided source materials (clinical guidelines, case notes, drug references) before reading the AI response.
- Identify the specific claims the AI makes.
- Verify each claim against the sources: is this supported? Contradicted? Absent from the sources?
- Note your findings with specific source references. Cite the section or passage, not just “the case file.”
What you should not do: Substitute your own medical knowledge for source-based evaluation. If you think the AI might be wrong but can’t find the specific passage in the provided materials that confirms it, note that you couldn’t verify the claim. Don’t invent a counter-argument. Your evaluation authority ends at the source documents.
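One way to keep findings organized during review is a small structured record per claim; the `ClaimCheck` fields and status labels below are illustrative, not a required format. Note the third status: a claim you could not verify is recorded as such, not argued against from memory.

```python
from dataclasses import dataclass
from typing import Literal

# Each AI claim gets a status tied to the source documents, plus a specific
# reference. Field names and example passages are illustrative only.
Status = Literal["supported", "contradicted", "unverified"]

@dataclass
class ClaimCheck:
    ai_claim: str
    status: Status
    source_reference: str  # cite the section or passage, not just "the case file"

findings = [
    ClaimCheck("Patient is febrile", "supported",
               "Objective section: 'Temperature 38.4°C'"),
    ClaimCheck("No known drug allergies", "contradicted",
               "Case file: 'documented penicillin allergy (anaphylaxis)'"),
    ClaimCheck("Symptoms improved with rest", "unverified",
               "No passage in the provided materials addresses symptom improvement"),
]

for finding in findings:
    print(f"{finding.status.upper():12} {finding.ai_claim} -- {finding.source_reference}")
```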
Red flags in AI medical responses to watch for:
- Specific drug dosages stated as universal facts (dosing is weight-, age-, renal function-, and indication-dependent)
- Diagnoses stated with certainty when the case presentation supports only a differential
- Treatment recommendations that contradict stated contraindications in the case file
- Clinical normal ranges stated without acknowledging population variability
Try It: AI diagnosis contradicts the case file
You’re evaluating an AI-generated Assessment & Plan section. The provided case file states:
“Patient has a documented penicillin allergy (anaphylaxis). Culture results: gram-positive cocci in clusters, consistent with Staphylococcus aureus. Sensitivity panel: Methicillin-resistant (MRSA confirmed).”
The AI’s Assessment & Plan reads:
“Assessment: Skin and soft tissue infection, MRSA. Plan: Initiate amoxicillin-clavulanate 875/125 mg PO twice daily for 10 days.”
What do you flag, how do you document it, and what should you NOT do?
See answer
Flag two distinct errors, both sourced directly to the case file:
Error 1 (Contraindication violation): The patient has a documented penicillin allergy (anaphylaxis). Amoxicillin-clavulanate is a penicillin-class antibiotic. The AI’s treatment plan directly contradicts a stated contraindication in the case file.
Error 2 (Incorrect drug selection for the confirmed pathogen): The sensitivity panel confirms MRSA (methicillin-resistant Staphylococcus aureus). Amoxicillin-clavulanate (a beta-lactam) has no activity against MRSA. The plan would be clinically ineffective even without the allergy.
How to document: Cite the specific case file passages. Example rationale: “The AI’s plan recommends amoxicillin-clavulanate. The case file documents a penicillin allergy (anaphylaxis), making penicillin-class antibiotics contraindicated. Additionally, the sensitivity panel confirms MRSA, which is resistant to amoxicillin-clavulanate. The plan contradicts two explicit source document findings.”
What NOT to do: Do not suggest an alternative treatment (for example, “should have used vancomycin”) unless the case file explicitly states it. Your role is to identify what the AI got wrong relative to the source, not to provide clinical guidance. The moment you recommend treatment, you’ve stepped outside your evaluation role and introduced your own judgment into what should be source-based annotation.
Quick Reference
- Evaluate against the source, not from memory: Your job is to check whether the AI’s claims match the provided documents, not to apply your own clinical knowledge. If you can’t find a passage in the source that supports or contradicts the AI’s claim, note that you couldn’t verify it. Don’t substitute judgment for evidence.
- SOAP structure and where AI errors appear: Subjective = what the patient reports; Objective = measurable findings; Assessment = diagnosis and differential; Plan = management actions. Common AI errors: putting diagnoses in the Subjective section, omitting differentials in the Assessment, or contradicting contraindications stated elsewhere in the case file.
- PHI means all 18 identifiers, not just names: Full dates, zip codes, MRNs, phone numbers, email addresses, and health plan numbers are all PHI. If a task document contains apparent real PHI that wasn’t intentionally included, flag it and stop. Don’t continue working with it as if it were synthetic.