Before you read anything: make a call
The task: return all products that have never been ordered.
The schema has two tables: products (id, name) and order_items (order_id, product_id).
The sample data has 3 products, all of which have at least one order. Both queries below return an empty result set on the test data. But only one is correct in the general case.
Query A:
SELECT p.id, p.name
FROM products p
LEFT JOIN order_items oi ON p.id = oi.product_id
WHERE oi.product_id IS NULL;
Query B:
SELECT p.id, p.name
FROM products p
WHERE p.id NOT IN (SELECT product_id FROM order_items);
Both return the same output on the sample data. Which one is correct, and what condition causes the other to fail?
See the answer
Query A is correct. Query B is wrong, and it will fail silently in production.
NOT IN with a subquery breaks when the subquery returns any NULL values. If any row in order_items has a NULL product_id (which happens in real data with optional foreign keys or incomplete imports), then NOT IN (...) can never evaluate to TRUE: it is FALSE for products that match a listed value and NULL (unknown) for everything else. The entire query returns zero rows, even for products that genuinely have no orders.
The sample data has no NULL product_id values, so the bug is invisible during testing. On clean test data, Query B looks fine. On production data, it silently drops valid results.
Query A uses the LEFT JOIN ... WHERE IS NULL pattern: the standard safe approach for "rows in A with no match in B." It explicitly checks for the absence of a match rather than relying on subquery behavior that NULL can corrupt.
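The failure is easy to reproduce. Here is a minimal sketch using Python's built-in sqlite3 module, with the article's schema and invented sample data; the NULL product_id simulates an incomplete import:

```python
# Reproduce the NOT IN NULL trap with sqlite3 (stdlib).
# Table and column names follow the article's schema; data is made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE order_items (order_id INTEGER, product_id INTEGER);
    INSERT INTO products VALUES (1, 'widget'), (2, 'gadget'), (3, 'gizmo');
    INSERT INTO order_items VALUES (100, 1), (101, NULL);  -- products 2 and 3 never ordered
""")

# Query B: NOT IN. The NULL in the subquery makes the comparison unknown
# for every unmatched product, so no row ever satisfies the filter.
not_in_rows = conn.execute(
    "SELECT p.id, p.name FROM products p "
    "WHERE p.id NOT IN (SELECT product_id FROM order_items)"
).fetchall()
print(not_in_rows)  # [] -- products 2 and 3 silently dropped

# Query A: LEFT JOIN ... WHERE IS NULL finds the unordered products.
left_join_rows = conn.execute(
    "SELECT p.id, p.name FROM products p "
    "LEFT JOIN order_items oi ON p.id = oi.product_id "
    "WHERE oi.product_id IS NULL"
).fetchall()
print(left_join_rows)  # [(2, 'gadget'), (3, 'gizmo')]
```

The two queries agree only while order_items is free of NULL product_id values; one bad row flips Query B to an empty result.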
This is the hardest class of SQL bug to catch: a query that executes without error, returns plausible results on test data, and produces wrong results on real data. Your job as an annotator is to reason about what a query does, not just whether it runs.
SQL evaluation is about semantics, not syntax
Most SQL annotation mistakes come from treating queries the same way a linter does: checking whether they'll execute. That's the wrong frame.
AI-generated SQL frequently executes without error and returns the wrong answer. A query that returns 847 rows when the correct answer is 1,000 rows produces no error message. A LEFT JOIN silently replaced with INNER JOIN will drop rows and say nothing about it. NOT IN with a subquery will return zero results if the subquery contains a NULL, and the only way to catch it is to know how NULL behaves in SQL.
Your job is to read a query and reason about what it will return, including what it gets wrong quietly.
Fundamentals you must know without reference
SELECT, WHERE, GROUP BY, HAVING
The most common AI errors happen at the GROUP BY/HAVING boundary. The rule:
- WHERE filters rows before aggregation
- HAVING filters after aggregation, on aggregate results like COUNT, SUM, AVG
Common AI error:
-- WRONG: aggregate function in WHERE clause
SELECT department, COUNT(*) AS headcount
FROM employees
WHERE COUNT(*) > 10
GROUP BY department;
-- CORRECT: use HAVING for post-aggregation filters
SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department
HAVING COUNT(*) > 10;
The first query raises a syntax error in most databases. The AI generates it anyway. You need to spot it, and more importantly, explain why it's wrong in your rationale.
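The contrast is quick to verify. This sketch runs both versions against SQLite (via Python's stdlib sqlite3 module) with invented employee data; SQLite rejects the aggregate-in-WHERE form with a "misuse of aggregate" error:

```python
# WHERE vs HAVING with an aggregate, demonstrated on sqlite3 (stdlib).
# Schema and data are invented for the example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, department TEXT, salary REAL);
    INSERT INTO employees VALUES
        (1, 'eng', 100), (2, 'eng', 110), (3, 'sales', 90);
""")

# The aggregate-in-WHERE version errors out instead of running.
raised = False
try:
    conn.execute(
        "SELECT department, COUNT(*) FROM employees "
        "WHERE COUNT(*) > 1 GROUP BY department"
    )
except sqlite3.OperationalError as e:
    raised = True
    print(e)  # SQLite reports a "misuse of aggregate" error

# HAVING filters after aggregation, which is where COUNT(*) is valid.
rows = conn.execute(
    "SELECT department, COUNT(*) AS headcount FROM employees "
    "GROUP BY department HAVING COUNT(*) > 1"
).fetchall()
print(rows)  # [('eng', 2)]
```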
JOIN types and their semantics
| JOIN type | Returns |
|---|---|
| INNER JOIN | Only rows with matches in both tables |
| LEFT JOIN | All rows from left table; NULLs for unmatched right rows |
| RIGHT JOIN | All rows from right table; NULLs for unmatched left rows |
| FULL OUTER JOIN | All rows from both tables; NULLs where no match |
| CROSS JOIN | Cartesian product: every combination of rows |
AI models frequently use INNER JOIN when the task semantics require LEFT JOIN. The symptom: the query silently drops rows for customers, orders, or users that have no matching record in the joined table. If the task says "show all customers and their order counts," customers with no orders should appear with a count of 0. That requires LEFT JOIN, not INNER JOIN.
NULL semantics
NULL in SQL means unknown. It does not mean zero, empty string, or false. This has consequences that catch annotators constantly.
Any comparison with NULL returns NULL:
NULL = 5 -- NULL (not FALSE)
NULL != 5 -- NULL (not TRUE)
NULL = NULL -- NULL (not TRUE)
To test for NULL, you must use IS NULL or IS NOT NULL.
The WHERE clause implication: WHERE status != 'cancelled' will silently exclude rows where status is NULL, because NULL != 'cancelled' evaluates to NULL, not TRUE. If those rows should be included, the correct filter is:
WHERE status != 'cancelled' OR status IS NULL
This is one of the most common silent correctness errors in AI-generated SQL.
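A small sqlite3 sketch (stdlib Python, invented data) shows the row disappearing and the fix restoring it:

```python
# WHERE inequality vs NULL, demonstrated on sqlite3 (stdlib).
# Table and data are invented for the example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, status TEXT);
    INSERT INTO orders VALUES (1, 'shipped'), (2, 'cancelled'), (3, NULL);
""")

# Naive filter: NULL != 'cancelled' evaluates to NULL, not TRUE,
# so order 3 is silently excluded.
naive = conn.execute(
    "SELECT id FROM orders WHERE status != 'cancelled' ORDER BY id"
).fetchall()
print(naive)  # [(1,)]

# Adding the IS NULL branch includes the unknown-status row.
fixed = conn.execute(
    "SELECT id FROM orders "
    "WHERE status != 'cancelled' OR status IS NULL ORDER BY id"
).fetchall()
print(fixed)  # [(1,), (3,)]
```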
Try It: INNER JOIN vs. LEFT JOIN
The task: "Write a SQL query that returns each customer's name and the total number of orders they have placed. Include customers who have placed no orders; show 0 for them."
The AI generates this query against a customers table and an orders table, where orders.customer_id references customers.id:
SELECT c.name, COUNT(o.id) AS order_count
FROM customers c
INNER JOIN orders o ON c.id = o.customer_id
GROUP BY c.name;
Is this query correct? If not, what is the specific error and how should it be fixed?
See answer
The query is incorrect. INNER JOIN will silently exclude any customer who has no orders, because INNER JOIN only returns rows where a match exists in both tables. Customers with no orders have no rows in orders, so they are dropped from the result entirely.
The task explicitly requires including customers with zero orders. No error message will appear; the query simply returns fewer rows than required. This is a semantic correctness error.
Correct query:
SELECT c.name, COUNT(o.id) AS order_count
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.name;
LEFT JOIN preserves all rows from the left table (customers). Where no matching order exists, o.id is NULL, and COUNT(o.id) correctly returns 0 because COUNT ignores NULLs.
One subtlety worth noting in your rationale: COUNT(*) would return 1 for unmatched rows, not 0. COUNT(o.id) is correct here because it counts non-NULL values only. If the AI used COUNT(*) instead, that's a second error on top of the JOIN type.
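Both points can be checked side by side. A sketch using sqlite3 (stdlib Python, invented customer data, one customer with no orders):

```python
# INNER vs LEFT JOIN, and COUNT(*) vs COUNT(o.id), on sqlite3 (stdlib).
# Names and data are invented; Ben has no orders.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Ben');
    INSERT INTO orders VALUES (10, 1), (11, 1);
""")

# INNER JOIN: Ben has no matching order row, so he vanishes.
inner = conn.execute(
    "SELECT c.name, COUNT(o.id) FROM customers c "
    "JOIN orders o ON c.id = o.customer_id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
print(inner)  # [('Ada', 2)]

# LEFT JOIN + COUNT(o.id): Ben appears with 0, since COUNT ignores NULLs.
left_count_id = conn.execute(
    "SELECT c.name, COUNT(o.id) FROM customers c "
    "LEFT JOIN orders o ON c.id = o.customer_id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
print(left_count_id)  # [('Ada', 2), ('Ben', 0)]

# LEFT JOIN + COUNT(*): the unmatched NULL row still counts as 1.
left_count_star = conn.execute(
    "SELECT c.name, COUNT(*) FROM customers c "
    "LEFT JOIN orders o ON c.id = o.customer_id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
print(left_count_star)  # [('Ada', 2), ('Ben', 1)]
```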
Window functions
Window functions are a common source of AI errors and a reliable signal of query sophistication:
SELECT
name,
department,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank
FROM employees;
Key distinctions between the three ranking functions:
- RANK(): tied rows get the same rank; the next rank is skipped (1, 1, 3)
- DENSE_RANK(): tied rows get the same rank; no ranks skipped (1, 1, 2)
- ROW_NUMBER(): always unique sequential numbers regardless of ties (1, 2, 3)
An AI that uses RANK() when the task requires unique row numbering (or DENSE_RANK() when the task requires gap-skipping) produces a semantically wrong result even though the query executes without error.
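The three functions diverge exactly at ties. A sketch using sqlite3 (stdlib Python; window functions require SQLite 3.25+, which ships with modern Python builds) on invented salary data with one tie:

```python
# RANK vs DENSE_RANK vs ROW_NUMBER on tied salaries, via sqlite3 (stdlib).
# Data is invented: 'a' and 'b' tie at 300.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER);
    INSERT INTO employees VALUES ('a', 300), ('b', 300), ('c', 200);
""")

# ROW_NUMBER gets an explicit tie-breaker (name) so the output is
# deterministic; without one, its assignment within a tie is arbitrary.
rows = conn.execute("""
    SELECT name,
           RANK()       OVER (ORDER BY salary DESC)       AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC)       AS drnk,
           ROW_NUMBER() OVER (ORDER BY salary DESC, name) AS rn
    FROM employees
    ORDER BY salary DESC, name
""").fetchall()
for r in rows:
    print(r)
# ('a', 1, 1, 1)
# ('b', 1, 1, 2)   <- same RANK and DENSE_RANK, unique ROW_NUMBER
# ('c', 3, 2, 3)   <- RANK skips 2; DENSE_RANK does not
```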
Try It: what's wrong with WHERE COUNT(*)?
A task asks for "all departments with more than 5 employees." The AI writes:
SELECT department, COUNT(*) AS headcount
FROM employees
WHERE COUNT(*) > 5
GROUP BY department;
Explain precisely what is wrong with this query and write the corrected version.
See answer
The error: WHERE COUNT(*) > 5 is invalid SQL. The WHERE clause filters individual rows before GROUP BY runs. Aggregate functions like COUNT() don't exist yet at the point WHERE is evaluated. Using an aggregate in WHERE causes a syntax error in virtually all SQL dialects.
The fix:
SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department
HAVING COUNT(*) > 5;
HAVING filters after aggregation, which is where aggregate functions are valid.
Why the rationale matters: A rationale that says "this query has a syntax error" is less useful than one that says "aggregate functions cannot appear in WHERE clauses because WHERE operates before GROUP BY; HAVING is the correct clause for post-aggregation filters." The first tells a reviewer the query breaks. The second tells the model why it breaks and what the correct pattern is. This is what training data is for.
Schema reading
When evaluating AI SQL, read the schema before reading the query. Common schema-related errors:
Wrong table: The AI joins orders to products when it should join order_items to products, because orders doesn't have a direct product foreign key; the relationship goes through order_items.
Hallucinated column names: AI generates customer_name when the schema has first_name and last_name as separate columns. The query references a column that doesn't exist.
Aggregation without grouping: SELECT department, AVG(salary) FROM employees without GROUP BY department is invalid in most SQL dialects, or returns a single row with an arbitrary department value.
Reading the schema takes 30 seconds. It catches a class of errors that no amount of SQL knowledge will help you spot without it.
Data analysis patterns: Python and pandas
Data science tasks often pair SQL with Python pandas code. The evaluation principles are the same: reason about what the code will actually return, not just whether it runs.
groupby with aggregation:
# Common AI mistake: not resetting the index
result = df.groupby('department')['salary'].mean()
# Returns a Series with department as index, not a clean DataFrame
# Correct for most downstream use:
result = df.groupby('department')['salary'].mean().reset_index()
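To see the difference concretely, a small sketch (assumes pandas is installed; the department data is invented):

```python
# groupby with and without reset_index; assumes pandas is available.
# The data is invented for the example.
import pandas as pd

df = pd.DataFrame({
    "department": ["eng", "eng", "sales"],
    "salary": [100, 110, 90],
})

# Without reset_index: a Series whose index is the department.
as_series = df.groupby("department")["salary"].mean()
print(type(as_series).__name__)  # Series

# With reset_index: a flat DataFrame with ordinary columns, which is
# what most downstream merges, exports, and plots expect.
as_frame = df.groupby("department")["salary"].mean().reset_index()
print(type(as_frame).__name__)   # DataFrame
print(list(as_frame.columns))    # ['department', 'salary']
```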
Merging DataFrames: The pandas equivalent of SQL JOINs. pd.merge(df1, df2, on='id', how='left'): the how parameter mirrors SQL JOIN semantics. An AI that defaults to how='inner' (the pandas default) when the task requires preserving all rows from one table is making the same semantic error as the INNER JOIN case above.
Handling missing data: df.dropna() versus df.fillna(0) have very different effects on aggregate statistics. An AI that silently drops NaN rows before computing a mean may produce a misleading result. The task might require acknowledging that data is missing rather than removing it.
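The two strategies produce different means on the same column. A sketch with invented data (assumes pandas is installed):

```python
# dropna vs fillna(0) and their effect on a mean; assumes pandas.
# The score column is invented: two observed values, two missing.
import pandas as pd

df = pd.DataFrame({"score": [10.0, 20.0, None, None]})

# dropna(): the mean is computed over the 2 observed values only.
mean_drop = df.dropna()["score"].mean()
print(mean_drop)  # 15.0

# fillna(0): the NaNs become zeros and pull the mean down.
mean_fill = df.fillna(0)["score"].mean()
print(mean_fill)  # 7.5

# Note: pandas' mean() skips NaN by default (skipna=True), so the plain
# column mean silently matches the dropna() result -- another quiet choice.
print(df["score"].mean())  # 15.0
```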
Pandas is covered here at the level youβll encounter in most annotation tasks. If a task requires deeper pandas work (complex transformations, time series, multi-index operations), treat it as Specialist-tier and evaluate accordingly.
Common SQL evaluation mistakes by annotators
Accepting a query that looks right. SQL that executes and returns rows is not necessarily correct. Always reason about what the query should return and whether the approach achieves it.
Ignoring NULL handling. NULL semantics are non-obvious and AI models get them wrong frequently. Check any WHERE clause involving inequality or NOT IN for potential NULL issues.
Not reading the schema. An AI may write a perfectly syntactic query against a different table structure than the one provided. Check column names, foreign key paths, and table relationships before evaluating the query logic.
Accepting the sample output as proof. As the Entry Simulation showed: a query can return the correct result on clean test data and produce wrong results on real data. Sample output is a starting point, not a verdict.
Quick Reference
- The silent failure pattern: AI SQL errors that produce no error message and return plausible-but-wrong results are the hardest to catch. INNER JOIN silently drops rows, NOT IN silently returns nothing when NULLs are present, and WHERE on an aggregate raises a syntax error; only the last one is visible without reasoning through the query.
- NULL is unknown, not zero: Any comparison with NULL returns NULL, not TRUE or FALSE. WHERE col != 'x' excludes NULL rows. NOT IN with a subquery returns zero results if the subquery contains any NULL. Always check NULL handling in inequality filters and subqueries.
- Read the schema before the query: Column names, foreign key paths, and table structure must match the query. A syntactically valid query that references the wrong table or a hallucinated column name is a correctness failure, and you won't catch it without reading the schema first.