Prerequisites: This module assumes you can read Python code and understand basic programming concepts: variables, loops, functions, conditionals. If you've never written Python before, work through any free beginner Python course before coming back. This module won't teach you to program from scratch. It will teach you to evaluate code the way a professional reviewer does.
Before you read anything: make a call
A user asked an AI: "Write a Python function that returns the second largest number in a list."
The AI produced this code and this explanation:
```python
def second_largest(nums):
    nums.sort(reverse=True)
    return nums[1]
```
AI's explanation: "This function sorts the list in descending order and returns the element at index 1, which is the second largest number."
Is the AIβs explanation correct? Is the code correct?
See the answer
The explanation is accurate for the happy path. The code has three bugs.
The AI's explanation correctly describes what the code does when given a well-formed list of distinct numbers with at least two elements. That's the happy path. Reviewers who only check the happy path miss what gets code rejected.
Bug 1: Empty or single-element list. If `nums` is `[]` or `[5]`, then `nums[1]` raises an `IndexError`. The function has no guard for this. A task asking for "the second largest" implicitly requires handling the case where no second largest exists.
Bug 2: Duplicate values. If `nums = [5, 5, 3]`, sorting gives `[5, 5, 3]` and `nums[1]` returns `5`. But the second largest distinct value is `3`. Whether this is a bug depends on the task spec, but it's the kind of ambiguity that must be flagged, not assumed away.
Bug 3: Input mutation. `nums.sort()` sorts the list in place, modifying the caller's original list. This is an undocumented side effect. The function should sort a copy: `sorted(nums, reverse=True)`.
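A quick way to see the side effect, using the buggy version above:
```python
scores = [3, 1, 2]
second_largest(scores)   # returns 2, but also...
print(scores)            # [3, 2, 1]: the caller's list was silently reordered

# sorted() returns a new list and leaves the original untouched
top_two = sorted(scores, reverse=True)[:2]
```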
A correct implementation:
```python
def second_largest(nums: list) -> int | None:
    unique = sorted(set(nums), reverse=True)
    return unique[1] if len(unique) >= 2 else None
```
This is what code review annotation looks like. The AI's prose explanation wasn't wrong. It accurately described the code's behavior. But "accurately describes a buggy function" is not a passing rationale.
What coding tasks look like on AI training platforms
Coding annotation work falls into two categories, and they require different skills.
Algorithms tasks: You solve or evaluate algorithmic problems: sorting, searching, dynamic programming, graph traversal. The AI generates a solution, and you verify whether it's correct, optimal, and handles edge cases. Or you write a reference solution from scratch that the model trains on. These tasks require genuine Python fluency and pay at the Specialist tier.
Code review sessions: You receive AI-generated code and evaluate it across multiple dimensions: correctness, time and space complexity, code style, edge case handling, and sometimes security. Less writing, more reading. These appear across Specialist and Subject Matter Expert tiers depending on the domain.
Both task types require genuine Python fluency. Not tutorial-level familiarity, but the ability to mentally execute code, identify failure modes, and reason about performance.
Algorithmic complexity: what you must know cold
Every algorithms task touches Big O notation. You need to recognize common complexities on sight:
| Complexity | Example pattern |
|---|---|
| O(1) | Hash table lookup, array index access |
| O(log n) | Binary search, balanced BST operations |
| O(n) | Single loop over input, linear scan |
| O(n log n) | Merge sort, heap sort, the best a comparison sort can achieve |
| O(n²) | Nested loops, bubble sort, insertion sort (worst case) |
| O(2ⁿ) | Recursive subset generation, naive Fibonacci |
When evaluating AI-generated code, check both time complexity and space complexity. An AI solution that solves a problem in O(n²) when O(n log n) is achievable is not wrong, but it is suboptimal, which matters at scale and should be noted in your rationale. An AI that implements "find duplicates in a list" using two nested loops gives O(n²) when a hash set achieves O(n) time at the cost of O(n) space. Your annotation should flag the suboptimal complexity and explain the better approach.
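A sketch of the two approaches that paragraph describes; the function names here are chosen for illustration:
```python
# Quadratic: compares every pair of elements.
def find_duplicates_slow(nums):
    dupes = []
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] == nums[j] and nums[i] not in dupes:
                dupes.append(nums[i])
    return dupes

# Linear time, at the cost of O(n) extra space: one pass with a seen-set.
def find_duplicates_fast(nums):
    seen, dupes = set(), set()
    for n in nums:
        if n in seen:
            dupes.add(n)
        seen.add(n)
    return list(dupes)
```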
Try It: O(n²) vs. O(n): is it wrong?
You're reviewing two AI implementations of a function that checks whether any two numbers in a list sum to a target value. Both produce correct output on all test cases.
Response A:
```python
def has_pair_sum(nums, target):
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return True
    return False
```
Response B:
```python
def has_pair_sum(nums, target):
    seen = set()
    for num in nums:
        if target - num in seen:
            return True
        seen.add(num)
    return False
```
Is Response A wrong in a code review context? How should you score it relative to Response B?
See answer
Response A is not wrong, but it is suboptimal and should be flagged.
Response A is O(n²) time due to the nested loops. For small inputs this is unnoticeable, but at scale it degrades significantly. Response B is O(n) time using a hash set. It trades O(n) space for a linear-time solution.
In a code review annotation context, correctness and complexity are separate criteria. Response A passes the correctness criterion (produces correct output) but fails the complexity criterion (O(n²) when O(n) is straightforwardly achievable).
Your rationale should: confirm correctness, identify the nested-loop pattern and its O(n²) complexity, explain that a hash-set approach achieves O(n) by trading O(n) space, and note that Response B demonstrates the preferred approach.
Do not mark Response A as "incorrect"; that misrepresents the evaluation. Mark it as correct but suboptimal, and score it lower on the complexity criterion. These are different rubric dimensions.
Writing clean Python: the standards code reviewers apply
PEP 8 is the official Python style guide and the baseline for every code review rubric. The rules most commonly violated in AI-generated code:
- 4-space indentation, not tabs
- `snake_case` for variables and functions (`my_variable`, not `myVariable`)
- `PascalCase` for class names
- Maximum line length of 79 characters (79–99 is acceptable in many projects)
- Two blank lines between top-level function or class definitions
Naming matters more than most annotators flag it. `for i in range(len(lst))` should almost always be `for item in lst`. Variable names like `x`, `tmp`, and `data` in functions longer than three lines are a signal that the AI generated plausible-looking code without thinking about readability.
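An illustrative before-and-after, with hypothetical names, showing the kind of rewrite a style rationale might point to:
```python
# Flagged: camelCase name, index-based loop, cryptic accumulator.
def getTotals(data):
    tmp = 0
    for i in range(len(data)):
        tmp += data[i]
    return tmp

# Preferred: snake_case, direct iteration, names that say what things are.
def get_total(prices):
    total = 0
    for price in prices:
        total += price
    return total
```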
Anti-patterns to flag on sight, and what goes wrong with each:
`except:` with no exception type catches everything, including `KeyboardInterrupt` and `SystemExit`. The program can no longer be interrupted cleanly. Almost always wrong; the correct form specifies what to catch: `except ValueError:`.
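A minimal illustration of the difference:
```python
# Anti-pattern: the bare except also swallows Ctrl+C (KeyboardInterrupt),
# so this loop cannot be interrupted cleanly.
def parse_all(values):
    results = []
    for v in values:
        try:
            results.append(int(v))
        except:
            pass
    return results

# Correct: name the failure you actually expect.
def parse_all_fixed(values):
    results = []
    for v in values:
        try:
            results.append(int(v))
        except ValueError:
            pass
    return results
```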
Mutable default arguments (like `def f(lst=[]):`) create a shared object across all calls. The first call that appends to `lst` will see those items on every subsequent call. It's one of Python's most counterintuitive bugs and shows up regularly in AI-generated code.
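The bug in action, with the standard `None`-sentinel fix:
```python
def append_item(item, lst=[]):    # the [] is created once, at definition time
    lst.append(item)
    return lst

print(append_item(1))  # [1]
print(append_item(2))  # [1, 2] <- the "empty" default remembered the first call

# Fix: use None as a sentinel and build a fresh list per call.
def append_item_fixed(item, lst=None):
    if lst is None:
        lst = []
    lst.append(item)
    return lst
```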
`is` for value comparison (like `if x is 5:`) checks object identity, not equality. It may pass for small integers (CPython caches them) but fails unpredictably for larger values or strings. The correct form is `==`.
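A short demonstration; the result of the first comparison is a CPython implementation detail, which is exactly why relying on it is a bug:
```python
a = 5
b = int("5")
print(a is b)   # True, but only because CPython caches small integers

a = 1000
b = int("1000")
print(a is b)   # False: same value, different objects
print(a == b)   # True: == compares values, which is what the code means
```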
Spotting bugs: a systematic approach
Don't mentally run code only on the happy path. Apply this checklist:
- Off-by-one errors: Loop bounds, slice indices, length checks. `range(n)` stops at n-1; `lst[:-1]` excludes the last element.
- Empty input: What happens when the input list is empty? When a string is empty? When n=0?
- Type errors: The function expects an int but might receive a string. The AI may not validate types.
- Mutation of input: Does the function modify its arguments without documenting it? Often it shouldn't.
- Implicit None returns: A function that returns `None` on an unhandled branch, without documenting it, is a common AI bug.
- Performance inside a loop: `in` checks on lists, repeated `.sort()` calls, string concatenation in a loop. Each is O(n) inside a loop, making the whole function O(n²). See the sketch after this list.
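A hedged sketch of that last item, with hypothetical function names:
```python
# O(n^2): `w not in seen` scans a list, and += copies the whole string
# accumulated so far, on every iteration.
def dedupe_slow(words):
    seen = []
    out = ""
    for w in words:
        if w not in seen:
            seen.append(w)
            out += w + " "
    return out.strip()

# O(n): set membership is O(1), and join builds the string once.
def dedupe_fast(words):
    seen = set()
    kept = []
    for w in words:
        if w not in seen:
            seen.add(w)
            kept.append(w)
    return " ".join(kept)
```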
Try It: spot the bugs
You're reviewing this AI-generated Python function. The task was: "Write a function that takes a list of words and returns a dictionary mapping each word to the number of times it appears. Words should be case-insensitive."
```python
def word_count(words):
    counts = {}
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts
```
Identify all bugs or issues in this implementation.
See answer
There are two issues:
1. Case-insensitivity is not implemented. The task explicitly required case-insensitive counting: `"Hello"` and `"hello"` should map to the same key. The function compares words as-is, so `"Hello"` and `"hello"` produce separate entries. The fix: normalize on ingestion with `word.lower()`.
2. The manual dictionary pattern is unnecessary. The logic is correct but verbose; Python's `collections.Counter` or the `dict.get()` pattern handles this more cleanly. This is a style/quality issue, not a correctness bug, but worth noting in the rationale.
A correct implementation:
```python
def word_count(words: list[str]) -> dict[str, int]:
    counts = {}
    for word in words:
        word = word.lower()
        counts[word] = counts.get(word, 0) + 1
    return counts
```
Or more concisely:
```python
from collections import Counter

def word_count(words: list[str]) -> dict[str, int]:
    return Counter(word.lower() for word in words)
```
The case-insensitivity bug is the one that matters. It's a direct spec violation. The verbosity is a quality note. Score them on separate criteria and explain both in your rationale.
Security issues in code review
For code review sessions involving backend or scripting code, security awareness matters. You don't need to be a security engineer. You need to recognize the pattern, name the vulnerability class, and explain why it's dangerous.
SQL injection is the most common critical flag. Any code that builds SQL queries with string formatting is vulnerable:
```python
query = f"SELECT * FROM users WHERE name = '{name}'"
```
A crafted input can manipulate the query logic entirely. Parameterized queries are the required fix. The value is passed separately, treated as data, not executable SQL.
Hardcoded credentials are a critical violation on sight:
```python
API_KEY = "sk-abc123secretkey"
DB_PASSWORD = "hunter2"
```
Keys and passwords in source code get committed to version control, shared, and leaked. Flag immediately. No context needed.
Arbitrary code execution via `eval()` or `exec()` on user-supplied input is almost always dangerous. Flag any use of these on untrusted data.
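When the input is supposed to be a data literal, `ast.literal_eval` is the standard safe substitute:
```python
import ast

user_input = "[1, 2, 3]"

# eval(user_input) would execute arbitrary code, e.g.
# "__import__('os').system(...)". Never run it on untrusted data.
value = ast.literal_eval(user_input)  # accepts only Python literals
print(value)  # [1, 2, 3]
```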
Path traversal: file operations using user-supplied paths without sanitization can allow an attacker to read or write files outside the intended directory. `open(user_input)` with no validation is the pattern to catch; a sketch of the defensive pattern follows.
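A minimal defensive sketch, assuming a hypothetical upload directory; `Path.is_relative_to` requires Python 3.9+:
```python
from pathlib import Path

BASE_DIR = Path("/srv/app/uploads").resolve()  # hypothetical allowed root

def read_upload(filename: str) -> str:
    # Resolve symlinks and ".." segments, then verify the result is still
    # inside BASE_DIR; this rejects inputs like "../../etc/passwd".
    target = (BASE_DIR / filename).resolve()
    if not target.is_relative_to(BASE_DIR):
        raise ValueError("path escapes the allowed directory")
    return target.read_text()
```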
Try It: spot the security vulnerability
You're reviewing a Python backend function that queries a database. The task asked the AI to write a function that retrieves a user record by username.
```python
def get_user(username):
    conn = get_db_connection()
    cursor = conn.cursor()
    query = f"SELECT * FROM users WHERE username = '{username}'"
    cursor.execute(query)
    return cursor.fetchone()
```
Identify the security vulnerability, name its class, and explain what a correct implementation looks like.
See answer
Vulnerability: SQL Injection.
The function constructs the SQL query by directly embedding a user-supplied string (`username`) into the query using an f-string. An attacker can pass a crafted username to manipulate the query logic.
If `username = "admin' --"`, the executed query becomes:
```sql
SELECT * FROM users WHERE username = 'admin' --'
```
The `--` comments out the rest of the query. If this lookup backs a login check, the attacker retrieves the admin record without knowing the password.
Correct implementation using parameterized queries:
```python
def get_user(username):
    conn = get_db_connection()
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM users WHERE username = ?", (username,))
    return cursor.fetchone()
```
The `?` placeholder passes the value separately, ensuring it's treated as data, not executable SQL. The database driver handles the escaping.
This is a mandatory flag in any code review task. Mark it as a critical security defect, not a style issue or a complexity issue: it is a separate rubric dimension, and a hard fail.
What "code review" means as an annotation task
In a code review annotation session, you receive a prompt describing what the code should do, AI-generated code attempting to fulfill it, and a rubric with criteria: correctness, complexity, style, edge case handling, documentation. Your job is not to rewrite the code. It is to evaluate it on each criterion with specific, evidence-based rationale.
"The code is not very efficient" fails the rationale standard. "The code uses a nested loop (lines 8–12) giving O(n²) time complexity; a hash-set approach would achieve O(n)" meets it.
The best code reviewers think like engineers: they run the code mentally on normal inputs, boundary inputs (empty, single element, maximum size), and adversarial inputs (duplicates, negatives, None). They score each criterion separately, because passing one doesn't excuse failing another.
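That discipline, turned into a quick probe; this assumes the corrected second_largest from the opening exercise is in scope:
```python
cases = [
    ([3, 1, 2], 2),        # normal input
    ([], None),            # boundary: empty
    ([7], None),           # boundary: single element
    ([5, 5, 3], 3),        # adversarial: duplicates
    ([-1, -2, -3], -2),    # adversarial: negatives
]
for nums, expected in cases:
    result = second_largest(list(nums))  # pass a copy to catch mutation too
    assert result == expected, f"{nums}: expected {expected}, got {result}"
```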
Quick Reference
- Correctness before everything else: Check that the code produces correct output for normal inputs, empty inputs, and edge cases before evaluating style or complexity. A beautifully formatted function with a bug in the empty-input case fails the task.
- Complexity and correctness are separate criteria: An O(n²) solution that produces correct output is not "wrong". It is correct and suboptimal. Score it that way. Do not conflate the two rubric dimensions.
- The bugs that appear most often in AI-generated Python: Missing edge case guards (empty list, single element), input mutation (sorting in place without documenting it), and O(n) operations inside loops that make the whole function O(n²).