SQL appears in roughly 95% of data analyst interviews, and it’s the most common reason candidates are rejected. The good news: the questions aren’t as varied as they look. From DataLemur and StrataScratch data on thousands of screens at Amazon, Meta, Google, Uber, Stripe, Airbnb, DoorDash, LinkedIn, and Walmart, seven patterns account for roughly 80% of what gets asked. Learn these cold and you’re prepared for nearly any SQL interview, from entry level through mid-level roles.
Each pattern below follows the same shape: the verbatim FAANG-style framing, the canonical solution, and the trap variant interviewers use to separate candidates who memorized syntax from candidates who’ve been bitten by real data. Where it fits the schema, there’s also an exercise against the dim_customers / dim_campaigns / fact_sends / fact_purchases star schema that the other caseSQL tutorials use.
Why these seven
Interviewers aren’t testing whether you’ve memorized SQL syntax — reference docs exist. They’re testing whether you can recognize the *shape* of a business question and map it to a query pattern without stalling. These seven patterns are the shapes that come up over and over because they each answer a category of real analyst question: “top N by something,” “trend over time,” “find pairs,” “find who’s missing,” “dedupe,” “pivot,” “bucket by time.”
Pattern 1 — Top-N per group
The single most-asked pattern. Verbatim examples from the wild: *"Find the 2nd highest-grossing product in each Amazon category"* (Amazon), *"Find each user’s 3rd transaction on the Uber platform"* (Uber), *"For every product keyword, find products with at least one position with more than 80% highly relevant displays"* (Etsy). All the same shape: partition by a grouping column, order by a ranking metric, pick the Nth row per partition.
The canonical solution is a window function wrapped in a CTE, then filtered on the rank:
WITH ranked AS (
SELECT
product_id,
category,
revenue,
ROW_NUMBER() OVER (
PARTITION BY category
ORDER BY revenue DESC
) AS rn
FROM products
)
SELECT product_id, category, revenue
FROM ranked
WHERE rn = 2; -- 2nd highest-grossing per category
Find the second-highest purchase amount for each customer. If a customer has only one purchase, exclude them.
fact_purchases(customer_id, purchase_amount, purchased_at)
Rows with customer_id and their 2nd-highest purchase_amount. Customers with a single purchase don't appear.
Show solution
WITH ranked AS (
SELECT
customer_id,
purchase_amount,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY purchase_amount DESC
) AS rn
FROM fact_purchases
)
SELECT customer_id, purchase_amount
FROM ranked
WHERE rn = 2;
Pattern 2 — Running totals and rolling averages
Every product screen and every finance screen uses this pattern. Verbatim: *"What’s the 7-day rolling average of Daily Active Users?"* (Meta), *"Maintain a cumulative transaction balance over time per account"* (Stripe), *"For each user who posted at least twice in 2024, find the number of days between their first and last post"* (Meta).
The canonical solution is an aggregate like SUM() or AVG() with OVER (PARTITION BY ... ORDER BY ... ROWS BETWEEN ...) for rolling windows, and LAG() for period-over-period deltas:
-- 7-day rolling average of DAU
SELECT
event_date,
AVG(dau) OVER (
ORDER BY event_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS dau_7d_avg -- first 6 rows average over fewer than 7 days
FROM daily_active_users
ORDER BY event_date;
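The windowed form is easy to sanity-check on a toy table. A minimal sketch using Python’s sqlite3 (window functions need SQLite 3.25+); the daily_active_users table and its sample values are invented here to match the example’s column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_active_users (event_date TEXT, dau INTEGER);
INSERT INTO daily_active_users VALUES
  ('2025-01-01', 100), ('2025-01-02', 110), ('2025-01-03', 90),
  ('2025-01-04', 120), ('2025-01-05', 130), ('2025-01-06', 80),
  ('2025-01-07', 140), ('2025-01-08', 105);
""")

rows = conn.execute("""
SELECT
  event_date,
  AVG(dau) OVER (
    ORDER BY event_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS dau_7d_avg
FROM daily_active_users
ORDER BY event_date
""").fetchall()

# Day 7 is the first full 7-day window: (100+110+90+120+130+80+140)/7 = 110.0.
# Day 8 drops day 1 from the frame.
for event_date, avg_7d in rows:
    print(event_date, round(avg_7d, 2))
```

Note the early rows: the frame clips at the start of the table, so the first six dates average over fewer than seven days. Saying that out loud is exactly the kind of trade-off narration interviewers listen for.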
-- Week-over-week delta
SELECT
event_date,
dau,
dau - LAG(dau, 7) OVER (ORDER BY event_date) AS wow_delta
FROM daily_active_users;
Pattern 3 — Self-joins for pair detection
Pairs of rows from the same table, usually with a time or identity constraint between them. Verbatim: *"Find frequently purchased pairs of items"* (Walmart), *"Identify repeated payments at the same merchant with the same credit card for the same amount within 10 minutes"* (Stripe), *"Detect overlapping subscription periods per customer"* (subscription businesses).
-- Item pairs bought together in the same order
SELECT
a.item_id AS item_a,
b.item_id AS item_b,
COUNT(*) AS pair_count
FROM order_items a
JOIN order_items b
ON a.order_id = b.order_id
AND a.item_id < b.item_id -- < excludes self-pairs and keeps one row per pair
GROUP BY a.item_id, b.item_id
ORDER BY pair_count DESC
LIMIT 20;
Pattern 4 — LEFT JOIN with NULL filter ("who never did X")
Find rows in one table with no matching rows in another. Verbatim: *"Find users who registered but never made a purchase"*, *"Find neighborhoods with zero users"*, *"Find customers who placed their first order but never a second"* (retention / churn).
SELECT u.id, u.email
FROM users u
LEFT JOIN purchases p ON p.user_id = u.id
WHERE p.user_id IS NULL;
Find customers who have received at least one campaign send but have never made a purchase.
dim_customers(id, email), fact_sends(customer_id, campaign_id, sent_at), fact_purchases(customer_id, purchased_at)
One row per qualifying customer: id, email. Ordered by id.
Show solution
SELECT DISTINCT c.id, c.email
FROM dim_customers c
JOIN fact_sends s ON s.customer_id = c.id
LEFT JOIN fact_purchases p ON p.customer_id = c.id
WHERE p.customer_id IS NULL
ORDER BY c.id;
Pattern 5 — Deduplication keeping the latest
Given duplicate rows, keep only the most recent record per entity. Verbatim: *"Keep only the most recent update per user_id"* (data cleaning, every team), *"Identify duplicate job listings accidentally posted by the same company"* (LinkedIn).
WITH latest AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY updated_at DESC
) AS rn
FROM user_snapshots
)
SELECT *
FROM latest
WHERE rn = 1;
Alternatives to mention: Postgres’ SELECT DISTINCT ON (user_id) ... ORDER BY user_id, updated_at DESC, or a correlated subquery with MAX(updated_at). Narrate why you picked the window-function form: it's dialect-portable and reads as one scan instead of a correlated per-row lookup.
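The correlated-subquery alternative is worth having at your fingertips too. A sketch on a toy table via Python’s sqlite3; the email column and the sample rows are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_snapshots (user_id INTEGER, email TEXT, updated_at TEXT);
INSERT INTO user_snapshots VALUES
  (1, 'old@a.com', '2025-01-01'),
  (1, 'new@a.com', '2025-03-01'),
  (2, 'b@b.com',   '2025-02-15');
""")

# Keep rows whose updated_at equals the per-user maximum.
# Unlike ROW_NUMBER(), a tie on updated_at keeps BOTH rows.
rows = conn.execute("""
SELECT user_id, email, updated_at
FROM user_snapshots s
WHERE updated_at = (
  SELECT MAX(updated_at)
  FROM user_snapshots
  WHERE user_id = s.user_id
)
ORDER BY user_id
""").fetchall()

print(rows)  # [(1, 'new@a.com', '2025-03-01'), (2, 'b@b.com', '2025-02-15')]
```

The tie behavior is the narration point: ROW_NUMBER() picks one winner arbitrarily on ties, while the MAX() filter keeps every tied row.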
Pattern 6 — Pivoting with conditional aggregation
Turn rows into columns with SUM + CASE. Verbatim: *"Build a quarterly spend report with one column per department"* (finance), *"Compute the average star rating per product per month, with months as columns"* (Amazon).
SELECT
DATE_TRUNC('quarter', booked_at) AS quarter,
SUM(CASE WHEN dept = 'IT' THEN amount END) AS it_spend,
SUM(CASE WHEN dept = 'HR' THEN amount END) AS hr_spend,
SUM(CASE WHEN dept = 'Eng' THEN amount END) AS eng_spend,
SUM(CASE WHEN dept NOT IN ('IT','HR','Eng') THEN amount END) AS other_spend -- rows with NULL dept fall out of NOT IN
FROM expenses
GROUP BY 1
ORDER BY 1;
Pattern 7 — Date truncation and time bucketing
Aggregate a timestamp column to a coarser time grain. Verbatim: *"For each month, count Facebook users who created their account that month"* (Meta), *"Aggregate transactions into weekly time series"*, *"Compute month-over-month growth"*.
SELECT
DATE_TRUNC('month', created_at) AS signup_month,
COUNT(*) AS new_users
FROM users
GROUP BY 1
ORDER BY 1;
Eight trick questions to memorize
Beyond the seven patterns, a short list of single-answer gotchas shows up in virtually every screen. These aren’t patterns — they’re tests of whether you've actually been bitten by real data. Each one is worth rehearsing until your answer is reflexive:
- NULLs in aggregates. COUNT(*) counts all rows; COUNT(col) counts non-NULL. AVG(col) ignores NULLs, so AVG(col) ≠ SUM(col)/COUNT(*) when NULLs exist.
- GROUP BY and NULL. NULLs group together as a single bucket, even though NULL = NULL is UNKNOWN elsewhere.
- Inclusive vs exclusive date bounds. BETWEEN '2025-01-01' AND '2025-01-31' silently drops events after 00:00 on Jan 31 if the column is a timestamp. Prefer >= start AND < next_start.
- HAVING vs WHERE. WHERE filters rows before aggregation; HAVING filters groups after. An aggregate function in WHERE is an instant fail.
- Row explosion from joins. A JOIN that introduces duplicates makes COUNT(user_id) overstate; use COUNT(DISTINCT user_id) or aggregate before joining.
- LEFT JOIN → INNER via WHERE. A filter on the right table in the WHERE clause silently converts the LEFT JOIN to an inner join. Put it in the ON clause instead.
- Integer division. SUM(successes) / COUNT(*) returns 0 in Postgres and SQL Server when both are integers (MySQL's / already returns a decimal). Cast one side: SUM(successes) * 1.0 / COUNT(*).
- Timezones. Converting UTC event timestamps to user-local time: know AT TIME ZONE in Postgres, and be ready to discuss DST.
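Two of these gotchas, NULLs in aggregates and integer division, can be verified in a few lines. A minimal sketch using Python’s sqlite3, whose semantics for both match Postgres; the table and values are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (score INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(10,), (None,), (20,)])

count_star, count_col, avg_col = conn.execute(
    "SELECT COUNT(*), COUNT(score), AVG(score) FROM t"
).fetchone()
# COUNT(*) sees 3 rows, COUNT(score) sees 2, and AVG ignores the NULL,
# so AVG(score) is 15.0, not SUM/COUNT(*) = 10.
print(count_star, count_col, avg_col)  # 3 2 15.0

# Integer division truncates in SQLite, as in Postgres:
int_div, float_div = conn.execute(
    "SELECT 1 / 2, 1 * 1.0 / 2"
).fetchone()
print(int_div, float_div)  # 0 0.5
```

Running a five-line check like this once is worth more than re-reading the rule: the numbers stick.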
If you recognize only three or four of those as things you'd catch live, your next prep session should be the traps category — these are exactly the gotchas built into the explainers there.
How to practice these for real
Recognition is cheap; *reflex* is the goal. You want the mental move from "this is a top-N per group" to a written window-function CTE to happen in under 15 seconds on a whiteboard. Three concrete practice loops:
- Drill the patterns against realistic data. The caseSQL mission path at /path/marketing is set inside a simulated marketing team — every mission uses pattern 1, 2, 4, 5, 6, or 7 against a star schema with planted data-quality issues. Cheaper and stickier than abstract puzzles because the queries answer a business question.
- Drill the traps separately. The Interview Prep → Traps category has single-answer explainers for each of the eight gotchas above. Do them once a week until the answers are automatic.
- Rehearse narration out loud. The thing you're testing is whether you can *explain* the pattern choice, not whether you can type it. Open the prep guide for the hidden rubric interviewers actually grade on — narrating trade-offs is on the list.