
The Seven SQL Patterns That Cover 80% of Data Analyst Interviews

Every FAANG SQL screen draws from the same seven patterns. This is the primer — the verbatim questions, the canonical solutions, and the trap variants that separate strong candidates from the rest.


SQL appears in roughly 95% of data analyst interviews, and it’s the most common reason candidates are rejected. The good news: the questions aren’t as varied as they look. From DataLemur and StrataScratch data on thousands of screens at Amazon, Meta, Google, Uber, Stripe, Airbnb, DoorDash, LinkedIn, and Walmart, seven patterns account for roughly 80% of what gets asked. Learn these cold and you’re prepared for nearly anything an entry-level or mid-level SQL interview will throw at you.

Each pattern below follows the same shape: the verbatim FAANG-style framing, the canonical solution, and the trap variant interviewers use to separate candidates who memorized syntax from candidates who’ve been bitten by real data. Where it fits the schema, there’s also an exercise against the dim_customers / dim_campaigns / fact_sends / fact_purchases star schema that the other caseSQL tutorials use.

Why these seven

Interviewers aren’t testing whether you’ve memorized SQL syntax — reference docs exist. They’re testing whether you can recognize the *shape* of a business question and map it to a query pattern without stalling. These seven patterns are the shapes that come up over and over because they each answer a category of real analyst question: “top N by something,” “trend over time,” “find pairs,” “find who’s missing,” “dedupe,” “pivot,” “bucket by time.”

Pattern 1 — Top-N per group

The single most-asked pattern. Verbatim examples from the wild: *"Find the 2nd highest-grossing product in each Amazon category"* (Amazon), *"Find each user’s 3rd transaction on the Uber platform"* (Uber), *"For every product keyword, find products with at least one position with more than 80% highly relevant displays"* (Etsy). All the same shape: partition by a grouping column, order by a ranking metric, pick the Nth row per partition.

The canonical solution is a window function wrapped in a CTE, then filtered on the rank:

WITH ranked AS (
  SELECT
    product_id,
    category,
    revenue,
    ROW_NUMBER() OVER (
      PARTITION BY category
      ORDER BY revenue DESC
    ) AS rn
  FROM products
)
SELECT product_id, category, revenue
FROM ranked
WHERE rn = 2;   -- 2nd highest-grossing per category
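
A frequent follow-up is ties: ROW_NUMBER() picks one arbitrary row among tied revenues, so be ready to swap in DENSE_RANK() if the interviewer cares about ties. The Uber question above is the same shape with a chronological ordering. A minimal sketch, assuming a hypothetical transactions(user_id, transaction_id, created_at) table:

-- 3rd transaction per user, ordered chronologically
WITH ordered AS (
  SELECT
    user_id,
    transaction_id,
    created_at,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY created_at
    ) AS rn
  FROM transactions
)
SELECT user_id, transaction_id, created_at
FROM ordered
WHERE rn = 3;   -- users with fewer than 3 transactions drop out
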
Exercise

Find the second-highest purchase amount for each customer. If a customer has only one purchase, exclude them.

Schema hint

fact_purchases(customer_id, purchase_amount, purchased_at)

Expected

Rows with customer_id and their 2nd-highest purchase_amount. Customers with a single purchase don't appear.

Solution
WITH ranked AS (
  SELECT
    customer_id,
    purchase_amount,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY purchase_amount DESC
    ) AS rn
  FROM fact_purchases
)
SELECT customer_id, purchase_amount
FROM ranked
WHERE rn = 2;

Pattern 2 — Running totals and rolling averages

Every product screen and every finance screen uses this pattern. Verbatim: *"What’s the 7-day rolling average of Daily Active Users?"* (Meta), *"Maintain a cumulative transaction balance over time per account"* (Stripe), *"For each user who posted at least twice in 2024, find the number of days between their first and last post"* (Meta).

The canonical solution is SUM() OVER (PARTITION BY ... ORDER BY ... ROWS BETWEEN ...) for windows, and LAG() for period-over-period deltas:

-- 7-day rolling average of DAU
SELECT
  event_date,
  AVG(dau) OVER (
    ORDER BY event_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS dau_7d_avg
FROM daily_active_users
ORDER BY event_date;

-- Week-over-week delta
SELECT
  event_date,
  dau,
  dau - LAG(dau, 7) OVER (ORDER BY event_date) AS wow_delta
FROM daily_active_users;
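
The Stripe-style cumulative balance is the running-total half of the pattern. A minimal sketch, assuming a hypothetical transactions(account_id, amount, transacted_at) table:

-- Cumulative balance per account over time
SELECT
  account_id,
  transacted_at,
  SUM(amount) OVER (
    PARTITION BY account_id
    ORDER BY transacted_at
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_balance
FROM transactions
ORDER BY account_id, transacted_at;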

Pattern 3 — Self-joins for pair detection

Pairs of rows from the same table, usually with a time or identity constraint between them. Verbatim: *"Find frequently purchased pairs of items"* (Walmart), *"Identify repeated payments at the same merchant with the same credit card for the same amount within 10 minutes"* (Stripe), *"Detect overlapping subscription periods per customer"* (subscription businesses).

-- Item pairs bought together in the same order
SELECT
  a.item_id AS item_a,
  b.item_id AS item_b,
  COUNT(*) AS pair_count
FROM order_items a
JOIN order_items b
  ON  a.order_id = b.order_id
  AND a.item_id  < b.item_id   -- < drops self-pairs and counts each pair once
GROUP BY a.item_id, b.item_id
ORDER BY pair_count DESC
LIMIT 20;
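
The Stripe repeated-payments variant is the same self-join with a time constraint added to the join condition. A minimal sketch, assuming a hypothetical payments(payment_id, merchant_id, card_id, amount, paid_at) table and Postgres interval syntax:

-- Charges that repeat an identical earlier charge within 10 minutes
SELECT COUNT(DISTINCT later.payment_id) AS repeated_payments
FROM payments earlier
JOIN payments later
  ON  earlier.merchant_id = later.merchant_id
  AND earlier.card_id     = later.card_id
  AND earlier.amount      = later.amount
  AND later.paid_at >  earlier.paid_at
  AND later.paid_at <= earlier.paid_at + INTERVAL '10 minutes';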

Pattern 4 — LEFT JOIN with NULL filter ("who never did X")

Find rows in one table with no matching rows in another. Verbatim: *"Find users who registered but never made a purchase"*, *"Find neighborhoods with zero users"*, *"Find customers who placed their first order but never a second"* (retention / churn).

SELECT u.id, u.email
FROM users u
LEFT JOIN purchases p ON p.user_id = u.id
WHERE p.user_id IS NULL;
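
Worth being able to narrate the equivalent NOT EXISTS form as well, since some interviewers ask for it by name; it states the intent "no matching purchase exists" directly:

SELECT u.id, u.email
FROM users u
WHERE NOT EXISTS (
  SELECT 1
  FROM purchases p
  WHERE p.user_id = u.id
);
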
Exercise

Find customers who have received at least one campaign send but have never made a purchase.

Schema hint

dim_customers(id, email), fact_sends(customer_id, campaign_id, sent_at), fact_purchases(customer_id, purchased_at)

Expected

One row per qualifying customer: id, email. Ordered by id.

Solution
SELECT DISTINCT c.id, c.email
FROM dim_customers c
JOIN fact_sends s ON s.customer_id = c.id
LEFT JOIN fact_purchases p ON p.customer_id = c.id
WHERE p.customer_id IS NULL
ORDER BY c.id;

Pattern 5 — Deduplication keeping the latest

Given duplicate rows, keep only the most recent record per entity. Verbatim: *"Keep only the most recent update per user_id"* (data cleaning, every team), *"Identify duplicate job listings accidentally posted by the same company"* (LinkedIn).

WITH latest AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY updated_at DESC
    ) AS rn
  FROM user_snapshots
)
SELECT *
FROM latest
WHERE rn = 1;

Alternatives to mention: Postgres’ SELECT DISTINCT ON (user_id) ... ORDER BY user_id, updated_at DESC, or a correlated subquery with MAX(updated_at). Narrate why you picked the window-function form: it's dialect-portable and reads as one scan instead of a correlated per-row lookup.
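
For reference, the Postgres-only form mentioned above looks like this; same result, but tied to one dialect:

-- Postgres: keep the most recent snapshot per user
SELECT DISTINCT ON (user_id) *
FROM user_snapshots
ORDER BY user_id, updated_at DESC;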

Pattern 6 — Pivoting with conditional aggregation

Turn rows into columns with SUM + CASE. Verbatim: *"Build a quarterly spend report with one column per department"* (finance), *"Compute the average star rating per product per month, with months as columns"* (Amazon).

SELECT
  DATE_TRUNC('quarter', booked_at) AS quarter,
  SUM(CASE WHEN dept = 'IT'  THEN amount END) AS it_spend,
  SUM(CASE WHEN dept = 'HR'  THEN amount END) AS hr_spend,
  SUM(CASE WHEN dept = 'Eng' THEN amount END) AS eng_spend,
  SUM(CASE WHEN dept NOT IN ('IT','HR','Eng') THEN amount END) AS other_spend
FROM expenses
GROUP BY 1
ORDER BY 1;
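
The Amazon variant swaps SUM for AVG and pivots on the month; AVG skips the NULLs the CASE produces, which is exactly why the trick works. A minimal sketch, assuming a hypothetical reviews(product_id, stars, reviewed_at) table and showing only Q1 columns:

-- Average star rating per product, one column per month
SELECT
  product_id,
  AVG(CASE WHEN EXTRACT(MONTH FROM reviewed_at) = 1 THEN stars END) AS jan_avg_stars,
  AVG(CASE WHEN EXTRACT(MONTH FROM reviewed_at) = 2 THEN stars END) AS feb_avg_stars,
  AVG(CASE WHEN EXTRACT(MONTH FROM reviewed_at) = 3 THEN stars END) AS mar_avg_stars
FROM reviews
GROUP BY product_id;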

Pattern 7 — Date truncation and time bucketing

Aggregate a timestamp column to a coarser time grain. Verbatim: *"For each month, count Facebook users who created their account that month"* (Meta), *"Aggregate transactions into weekly time series"*, *"Compute month-over-month growth"*.

SELECT
  DATE_TRUNC('month', created_at) AS signup_month,
  COUNT(*) AS new_users
FROM users
GROUP BY 1
ORDER BY 1;
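
Month-over-month growth chains this bucketing with LAG() from pattern 2; the 100.0 multiplier also sidesteps integer division (trap 7 below):

-- Month-over-month signup growth
WITH monthly AS (
  SELECT
    DATE_TRUNC('month', created_at) AS signup_month,
    COUNT(*) AS new_users
  FROM users
  GROUP BY 1
)
SELECT
  signup_month,
  new_users,
  ROUND(
    100.0 * (new_users - LAG(new_users) OVER (ORDER BY signup_month))
          / LAG(new_users) OVER (ORDER BY signup_month),
    1
  ) AS mom_growth_pct
FROM monthly
ORDER BY signup_month;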

Eight trick questions to memorize

Beyond the seven patterns, a short list of single-answer gotchas shows up in virtually every screen. These aren’t patterns — they’re tests of whether you've actually been bitten by real data. Each one is worth rehearsing until your answer is reflexive:

  1. NULLs in aggregates. COUNT(*) counts all rows; COUNT(col) counts non-NULL. AVG(col) ignores NULL, so AVG(col) ≠ SUM(col)/COUNT(*) when NULLs exist.
  2. GROUP BY and NULL. NULLs group together as a single bucket, even though NULL = NULL is UNKNOWN elsewhere.
  3. Inclusive vs exclusive date bounds. BETWEEN '2025-01-01' AND '2025-01-31' silently drops events after 00:00 on Jan 31 if the column is a timestamp. Prefer >= start AND < next_start.
  4. HAVING vs WHERE. WHERE filters rows before aggregation; HAVING filters groups after. Using an aggregate function in WHERE is an instant fail.
  5. Row explosion from joins. A JOIN that introduces duplicates makes COUNT(user_id) overstate; use COUNT(DISTINCT user_id) or aggregate before joining.
  6. LEFT JOIN → INNER via WHERE. Putting a filter on the right table in the WHERE clause silently converts the LEFT JOIN to an inner join. Put the filter in the ON clause instead (see the sketch after this list).
  7. Integer division. SUM(successes) / COUNT(*) returns 0 in Postgres and SQL Server when both operands are integers (MySQL's / returns a decimal). Cast or multiply one side: SUM(successes) * 1.0 / COUNT(*).
  8. Timezones. UTC event timestamps to user-local time: know AT TIME ZONE in Postgres, and be ready to discuss DST.
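
To make trap 6 concrete, the filter's placement decides whether unmatched left-side rows survive. A minimal sketch, assuming hypothetical users and purchases(id, user_id, status) tables:

-- Broken: the WHERE filter on the right table silently turns this into an inner join,
-- because unmatched users have p.status = NULL and NULL = 'completed' is never true
SELECT u.id, COUNT(p.id) AS completed_purchases
FROM users u
LEFT JOIN purchases p ON p.user_id = u.id
WHERE p.status = 'completed'
GROUP BY u.id;

-- Fixed: the condition lives in the ON clause, so users with no completed purchases
-- keep their row with a count of 0
SELECT u.id, COUNT(p.id) AS completed_purchases
FROM users u
LEFT JOIN purchases p
  ON  p.user_id = u.id
  AND p.status  = 'completed'
GROUP BY u.id;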

If you recognize only three or four of those as things you'd catch live, your next prep session should be the traps category — these are exactly the gotchas built into the explainers there.

How to practice these for real

Recognition is cheap; *reflex* is the goal. You want the mental move from "this is a top-N per group" to a written window-function CTE to happen in under 15 seconds on a whiteboard. Three concrete practice loops:

  1. Drill the patterns against realistic data. The caseSQL mission path at /path/marketing is set inside a simulated marketing team — every mission uses pattern 1, 2, 4, 5, 6, or 7 against a star schema with planted data-quality issues. Cheaper and stickier than abstract puzzles because the queries answer a business question.
  2. Drill the traps separately. The Interview Prep → Traps category has single-answer explainers for each of the eight gotchas above. Do them once a week until the answers are automatic.
  3. Rehearse narration out loud. The thing you're testing is whether you can *explain* the pattern choice, not whether you can type it. Open the prep guide for the hidden rubric interviewers actually grade on — narrating trade-offs is on the list.

Build these reflexes against real data.

caseSQL runs 100+ missions against a realistic star schema with planted data-quality issues. Free tier runs in your browser; upgrade to Pro for interview prompts.
