Data Quality

Anti-Joins and the LEFT JOIN ... IS NULL Pattern

Finding what’s missing — the single most-used and most-often-written-wrong pattern in analyst SQL.

7 min read

The question "what’s on one side but not the other" is asked every day in analytics teams. Users who signed up but haven’t purchased. Campaigns that fired but never delivered. Accounts with open invoices but no recent activity. Each of those is an anti-join, and each of them is a query that juniors write wrong slightly-too-often.

This post is a deliberate short read on the one pattern: what it is, the three ways to write it, why two of the three are booby traps, and when the pattern shows up in real analyst work.

What an anti-join is, in plain terms

An anti-join returns rows from the left table that have NO matching row on the right table. "Give me every customer who has never purchased" is the canonical example. Contrast with an inner join ("give me every customer who HAS purchased") or a left join without a filter ("give me every customer, plus their purchases if any").

The word "anti-join" itself is database-engine terminology — PostgreSQL and SQL Server will physically plan it as a special join type internally — but the pattern in analyst SQL is written as a regular join plus a filter.

The three ways to write it — only one is safe

All three of these return the same result against clean data. Exactly one of them survives real data with NULLs in it.

Form 1: LEFT JOIN … IS NULL (the safe form)

SELECT c.*
FROM dim_customers c
LEFT JOIN fact_purchases p USING (customer_id)
WHERE p.customer_id IS NULL;

The LEFT JOIN keeps every customer; the WHERE filter on the right-side column IS NULL keeps only the rows that didn’t match. This form is correct regardless of NULLs anywhere in the data, is readable once you’ve seen it a few times, and is how every senior analyst writes the pattern. Memorize this form; use it reflexively.

Form 2: NOT EXISTS (also safe, often faster)

SELECT c.*
FROM dim_customers c
WHERE NOT EXISTS (
  SELECT 1
  FROM fact_purchases p
  WHERE p.customer_id = c.customer_id
);

NOT EXISTS is equivalently correct and often a better query plan on large tables. It’s the form you’ll see in production pipelines where the optimizer cost matters. Slightly harder to read than the LEFT JOIN form; pick whichever your team already uses.

Form 3: NOT IN (the unsafe form)

SELECT *
FROM dim_customers
WHERE customer_id NOT IN (
  SELECT customer_id FROM fact_purchases
);

Why NOT IN silently breaks on NULLs

SQL has three-valued logic: TRUE, FALSE, and UNKNOWN. Any comparison involving NULL returns UNKNOWN. NOT IN is logically equivalent to a chain of <> x AND <> y AND <> z; if any element in the subquery is NULL, the chain becomes ... AND <> NULL, which evaluates to UNKNOWN, which the WHERE clause treats as "not true," which drops the row from the result.

Concretely: if fact_purchases has 10,000 rows and ONE of them has customer_id IS NULL, the NOT IN query returns zero rows instead of the correct answer. Your dashboard is silently wrong. The LEFT JOIN … IS NULL form doesn’t suffer from this because it’s not using NOT IN — it’s checking whether a specific join produced a match.

The "safe" fix for NOT IN is to add WHERE customer_id IS NOT NULL to the subquery. That works. It’s also the third and fourth thing a tired analyst at 5 pm on a Thursday forgets to add. Better to never write the pattern at all.

When anti-joins show up in real analyst work

Once you have the pattern in your reflex, you’ll start noticing how often the business question IS an anti-join. A partial list from real analyst tickets:

  • "How many accounts haven’t logged in this quarter?"
  • "Which vendors are on the approved list but have zero purchases against them?"
  • "Which employees are flagged as managers but don’t have any direct reports?"
  • "Which campaigns were funded but never had creative assets uploaded?"
  • "Which claims exist but have no matching pre-authorization record?"
  • "Which support tickets were assigned but have no first response after 48 hours?"
Exercise

For a dim_customers + fact_purchases schema, return every customer whose MOST RECENT purchase was more than 90 days ago OR who has never purchased at all — the "dormant + never" set. This is a slightly harder variant of the pattern because "dormant" is a second filter on top of the anti-join.

Schema hint

dim_customers(customer_id, email, signup_date), fact_purchases(purchase_id, customer_id, amount, purchased_at).

Expected

One row per customer. Columns: customer_id, email, last_purchase_at (NULL for never-purchasers), dormant_status (“Never” or “90d+”).

Show solution
WITH last_purchase AS (
  SELECT customer_id, MAX(purchased_at) AS last_at
  FROM fact_purchases
  GROUP BY customer_id
)
SELECT
  c.customer_id,
  c.email,
  lp.last_at AS last_purchase_at,
  CASE
    WHEN lp.last_at IS NULL THEN 'Never'
    WHEN lp.last_at < CURRENT_DATE - INTERVAL '90 days' THEN '90d+'
  END AS dormant_status
FROM dim_customers c
LEFT JOIN last_purchase lp USING (customer_id)
WHERE lp.last_at IS NULL
   OR lp.last_at < CURRENT_DATE - INTERVAL '90 days';

Anti-joins are one of maybe five SQL patterns that show up in every analyst’s weekly work. Internalizing the LEFT JOIN … IS NULL form (or the NOT EXISTS equivalent) and never writing NOT IN against a nullable column will quietly eliminate an entire class of wrong-dashboard incidents from your career.

Build these reflexes against real data.

caseSQL runs 100+ missions against a realistic star schema with planted data-quality issues. Free tier runs in your browser; upgrade to Pro for interview prompts.

Keep reading