The SQL Data-Quality Checks Senior Analysts Run Before Every Report

The 30-second sweep that catches 80% of "why does this number look weird" bugs before they reach a stakeholder — NULL audits, dedup checks, refund leakage, timezone sanity.

The difference between a senior analyst and a junior one is rarely SQL fluency. It’s the five-minute sweep the senior runs before sending a report — the reflex check that catches the NULL in a denominator, the duplicate customer row, the refund that snuck into gross revenue. Those are the bugs that make a stakeholder stop trusting you for a quarter. Shipping clean queries means catching them first.

This post is the five-check pre-flight audit. Each one takes under a minute. Each one has saved a senior analyst from a bad week multiple times.

The 30-second data-quality sweep

Before you send any number to a stakeholder, run these five in order:

  1. NULL audit on the join keys and the metric columns.
  2. Duplicate check on the supposedly-unique identifier in every dimension.
  3. Refund / reversal leakage — is anything negative or reversed hiding inside the aggregate?
  4. Date range — are you querying the window the stakeholder asked for?
  5. Timezone sanity — is "today" the same day for everyone in the data?

That’s it. No tool, no dbt tests, no data observability product. Just five queries you can type from memory in under five minutes total. The rest of this post is the muscle memory.

NULL audits that catch 80% of bugs

Every time you join two tables or aggregate a metric column, ask: "what does NULL mean here, and how many do I have?" Silent NULLs on a join key mean rows silently drop from the result. Silent NULLs in a SUM() mean they’re treated as zero — which might be what you want, or might underreport by 15%.

-- Quick null census for a given table. Adjust columns to taste.
SELECT
  COUNT(*)                                   AS total_rows,
  COUNT(customer_id)                         AS customer_id_non_null,
  COUNT(*) - COUNT(customer_id)              AS customer_id_null,
  COUNT(purchase_amount)                     AS amount_non_null,
  COUNT(*) - COUNT(purchase_amount)          AS amount_null,
  COUNT(DISTINCT customer_id)                AS distinct_customers
FROM fact_purchases;

COUNT(column) skips NULLs; COUNT(*) counts every row. The difference tells you exactly how many rows have a NULL in that column. Do this for every join key in every fact table you touch. If 2% of fact_sessions.customer_id is NULL, your "sessions per customer" number is low by about 2% after you join to dim_customers — and nobody told you.
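To see the damage before it happens, count the fact rows that can never match the dimension. A minimal sketch, assuming the fact_sessions / dim_customers tables from the example above:

-- How many session rows will silently drop on the customer join?
SELECT
  COUNT(*) AS orphaned_sessions
FROM fact_sessions s
LEFT JOIN dim_customers c
  ON c.customer_id = s.customer_id
WHERE s.customer_id IS NULL   -- NULL key: can never match
   OR c.customer_id IS NULL;  -- key present, but no dimension row

If that count isn't zero, decide explicitly whether those rows belong in the metric before you switch to an inner join.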

Finding duplicates with ROW_NUMBER

Dimension tables are supposed to have one row per entity. In practice, they routinely don’t — merge conflicts, botched ETL reruns, customers who signed up twice with different capitalizations. A duplicate in a dimension silently doubles every aggregate that joins to it.

The senior-level dedup check isn’t SELECT COUNT(DISTINCT id) FROM table — that only catches exact ID duplicates. The real check uses ROW_NUMBER to find near-duplicates on business-key columns:

-- Rank rows within each (email, full name) business key; rank > 1 = duplicate.
SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY LOWER(email), LOWER(first_name || ' ' || last_name)
      ORDER BY signed_up_at ASC
    ) AS dup_rank
  FROM dim_customers
) x
WHERE dup_rank > 1;

That finds every row whose (email, full name) combination is a second, third, fourth occurrence. If the query returns 0 rows, the dimension is clean. If it returns a handful, escalate to the data team (and deduplicate in your own query before joining).
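Deduplicating on your side is the same window function with the filter flipped. A sketch that keeps the earliest row per lowered email; adjust the partition to whatever business key your join needs:

-- Keep one row per lowered email (the earliest signup) before joining.
WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY LOWER(email)
      ORDER BY signed_up_at ASC
    ) AS dup_rank
  FROM dim_customers
)
SELECT *
FROM ranked
WHERE dup_rank = 1;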

Exercise

Find how many duplicate customer rows exist in dim_customers by matching on lowered email. Return one row per duplicate email showing how many copies exist.

Schema hint

dim_customers(customer_id, email, first_name, last_name, signed_up_at)

Expected

Rows where the same lowered email appears more than once. Columns: email_lower, n_copies. Empty result means the dim is clean.

Solution
SELECT
  LOWER(email) AS email_lower,
  COUNT(*)     AS n_copies
FROM dim_customers
WHERE email IS NOT NULL
GROUP BY LOWER(email)
HAVING COUNT(*) > 1
ORDER BY n_copies DESC;

Refund and reversal leakage patterns

This is the one that burns people. Most fact tables include both positive transactions (purchases) and negative ones (refunds, chargebacks, reversals). A naive SUM(purchase_amount) looks right — until you realize it already nets refunds out, which is NOT what the stakeholder who asked for "total revenue" meant. Or the reverse: refunds live as separate rows flagged type = 'refund', and a naive sum that ignores the flag never subtracts them, so it overstates net revenue.

-- Audit: how much negative/reversal value is hiding in the aggregate?
SELECT
  SUM(CASE WHEN purchase_amount > 0 THEN purchase_amount ELSE 0 END) AS gross_positive,
  SUM(CASE WHEN purchase_amount < 0 THEN purchase_amount ELSE 0 END) AS gross_negative,
  SUM(purchase_amount)                                               AS net,
  COUNT(CASE WHEN purchase_amount < 0 THEN 1 END)                    AS refund_row_count,
  COUNT(*)                                                           AS total_rows
FROM fact_purchases
WHERE purchased_at >= DATE('now', '-30 days');

That one query tells you: how much gross positive revenue there was, how much negative (refunds) is netting in, and how many refund rows exist. If the stakeholder wanted gross revenue, use the first. If they wanted net, use the third. Always ask which one they meant — it’s the single clarifying question that saves the most report rewrites.
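If your warehouse uses the flag pattern instead of negative amounts, the same audit keys off the flag. A sketch assuming a hypothetical type column holding 'purchase' or 'refund':

-- Flag-based variant: assumes a hypothetical type column, not negative amounts.
SELECT
  SUM(CASE WHEN type = 'purchase' THEN purchase_amount ELSE 0 END) AS gross_revenue,
  SUM(CASE WHEN type = 'refund'   THEN purchase_amount ELSE 0 END) AS refunds,
  SUM(CASE WHEN type = 'purchase' THEN purchase_amount ELSE 0 END)
    - SUM(CASE WHEN type = 'refund' THEN purchase_amount ELSE 0 END) AS net_revenue
FROM fact_purchases
WHERE purchased_at >= DATE('now', '-30 days');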

Exercise

Imagine a campaign-performance query that sums purchase_amount grouped by campaign, and you need to confirm refunds aren't silently netting out the totals. Write a query that shows each campaign's gross revenue and its refunded amount separately, last 30 days.

Schema hint

fact_purchases(campaign_id, purchase_amount, purchased_at), dim_campaigns(id, name)

Expected

One row per campaign. Columns: campaign_name, gross_revenue, refunds, net_revenue. Ordered by gross_revenue desc.

Solution
SELECT
  c.name AS campaign_name,
  SUM(CASE WHEN p.purchase_amount > 0 THEN p.purchase_amount ELSE 0 END) AS gross_revenue,
  ABS(SUM(CASE WHEN p.purchase_amount < 0 THEN p.purchase_amount ELSE 0 END)) AS refunds,
  SUM(p.purchase_amount) AS net_revenue
FROM dim_campaigns c
JOIN fact_purchases p ON p.campaign_id = c.id
WHERE p.purchased_at >= DATE('now', '-30 days')
GROUP BY c.name
ORDER BY gross_revenue DESC;

Date-range and timezone sanity

The last two checks are small but catch a disproportionate number of bugs. First: what date range does the query actually cover? Stakeholders say "last 30 days" and often mean something different from CURRENT_DATE - 30: a calendar month, the last 30 complete days, or a trailing window that includes a partial today. Ask.

-- Audit: what window does this query actually cover?
SELECT
  MIN(purchased_at) AS first_row,
  MAX(purchased_at) AS last_row,
  COUNT(*)          AS rows_in_range,
  COUNT(DISTINCT DATE(purchased_at)) AS distinct_days
FROM fact_purchases
WHERE purchased_at >= DATE('now', '-30 days');

Second: timezones. Is purchased_at stored in UTC or the user’s local timezone? Does "today’s revenue" mean today-in-UTC or today-in-user-local? If the answer is "it depends," you’re going to produce a chart that disagrees with Finance by ~4% on any given day. Many warehouses store both — the raw UTC timestamp and a pre-converted local one — so check which column you’re filtering on before you ship.
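A quick way to measure the exposure: count how many rows land on a different calendar day depending on which timestamp you use. A sketch assuming a hypothetical purchased_at_local column alongside the UTC purchased_at:

-- How often does the UTC day disagree with the local day?
SELECT
  COUNT(*) AS total_rows,
  COUNT(CASE WHEN DATE(purchased_at) <> DATE(purchased_at_local) THEN 1 END)
    AS rows_on_a_different_day
FROM fact_purchases
WHERE purchased_at >= DATE('now', '-7 days');

That share is roughly the amount by which a "daily revenue" chart filtered on one column can disagree with one filtered on the other.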

None of these checks requires a data observability tool, a dbt test, or a production pipeline. They’re all just SQL, run ad-hoc, by a human who has seen enough bad reports to know what to look for. That’s the trait companies hire senior analysts for — and it’s the trait that caseSQL’s intentionally-messy schemas are designed to train.

Build these reflexes against real data.

caseSQL runs 100+ missions against a realistic star schema with planted data-quality issues. Free tier runs in your browser; upgrade to Pro for interview prompts.
