The difference between a senior analyst and a junior one is rarely SQL fluency. It’s the five-minute sweep the senior runs before sending a report — the reflex check that catches the NULL in a denominator, the duplicate customer row, the refund that snuck into gross revenue. Those are the bugs that make a stakeholder stop trusting you for a quarter. Shipping clean queries means catching them first.
This post is the five-check pre-flight audit. Each one takes under a minute. Each one has saved a senior analyst from a bad week multiple times.
The five-minute data-quality sweep
Before you send any number to a stakeholder, run these five in order:
- NULL audit on the join keys and the metric columns.
- Duplicate check on the supposedly-unique identifier in every dimension.
- Refund / reversal leakage — is anything negative or reversed hiding inside the aggregate?
- Date range — are you querying the window the stakeholder asked for?
- Timezone sanity — is "today" the same day for everyone in the data?
That’s it. No tool, no dbt tests, no data observability product. Just five queries you can type from memory in under five minutes total. The rest of this post is the muscle memory.
NULL audits that catch 80% of bugs
Every time you join two tables or aggregate a metric column, ask: "what does NULL mean here, and how many do I have?" Silent NULLs on a join key mean rows silently drop from the result. Silent NULLs in a SUM() mean they’re treated as zero — which might be what you want, or might underreport by 15%.
```sql
-- Quick null census for a given table. Adjust columns to taste.
SELECT
  COUNT(*) AS total_rows,
  COUNT(customer_id) AS customer_id_non_null,
  COUNT(*) - COUNT(customer_id) AS customer_id_null,
  COUNT(purchase_amount) AS amount_non_null,
  COUNT(*) - COUNT(purchase_amount) AS amount_null,
  COUNT(DISTINCT customer_id) AS distinct_customers
FROM fact_purchases;
```

COUNT(column) skips NULLs; COUNT(*) counts every row. The difference tells you exactly how many rows have a NULL in that column. Do this for every join key in every fact table you touch. If 2% of fact_sessions.customer_id is NULL, your "sessions per customer" number is low by about 2% after you join to dim_customers — and nobody told you.
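When the census does turn up NULL join keys, you can quantify exactly how many rows an inner join would silently drop by joining with LEFT JOIN and reusing the same COUNT(*) vs COUNT(column) trick. A sketch, assuming the fact_sessions and dim_customers tables mentioned above:

```sql
-- Compare the joined row count against the source: the gap is the NULL leak.
SELECT
  COUNT(*) AS all_sessions,
  COUNT(c.customer_id) AS sessions_with_customer,
  COUNT(*) - COUNT(c.customer_id) AS sessions_lost_to_inner_join
FROM fact_sessions s
LEFT JOIN dim_customers c ON c.customer_id = s.customer_id;
```

If `sessions_lost_to_inner_join` is nonzero, an inner join would have shrunk your denominator by exactly that many rows.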
Finding duplicates with ROW_NUMBER
Dimension tables are supposed to have one row per entity. In practice, they routinely don’t — merge conflicts, botched ETL reruns, customers who signed up twice with different capitalizations. A duplicate in a dimension silently doubles every aggregate that joins to it.
The senior-level dedup check isn’t SELECT COUNT(DISTINCT id) FROM table — that only catches exact ID duplicates. The real check uses ROW_NUMBER to find near-duplicates on business-key columns:
```sql
SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY LOWER(email), LOWER(first_name || ' ' || last_name)
      ORDER BY signed_up_at ASC
    ) AS dup_rank
  FROM dim_customers
) x
WHERE dup_rank > 1;
```

That finds every row whose (email, full name) combination is a second, third, or fourth occurrence. If the query returns 0 rows, the dimension is clean. If it returns a handful, escalate to the data team (and deduplicate in your own query before joining).
Exercise: find how many duplicate customer rows exist in dim_customers by matching on lowered email. Return one row per duplicate email showing how many copies exist.

Schema: dim_customers(customer_id, email, first_name, last_name, signed_up_at)

Expected result: rows where the same lowered email appears more than once. Columns: email_lower, n_copies. An empty result means the dim is clean.

Solution:

```sql
SELECT
  LOWER(email) AS email_lower,
  COUNT(*) AS n_copies
FROM dim_customers
WHERE email IS NOT NULL
GROUP BY LOWER(email)
HAVING COUNT(*) > 1
ORDER BY n_copies DESC;
```

Refund and reversal leakage patterns
This is the one that burns people. Most fact tables include both positive transactions (purchases) and negative ones (refunds, chargebacks, reversals). A naive SUM(purchase_amount) looks right — until you realize it already nets refunds out, which is NOT what the stakeholder who asked for "total revenue" meant. Or the other way: a table where refunds live as separate rows with a type = 'refund' flag, and the naive sum overcounts revenue by ignoring refunds entirely.
```sql
-- Audit: how much negative/reversal value is hiding in the aggregate?
SELECT
  SUM(CASE WHEN purchase_amount > 0 THEN purchase_amount ELSE 0 END) AS gross_positive,
  SUM(CASE WHEN purchase_amount < 0 THEN purchase_amount ELSE 0 END) AS gross_negative,
  SUM(purchase_amount) AS net,
  COUNT(CASE WHEN purchase_amount < 0 THEN 1 END) AS refund_row_count,
  COUNT(*) AS total_rows
FROM fact_purchases
WHERE purchased_at >= DATE('now', '-30 days');
```

That one query tells you: how much gross positive revenue there was, how much negative (refunds) is netting in, and how many refund rows exist. If the stakeholder wanted gross revenue, use the first. If they wanted net, use the third. Always ask which one they meant — it’s the single clarifying question that saves the most report rewrites.
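The query above assumes refunds are stored as negative amounts. For the other schema described earlier, where refunds live as separate rows with a `type = 'refund'` flag, the same audit pivots on the flag instead of the sign. A sketch; the `'purchase'` value is an assumption about how non-refund rows are tagged:

```sql
-- Same audit for the flag-style schema: refunds are separate positive rows.
SELECT
  SUM(CASE WHEN type = 'purchase' THEN purchase_amount ELSE 0 END) AS gross_revenue,
  SUM(CASE WHEN type = 'refund' THEN purchase_amount ELSE 0 END) AS refunds,
  SUM(CASE WHEN type = 'purchase' THEN purchase_amount ELSE 0 END)
    - SUM(CASE WHEN type = 'refund' THEN purchase_amount ELSE 0 END) AS net_revenue
FROM fact_purchases
WHERE purchased_at >= DATE('now', '-30 days');
```

Run `SELECT DISTINCT type FROM fact_purchases` first so you know every tag that exists; a third value like 'chargeback' silently falling into neither bucket is its own leakage bug.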
Exercise: a campaign-performance query sums purchase_amount grouped by campaign, and you need to confirm refunds aren’t silently netting out the total. Write a query that shows each campaign’s gross revenue and its refunded amount separately, for the last 30 days.

Schema: fact_purchases(campaign_id, purchase_amount, purchased_at), dim_campaigns(id, name)

Expected result: one row per campaign. Columns: campaign_name, gross_revenue, refunds, net_revenue, ordered by gross_revenue descending.

Solution:

```sql
SELECT
  c.name AS campaign_name,
  SUM(CASE WHEN p.purchase_amount > 0 THEN p.purchase_amount ELSE 0 END) AS gross_revenue,
  ABS(SUM(CASE WHEN p.purchase_amount < 0 THEN p.purchase_amount ELSE 0 END)) AS refunds,
  SUM(p.purchase_amount) AS net_revenue
FROM dim_campaigns c
JOIN fact_purchases p ON p.campaign_id = c.id
WHERE p.purchased_at >= DATE('now', '-30 days')
GROUP BY c.name
ORDER BY gross_revenue DESC;
```

Date-range and timezone sanity
The last two checks are small but catch a disproportionate number of bugs. First: what date range does the query actually cover? Stakeholders who say "last 30 days" may mean the last 30 calendar days, the last full month, or month-to-date, which is rarely a literal CURRENT_DATE - 30. Ask.
```sql
SELECT
  MIN(purchased_at) AS first_row,
  MAX(purchased_at) AS last_row,
  COUNT(*) AS rows_in_range,
  COUNT(DISTINCT DATE(purchased_at)) AS distinct_days
FROM fact_purchases
WHERE purchased_at >= DATE('now', '-30 days');
```

Second: timezones. Is purchased_at stored in UTC or the user’s local timezone? Does "today’s revenue" mean today-in-UTC or today-in-user-local? If the answer is "it depends," you’re going to produce a chart that disagrees with Finance by ~4% on any given day. Many warehouses store both — the raw UTC timestamp and a pre-converted local one — so check which column you’re filtering on before you ship.
None of these checks requires a data observability tool, a dbt test, or a production pipeline. They’re all just SQL, run ad-hoc, by a human who has seen enough bad reports to know what to look for. That’s the trait companies hire senior analysts for — and it’s the trait that caseSQL’s intentionally-messy schemas are designed to train.