
How to Read a Messy Database Like a Senior Analyst

Every production database is a crime scene. Here’s the senior-analyst workflow for reading it: what to profile, what to distrust, and how to sanity-check before you report a number.


Senior analysts don’t have a magical SQL skill juniors lack. They have a specific workflow they run before they trust a single number the database returns. The workflow comes from being burned — every senior analyst can tell you about the quarterly review where their number was off by 30% because of a silent join. This post is that workflow, distilled. Five steps, in order, with exercises.

Databases are messy because they’re alive

A database is not a dataset. A dataset is a snapshot: frozen in time, reasoned about as a static artifact, teachable by a course. A database is a living system: being written to by ten different services, backfilled by Tuesday’s cron, shaped by three product managers’ choices over five years, with a schema that still carries scars from the 2021 migration that went halfway.

When you query a database, you’re asking a question of a system that has layers of sediment: the original schema, the first emergency hotfix, the second schema migration that nobody named, the columns added for a feature that shipped and was deprecated, the NULLs that mean "not recorded" in one table and "not applicable" in another. Every one of those layers is a potential lie.

Senior analysts know this in their bones. They treat every new table as hostile until proven otherwise. That’s not paranoia — it’s correctness.

Start with the data dictionary (or the lack of one)

The first thing a senior analyst does in a new database is not write a query. It’s look for the data dictionary — the document that says what each table and column means, who owns it, and what the known gotchas are.

About 30% of companies have a real, maintained data dictionary. Another 30% have a stale one in a wiki page that was last updated in 2022. The remaining 40% have nothing — you ask the person who built the schema, if they still work there, or you reverse-engineer it from dbt models, or you piece it together from SQL written by the analyst who was there before you.

Whichever of those three situations you’re in, step 1 is to write your own. It doesn’t have to be elaborate. For every table you’re going to query, spend five minutes writing: what each column means in one sentence, what the grain is (one row per what?), and which joins are expected. Do this in a schema.md file in your work repo. Six months from now, future-you will thank present-you.
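What does a five-minute entry look like? Here’s a sketch for one of the fact tables used later in this post — the gotchas are illustrative placeholders to show the shape, not real findings about any schema:

## fact_sends
Grain: one row per email send.
Expected joins: customer_id -> dim_customers.id, campaign_id -> dim_campaigns.id.
Columns:
  sent_at: when the send left the queue. Gotcha to verify: UTC or local time?
  campaign_id: FK to dim_campaigns. Gotcha to verify: NULLs in pre-migration rows?

The "gotcha to verify" lines are the important part: they record the questions you haven’t answered yet, so you don’t silently assume an answer later.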

Exercise

Explore phase: you’ve just been handed a new database. Write the exploration queries that tell you (a) which tables exist, (b) the rowcount of each, and (c) for the three biggest tables, the date range of their timestamp columns.

Schema hint

In SQLite: sqlite_master lists tables. In Postgres: information_schema.tables. The pattern is the same — query the catalog, then loop through the top-N.

Expected

A short sequence of queries you’d run one after another. Output: table list, rowcounts, and date-range snapshots.

Show solution
-- (a) List tables.
SELECT name
FROM sqlite_master
WHERE type = 'table'
ORDER BY name;

-- (b) Rowcount each table (run per-table; macro'd in practice).
--     "row_count" rather than "rows": ROWS is a reserved word in some dialects.
SELECT 'dim_customers' AS table_name, COUNT(*) AS row_count FROM dim_customers
UNION ALL SELECT 'dim_campaigns',  COUNT(*) FROM dim_campaigns
UNION ALL SELECT 'fact_sends',     COUNT(*) FROM fact_sends
UNION ALL SELECT 'fact_sessions',  COUNT(*) FROM fact_sessions
UNION ALL SELECT 'fact_purchases', COUNT(*) FROM fact_purchases;

-- (c) Date range of the largest fact tables — shows you the time window
--     you're actually working with (often narrower than stakeholders think).
SELECT 'fact_sends'     AS t, MIN(sent_at)      AS min_ts, MAX(sent_at)      AS max_ts FROM fact_sends
UNION ALL
SELECT 'fact_purchases',     MIN(purchased_at),           MAX(purchased_at)          FROM fact_purchases
UNION ALL
SELECT 'fact_sessions',      MIN(session_at),             MAX(session_at)            FROM fact_sessions;

This takes five minutes and surfaces three things almost every time: a table that’s much bigger than you expected (hint: probably double-writes), a time range that doesn’t go back as far as you thought (hint: the backfill stopped in 2023), and a table you didn’t know existed (hint: it’s the one you actually need).
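If you’re on Postgres rather than SQLite, step (a) reads from the standard catalog instead. A sketch, assuming your tables live in the default public schema:

-- (a) List tables, Postgres flavor.
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public'
  AND table_type = 'BASE TABLE'
ORDER BY table_name;

Same pattern either way: query the catalog first, then loop through the top-N tables by hand.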

Profile: NULLs, duplicates, inconsistencies

Once you know what tables exist, profile the ones you’ll actually query. Profiling is the five-minute cousin of a full data-quality audit. You’re not trying to be thorough — you’re trying to surface the obvious lies before they bite you.

Three profile checks, in order:

  1. NULL rate per column — what percentage of each column is NULL? A column that’s 40% NULL is telling you something specific: collection started partway through the table’s history, or the field only applies to one segment, or a pipeline quietly dropped it. Tutorial datasets never prepare you for any of those.
  2. Distinct-value count per categorical column — does the "channel" column have 5 values or 500? (500 means someone let free-text into it.)
  3. Duplicate check on the supposed primary key — is the id column actually unique, or has someone quietly violated the constraint?

Exercise

Profile the dim_customers table: for each non-id column, report the NULL count, the distinct-value count, and (for the email column specifically) check whether it’s actually unique.

Schema hint

dim_customers(id, email, first_name, last_name, country, signup_date). The email field is supposed to be unique but isn’t always.

Expected

A wide row with NULL counts and distinct counts for each column, plus a separate query that returns the emails that appear more than once.

Show solution
-- Column profile.
SELECT
  COUNT(*)                                  AS total_rows,
  SUM(CASE WHEN email        IS NULL THEN 1 ELSE 0 END) AS null_email,
  SUM(CASE WHEN first_name   IS NULL THEN 1 ELSE 0 END) AS null_first_name,
  SUM(CASE WHEN country      IS NULL THEN 1 ELSE 0 END) AS null_country,
  SUM(CASE WHEN signup_date  IS NULL THEN 1 ELSE 0 END) AS null_signup,
  COUNT(DISTINCT country)                  AS distinct_country,
  COUNT(DISTINCT email)                    AS distinct_email
FROM dim_customers;

-- Duplicate-email check.
SELECT
  LOWER(TRIM(email)) AS normalized_email,
  COUNT(*)           AS copies
FROM dim_customers
WHERE email IS NOT NULL
GROUP BY normalized_email
HAVING COUNT(*) > 1
ORDER BY copies DESC;

Notice the LOWER(TRIM(...)) in the duplicate check. Email case-sensitivity and leading-whitespace bugs are the two ways duplicate emails most commonly slip past the uniqueness check. Never trust COUNT(DISTINCT email) on a raw column; always normalize first.
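Once duplicates surface, the usual follow-up is to pick one canonical row per normalized email — say, the most recent signup. A sketch using a window function (works in SQLite 3.25+ and Postgres; column names follow the schema hint above; the "newest wins" tiebreak is a choice, not a rule):

-- Keep the newest row per normalized email.
WITH ranked AS (
  SELECT
    id,
    LOWER(TRIM(email)) AS normalized_email,
    ROW_NUMBER() OVER (
      PARTITION BY LOWER(TRIM(email))
      ORDER BY signup_date DESC, id DESC   -- newest signup wins; id breaks ties
    ) AS rn
  FROM dim_customers
  WHERE email IS NOT NULL
)
SELECT id, normalized_email
FROM ranked
WHERE rn = 1;

Whatever tiebreak you choose, write it down — it’s exactly the kind of decision that belongs in your schema.md.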

Reconcile data vs folklore

Every team has folklore — numbers that circulate in slide decks and all-hands meetings, numbers everyone "just knows," numbers that were true in 2023 and have been repeated ever since. Part of the senior-analyst job is reconciling what the data says with what the folklore says.

The reconciliation itself is the skill. The folklore usually has a kernel of truth, which means the answer isn’t "the data is right, the folklore is wrong." It’s "the folklore was right under these assumptions; under current assumptions, the number is different; here’s why." That’s a useful answer. "Your number is wrong" is not a useful answer.

Concrete move: when a stakeholder cites a number you can’t reproduce from the database, your first question is "what definition was this using?" You’ll usually find that the 2023 version filtered differently, used a different time window, or excluded a segment the current number includes. Half the value of a senior analyst is being the person who writes the new definition down so nobody has to re-litigate it.
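One lightweight way to write the new definition down is to encode it as a view, so the next analyst inherits the filters instead of guessing at them. A sketch — the view name, window, and the is_test_order exclusion are hypothetical, stand-ins for whatever your reconciliation actually turned up:

-- Revenue, current definition: trailing 30 days, test orders excluded.
-- The older folklore number had no test-order filter; that's the delta.
CREATE VIEW revenue_last_30_current AS
SELECT SUM(purchase_amount) AS revenue
FROM fact_purchases
WHERE purchased_at >= DATE('now', '-30 days')
  AND is_test_order = 0;  -- hypothetical flag; substitute your own exclusion

The comment is doing as much work as the SQL: it explains why this number differs from the one in the old slide deck.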

A working mental model: assume the data lies

Tie it together with one mental model you can drill into reflex: assume the data lies until proven honest. Every query result is guilty until proven innocent.

In practice this means: before you hand a number to anyone, you’ve done three things. You’ve checked the scale (does the order of magnitude match a rough back-of-envelope?). You’ve checked the delta (is the change from last period believable without an external event?). You’ve checked the top row (does the biggest contributor make sense?). All three take 90 seconds. All three catch the vast majority of query bugs.

Exercise

Sanity-check pass: take any revenue number you’d report (say, last 30 days revenue). Write three small queries that together stress-test it: a scale check, a delta check against the prior 30, and a top-5-contributors check that you can eyeball.

Schema hint

All three run against fact_purchases. The goal is to produce three small results you can review in 30 seconds each.

Expected

Three result sets: headline number, delta-to-prior-period, top-5 contributors by some grouping (campaign, channel, or source).

Show solution
-- 1. Scale check — the headline number.
SELECT ROUND(SUM(purchase_amount), 2) AS revenue_last_30
FROM fact_purchases
WHERE purchased_at >= DATE('now', '-30 days');

-- 2. Delta check — last 30 vs prior 30.
--    Note the half-open prior window: BETWEEN is inclusive on both ends,
--    so it would count a purchase landing exactly on the -30-day boundary
--    in both periods.
SELECT
  SUM(CASE WHEN purchased_at >= DATE('now', '-30 days')
           THEN purchase_amount ELSE 0 END) AS last_30,
  SUM(CASE WHEN purchased_at >= DATE('now', '-60 days')
            AND purchased_at <  DATE('now', '-30 days')
           THEN purchase_amount ELSE 0 END) AS prev_30,
  ROUND(
    (SUM(CASE WHEN purchased_at >= DATE('now', '-30 days')
              THEN purchase_amount ELSE 0 END)
     -
     SUM(CASE WHEN purchased_at >= DATE('now', '-60 days')
               AND purchased_at <  DATE('now', '-30 days')
              THEN purchase_amount ELSE 0 END))
    * 100.0
    / NULLIF(SUM(CASE WHEN purchased_at >= DATE('now', '-60 days')
                       AND purchased_at <  DATE('now', '-30 days')
                      THEN purchase_amount ELSE 0 END), 0),
    1
  ) AS pct_change
FROM fact_purchases;

-- 3. Top-5 contributors — campaigns driving the 30-day revenue.
SELECT c.name, ROUND(SUM(p.purchase_amount), 2) AS revenue
FROM fact_purchases p
JOIN dim_campaigns c ON c.id = p.campaign_id
WHERE p.purchased_at >= DATE('now', '-30 days')
GROUP BY c.name
ORDER BY revenue DESC
LIMIT 5;

The first time you use this pass, it catches a bug. You find a join fanning out, or a filter that was tighter than you thought, or a campaign name that broke out of its category. After the hundredth time, it’s muscle memory: every headline number gets the three-check pass before it leaves your laptop. That’s the workflow. That’s what "senior analyst" looks like when you zoom all the way in.

The bug taxonomy you’ll see most

After a year or two on any production database, you’ll develop a mental taxonomy of the bugs that come up again and again. The categories differ by org, but the shape is always similar:

  • Silent join multiplication — a one-to-many join where you expected one-to-one, quietly doubling your aggregate. Usually surfaces as "this number is 2x what I expected" and is caught by the scale check above.
  • Backfill seams — a day or week where the numbers look wrong because a pipeline failed, was rerun manually, and double-counted some rows. Surfaces as a weird spike on a specific date.
  • Semantic drift — the meaning of a column changed when a product team shipped a feature, but nobody updated the analytics. "is_active" used to mean "logged in in the last 30 days"; now it means "subscription status is not canceled."
  • Timezone ghosts — a "daily" report that’s actually UTC-day running against data that’s local-time, so certain customers bleed across day boundaries and the weekend looks different from the weekday for no real reason.

Every one of these has the same fix — you catch it by running the sanity pass, then you write a CTE or a filter that documents the guard explicitly at the top of your query, with a comment explaining why. Six months later, when another analyst copies your query, the guard carries forward. That’s how institutional knowledge gets encoded in a database team.
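For the first category, a guard can be as small as two queries: prove the "one" side of the join really is one-row-per-key, then run the aggregate. A sketch, using the same tables as the exercises above:

-- Guard: dim_campaigns.id must be unique, or the join below fans out
-- and silently inflates SUM(purchase_amount). If this returns rows, stop.
SELECT id, COUNT(*) AS copies
FROM dim_campaigns
GROUP BY id
HAVING COUNT(*) > 1;

-- Only once the guard comes back empty is this aggregate trustworthy.
SELECT c.name, SUM(p.purchase_amount) AS revenue
FROM fact_purchases p
JOIN dim_campaigns c ON c.id = p.campaign_id
GROUP BY c.name;

The guard query costs seconds to run and, left at the top of the file with its comment, keeps protecting whoever copies the query next.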

Build these reflexes against real data.

caseSQL runs 100+ missions against a realistic star schema with planted data-quality issues. Free, in-browser, no account needed.
