
SQL Performance for Analysts (Without Becoming a DBA)

The four query-performance moves that matter for analysts — reading EXPLAIN, leaning on indexes, materializing expensive subqueries, and spotting when a "slow" query is actually a grain bug in disguise.


The gap between "I wrote a query that works" and "I wrote a query that finishes in under a minute on production data" is wider than most SQL courses admit. You don’t need to become a DBA to close it. You need four moves — how to read EXPLAIN, when to lean on an index, when to pre-materialize a subquery, and how to notice when a query is slow because it’s computing the wrong thing against 10x the data.

This post is written for analysts on Postgres (Snowflake, BigQuery, and Redshift have their own wrinkles, but the mental model translates). Everything below is the 80% that pays for itself; the other 20% is rare and belongs to the platform team.

The analyst’s performance budget

Before any tactic, the posture. Queries have an implicit budget set by what they’re used for:

  • Ad-hoc exploration — 2 minutes is fine. Over 5 and you reach for LIMIT while you iterate.
  • Dashboard query — must complete under the dashboard’s refresh tolerance; typically < 10 seconds for live-load tiles, 60 seconds for scheduled refresh.
  • Scheduled job — anything under the schedule window. A query that takes 30 minutes but runs nightly at 2 AM is fine.
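One way to stay inside the ad-hoc budget while you iterate is to develop the query against a sample rather than the full table. This sketch uses the article's fact_purchases table; TABLESAMPLE is standard Postgres (9.5+), and the 1% rate is an arbitrary starting point:

```sql
-- Iterate on a ~1% sample instead of the full fact table, then drop
-- the sampling clause once the logic is right. SYSTEM sampling is
-- block-based and fast; BERNOULLI is row-based, more uniform, slower.
SELECT customer_id, SUM(amount) AS lifetime_value
FROM fact_purchases TABLESAMPLE SYSTEM (1)
GROUP BY customer_id
LIMIT 100;
```

Note that a bare LIMIT without sampling still scans and aggregates everything before truncating the output, so the sample is what actually buys back the time.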

Reading EXPLAIN without reading the whole manual

EXPLAIN shows the plan the database made for your query. The full output is intimidating; the analyst-useful parts are a small subset. Prefix any query with EXPLAIN to see the plan without running it, or EXPLAIN ANALYZE to run it and see actual timings.

EXPLAIN ANALYZE
SELECT c.customer_id, SUM(p.amount) AS lifetime_value
FROM dim_customers c
LEFT JOIN fact_purchases p USING (customer_id)
GROUP BY c.customer_id;

Three things to look for in the output, in order of impact:

  1. Seq Scan on a large table. A sequential scan reads every row. On a 100M-row fact table that’s slow. An Index Scan or Bitmap Heap Scan is usually faster. If you see Seq Scan on fact_purchases and the table is large, the query is probably missing an index on a filter or join column.
  2. Nested Loop on two large tables. A nested loop takes each row from one side and probes the other side for matches. Fine for small-vs-large, especially when the inner side is indexed; catastrophic for large-vs-large. Replacing it with a Hash Join or Merge Join is the optimizer’s job, but it needs accurate statistics to choose correctly (run ANALYZE on stale tables).
  3. Rows Removed by Filter. If EXPLAIN shows 10M rows scanned and 9.9M filtered out, the filter isn’t being served by an index. This is the #1 win for most analyst queries: filter on an indexed column, and filter inside the subquery rather than after the join.
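For orientation, case 1 and case 3 often show up together. A Seq Scan node with a poorly-served filter looks roughly like this in EXPLAIN ANALYZE output (column name, dates, and all numbers here are illustrative, not from a real run):

```
Seq Scan on fact_purchases  (cost=0.00..2041892.00 rows=520000 width=16)
                            (actual time=0.031..10482.614 rows=498231 loops=1)
  Filter: (purchased_at >= '2024-01-01'::date)
  Rows Removed by Filter: 99501769
Planning Time: 0.210 ms
Execution Time: 10691.333 ms
```

The tell is the ratio: ~100M rows read to keep ~500K. An index on the filter column turns this node into an Index Scan or Bitmap Heap Scan that reads only the keepers.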

When an index actually helps

An index helps a SELECT when the database can use it to avoid scanning the whole table. The two cases where an index is worth asking the platform team to create:

  • Filter on a high-cardinality column. WHERE customer_id = 12345 on a 100M-row table goes from seconds to milliseconds with a customer_id index. An index doesn’t help much with WHERE status = 'active' when 80% of rows are active — the database still reads most of the table either way.
  • JOIN predicate column. The foreign-key column on the "many" side of a join. If fact_purchases.customer_id is unindexed, the JOIN from dim_customers has to scan every purchase.

What DOESN’T get faster with an index: aggregations over everything (GROUP BY with no filter), LIKE '%...' patterns (leading wildcard bypasses the index), function calls on the indexed column (WHERE LOWER(email) = '...' can’t use an index on email; needs a functional index instead).
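Concretely, the two worthwhile requests plus the expression-index fix look like this (index names are illustrative; on a live production table the platform team will likely want CREATE INDEX CONCURRENTLY to avoid blocking writes):

```sql
-- JOIN predicate: the foreign key on the "many" side of the join
CREATE INDEX idx_fact_purchases_customer_id
    ON fact_purchases (customer_id);

-- A plain index on email can't serve WHERE LOWER(email) = '...',
-- but an expression (functional) index on LOWER(email) can
CREATE INDEX idx_dim_customers_email_lower
    ON dim_customers (LOWER(email));
```

The expression index only helps when the query's expression matches the indexed one exactly, so standardize on one form (LOWER, not UPPER) across the team's queries.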

Materialization: CTEs, temp tables, and cached views

Sometimes the same expensive subquery runs three times in one query. Materializing it — forcing the database to compute it once and reuse the result — can be the biggest single optimization available.

  • CTEs (WITH blocks) — in Postgres 12+, a CTE referenced exactly once is inlined by default (same as a subquery); one referenced multiple times, or marked MATERIALIZED, is computed once and reused. In older Postgres, CTEs are always materialized. Useful when the CTE filters aggressively and the main query joins against it multiple times.
  • Temp tables — CREATE TEMP TABLE x AS SELECT ... persists for the session. Worth it when one heavy subquery powers 3+ downstream analyses in the same notebook. Not worth it for a single query.
  • Materialized views — managed by the platform team. Right answer when a dashboard tile runs the same expensive aggregation every 5 minutes against source data that updates hourly.

-- Force materialization of an expensive CTE in Postgres 12+
WITH active_customers AS MATERIALIZED (
  SELECT customer_id, MAX(session_at) AS last_seen
  FROM fact_sessions
  WHERE session_at >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY customer_id
)
SELECT
  c.segment,
  COUNT(*)    AS active_count,
  AVG(p.amt)  AS avg_spend
FROM active_customers a
JOIN dim_customers c USING (customer_id)
LEFT JOIN (
  SELECT customer_id, SUM(amount) AS amt FROM fact_purchases
  GROUP BY customer_id
) p USING (customer_id)
GROUP BY c.segment;
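The temp-table route for the same aggregation looks like this — worth it when several later queries in the session reuse the result (a sketch against the article's schema; the ANALYZE is there because Postgres has no statistics on a freshly created table):

```sql
-- Compute the heavy 30-day rollup once per session
CREATE TEMP TABLE active_customers AS
SELECT customer_id, MAX(session_at) AS last_seen
FROM fact_sessions
WHERE session_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY customer_id;

ANALYZE active_customers;  -- give the planner statistics on the new table

-- Every later query in the session reads the small rollup
-- instead of re-scanning the full fact_sessions table.
SELECT COUNT(*) FROM active_customers
WHERE last_seen >= CURRENT_DATE - 7;
```

The temp table vanishes when the session ends, so nothing needs cleaning up; if the rollup should outlive the session and serve a dashboard, that's the materialized-view case above.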

Slow queries that are actually grain bugs

The single most common "performance problem" an analyst brings to their tech lead is not a performance problem. It’s a JOIN that’s fanning out into a cross product, producing 100x the expected rows, then slowly aggregating them back down. The query "runs slow" and returns the wrong number — both symptoms of the same underlying bug.

Reflex check before you tune anything: COUNT(*) on each table unjoined, COUNT(*) on the joined result. If the joined count is much larger than either unjoined count, the query isn’t slow — it’s wrong. Fixing the grain will speed it up AND produce correct numbers.
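The reflex check fits in one query. This sketch uses the article's two fact tables; the scalar subqueries keep all three counts in a single round-trip:

```sql
SELECT
  (SELECT COUNT(*) FROM fact_sessions)          AS sessions_rows,
  (SELECT COUNT(*) FROM fact_purchases)         AS purchases_rows,
  (SELECT COUNT(*)
   FROM fact_sessions s
   JOIN fact_purchases p USING (customer_id))   AS joined_rows;
-- If joined_rows dwarfs both inputs, fix the grain before tuning anything.
```

On a big fact table the two unjoined counts are cheap; the joined count is the expensive one, so add a WHERE clause on a date range if even the diagnostic is too slow to run whole.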

Exercise

You have a query joining dim_customers (1M rows) to fact_sessions (50M rows, ~50 per customer) to fact_purchases (5M rows, ~5 per customer), and it takes 12 minutes. Without running EXPLAIN, what’s the most likely performance root cause and what’s the fix?

Schema hint

Standard star schema, one customer → many sessions, one customer → many purchases.

Expected

A one-paragraph diagnosis naming the fanout and the roll-up fix.

Solution
Fanout: joining fact_sessions and fact_purchases at the customer grain
produces rows = sessions × purchases per customer = ~50 × 5 = 250 rows
per customer, so the intermediate relation is ~250M rows instead of 50M
or 5M. The fix is to roll up each fact to one row per customer BEFORE
joining them:

WITH s AS (SELECT customer_id, MAX(session_at) AS last_seen,
                  COUNT(*) AS session_count
           FROM fact_sessions GROUP BY customer_id),
     p AS (SELECT customer_id, SUM(amount) AS lifetime_value,
                  COUNT(*) AS purchase_count
           FROM fact_purchases GROUP BY customer_id)
SELECT c.*, s.last_seen, s.session_count, p.lifetime_value, p.purchase_count
FROM dim_customers c
LEFT JOIN s USING (customer_id)
LEFT JOIN p USING (customer_id);

Now each intermediate is 1M rows (one per customer), the final is 1M
rows, and the query runs in seconds.

Once you’ve internalized that most "slow queries" are grain bugs in disguise, the EXPLAIN-index-materialization tactics above will cover almost everything else. The combination is enough to make you the person on the team whose queries "just work" at production scale — without needing to become the DBA.

Build these reflexes against real data.

caseSQL runs 100+ missions against a realistic star schema with planted data-quality issues. Free tier runs in your browser; upgrade to Pro for interview prompts.
