Detect and fix grain-confusion fan-out when joining two fact tables at different grains
I'm pulling my hair out. Victoria wants a career-performance sheet for every active Sales rep: first_name, last_name, total career revenue (all years), and total target dollars assigned across every quarter we've tracked. I wrote what I thought was a simple query joining `fact_sales` to `fact_targets` on `employee_id` and summing revenue, but my top rep came back at $2.5M career revenue — the real number should be closer to $200K. Something is very wrong. Figure out what's happening and send me the correct version. 13 active Sales reps. Sort by total_revenue descending.
| Column | Type | Key |
|---|---|---|
| employee_id | INT | PK |
| first_name | TEXT | |
| last_name | TEXT | |
| TEXT | ||
| department | TEXT | |
| title | TEXT | |
| hire_date | TEXT | |
| salary | INT | |
| manager_id | INT | |
| is_active | INT |
| Column | Type | Key |
|---|---|---|
| sale_id | INT | PK |
| employee_id | INT | FK → dim_employees |
| product_id | INT | FK → dim_products |
| region_id | INT | FK → dim_regions |
| sale_date | TEXT | |
| quantity | INT | |
| revenue | REAL | |
| discount | REAL |
| Column | Type | Key |
|---|---|---|
| target_id | INT | PK |
| employee_id | INT | FK → dim_employees |
| quarter | INT | |
| year | INT | |
| target_amount | REAL | |
| actual_amount | REAL |
Each hint you reveal reduces the XP you can earn. Try the query first.
Run Marcus's naive approach first and compare against the ground truth: `SELECT SUM(revenue) FROM fact_sales` returns the company-wide total. Join `fact_sales` to `fact_targets` on `employee_id`, sum revenue, and the rep-level totals will be roughly 12x too big.
The root cause is a grain mismatch. `fact_sales` has a per-sale grain (~2000 rows); `fact_targets` has a per-(employee × quarter × year) grain (~168 rows, ~12 per rep). Joining on `employee_id` alone creates a Cartesian product: every sale row gets duplicated for each of that rep's target rows, so revenue is inflated by roughly the target-row count per rep.
Fix with pre-aggregation, not a bigger JOIN. Build a CTE that SUMs revenue per rep, a second CTE that SUMs target_amount per rep, then LEFT JOIN both to `dim_employees` filtered to `is_active = 1 AND department = 'Sales'`. Use COALESCE on any rep missing from either CTE.