Healthcare analytics has the same SQL syntax as every other analyst domain and almost none of the same habits. The data shape is different (encounters, not events; members, not customers), the definitions are regulated (CMS specs, HEDIS rules), and the mistakes are costly in ways a marketing analyst never experiences (HIPAA audits, CMS quality-report rework, misidentified patients). This post is the working shape of the role — the five query patterns a healthcare analyst writes every week.
Examples use a simplified schema: dim_patients, dim_providers, dim_departments, fact_encounters (admit/discharge records), fact_quality_metrics (observations on encounters). This is close to the caseSQL provider path schema; if you’ve done those missions, the shape will be familiar.
Why healthcare SQL is its own discipline
Three things separate healthcare analytics from consumer / product analytics:
- Regulated definitions. "30-day readmission rate" isn’t whatever the analyst thinks; it’s a CMS-specified measure with inclusion/exclusion rules. Getting the definition wrong isn’t "wrong number" — it’s "rework the CMS report."
- Member identity is fragile. Patients can appear under different IDs across systems (EHR, claims, registration), and matching rules are imperfect. The dedup logic that a marketing analyst uses on emails doesn’t translate.
- PHI. Protected Health Information. Exporting a CSV with 10 rows of patient names to your laptop for "quick analysis" is a fireable offense in most organizations and a regulated breach under HIPAA.
The HEDIS measure shape
HEDIS (Healthcare Effectiveness Data and Information Set) is the quality-measurement standard payers report to NCQA. Every measure has the same shape: a denominator (the population eligible for the measure) and a numerator (the subset that received the appropriate care). The rate is numerator / denominator.
-- HEDIS-style: Breast cancer screening rate for women 50-74 in 2025
WITH denominator AS (
-- Women aged 50-74 with continuous enrollment during the measurement year
SELECT patient_id
FROM dim_patients
WHERE gender = 'F'
AND EXTRACT(YEAR FROM AGE(CURRENT_DATE, dob)) BETWEEN 50 AND 74
AND enrollment_start <= '2024-01-01'
AND enrollment_end >= '2025-12-31'
),
numerator AS (
-- The subset of the denominator who had a mammogram in 2024 or 2025
SELECT DISTINCT d.patient_id
FROM denominator d
JOIN fact_encounters e USING (patient_id)
WHERE e.cpt_code IN ('77057', '77063', '77067')
AND e.service_date BETWEEN '2024-01-01' AND '2025-12-31'
)
SELECT
(SELECT COUNT(*) FROM denominator) AS denom,
(SELECT COUNT(*) FROM numerator) AS numer,
(SELECT COUNT(*) FROM numerator) * 100.0 /
NULLIF((SELECT COUNT(*) FROM denominator), 0) AS screening_rate_pct;30-day readmissions — the definition actually matters
Readmission rate is the most-reported and most-frequently-miscomputed healthcare metric. "Patient discharged, came back within 30 days" sounds simple; the real CMS definition has:
- Index admission type matters. Planned readmissions (scheduled chemotherapy) aren’t counted. Certain transfer patterns are excluded.
- Unit matters. Is it readmissions per patient, or per index admission? The same patient with three readmissions in a year counts three times, not once.
- Denominator matters. Only specific index-admission diagnoses (heart failure, AMI, pneumonia, COPD, CABG, THA/TKA) count for the CMS Hospital Readmissions Reduction Program.
-- All-cause 30-day readmission rate, self-join on the encounters table
WITH index_admissions AS (
SELECT
encounter_id AS index_encounter_id,
patient_id,
discharge_date
FROM fact_encounters
WHERE encounter_type = 'inpatient'
AND discharged_alive = true
AND discharge_date BETWEEN '2025-01-01' AND '2025-11-30'
),
readmits AS (
SELECT
i.index_encounter_id,
i.patient_id,
MIN(r.admit_date) AS readmit_date
FROM index_admissions i
JOIN fact_encounters r
ON r.patient_id = i.patient_id
AND r.encounter_type = 'inpatient'
AND r.admit_date > i.discharge_date
AND r.admit_date <= i.discharge_date + INTERVAL '30 days'
AND r.encounter_id <> i.index_encounter_id
GROUP BY i.index_encounter_id, i.patient_id
)
SELECT
COUNT(*) FILTER (WHERE r.index_encounter_id IS NOT NULL) * 100.0 /
COUNT(*) AS readmit_rate_pct,
COUNT(*) AS index_admissions
FROM index_admissions i
LEFT JOIN readmits r USING (index_encounter_id);Claims vs clinical: the reconciliation problem
Healthcare organizations run on two parallel data systems: claims (what the payer was billed) and clinical (what the EHR recorded). They describe the same encounters and disagree constantly. Reconciling them is a standing analyst task.
- Timing disagrees. Claim service-date can be the discharge date; clinical admit-date is when the patient came in. For a 7-day stay those are different weeks.
- Counts disagree. Claims may bundle multiple clinical encounters into one; the EHR may have "visits" that claims doesn’t bill.
- Diagnoses disagree. Clinical ICD codes include problem list; claim ICD codes are only the billed ones. The claim underrepresents the clinical complexity.
The pattern: FULL OUTER JOIN the two sources on patient_id + encounter date window, then categorize every row as "in both," "claims only," or "clinical only." The non-matching rows ARE the analysis — they tell you where the two systems drift.
PHI: the handling rules that keep you employed
A short list that every healthcare analyst internalizes in week one. These aren’t guidelines; breaking them has gotten analysts walked out by security.
- Never export patient-identified data to your laptop. Not even "for a quick Excel pivot." The moment raw PHI leaves the warehouse, you’ve created a breach risk that legal will now spend a month chasing.
- De-identify before sharing with non-clinical staff. Replace patient_id with a hashed surrogate; drop name, DOB, MRN. The HIPAA Safe Harbor list has 18 identifiers to strip.
- Row counts below 11 round up or are suppressed. In public-facing reports, any cell with n<11 patients is typically reported as "<11" to prevent re-identification. Internal numbers can be exact; external ones can’t.
- Audit log everything. Most warehouses log queries against PHI-tagged tables. Assume your queries are reviewed. Don’t run "SELECT * FROM patients" out of curiosity.
Compute the 30-day readmission rate by admitting department, but apply the small-cell suppression rule (show "<11" as the count when a department has fewer than 11 index admissions in the window).
fact_encounters(encounter_id, patient_id, department_id, encounter_type, admit_date, discharge_date, discharged_alive), dim_departments(department_id, name).
One row per department. Columns: department_name, index_admissions_display (either the count or "<11"), readmit_rate_pct (NULL if index_admissions < 11).
Show solution
WITH ia AS (
SELECT encounter_id, patient_id, department_id, discharge_date
FROM fact_encounters
WHERE encounter_type = 'inpatient'
AND discharged_alive = true
AND discharge_date BETWEEN '2025-01-01' AND '2025-11-30'
),
r AS (
SELECT ia.encounter_id
FROM ia
JOIN fact_encounters f
ON f.patient_id = ia.patient_id
AND f.encounter_type = 'inpatient'
AND f.admit_date > ia.discharge_date
AND f.admit_date <= ia.discharge_date + INTERVAL '30 days'
AND f.encounter_id <> ia.encounter_id
GROUP BY ia.encounter_id
),
rates AS (
SELECT
ia.department_id,
COUNT(*) AS n,
COUNT(r.encounter_id) * 100.0 / COUNT(*) AS readmit_rate_pct
FROM ia
LEFT JOIN r USING (encounter_id)
GROUP BY ia.department_id
)
SELECT
d.name AS department_name,
CASE WHEN rates.n < 11 THEN '<11' ELSE rates.n::text END AS index_admissions_display,
CASE WHEN rates.n < 11 THEN NULL ELSE rates.readmit_rate_pct END AS readmit_rate_pct
FROM rates
JOIN dim_departments d USING (department_id)
ORDER BY d.name;Healthcare SQL rewards analysts who internalize the domain constraints before the syntax. A perfectly-written query against a wrong HEDIS definition is worse than a slower query against the right one — because the first one ships a number that legal, compliance, and CMS will all have to live with.