Every nonprofit with more than a few thousand donors has a duplication problem. Sometimes it is small — a stray Jen vs Jennifer in the spring appeal list. Sometimes it is large enough to misstate annual giving on the 990. The work of finding and merging those duplicates lives almost entirely in SQL, because the underlying matching logic is row-level set work that no CRM does well by default. This piece is a working playbook — five layers of matching, a survivorship strategy, and the operational guardrails that keep a dedup pass from becoming a one-way data destruction event.
The schema we will work against has three tables. donors(donor_id, first_name, last_name, email, phone, address1, city, state, zip, created_at) is the master list. gifts(gift_id, donor_id, amount, gift_date, campaign_code) records every contribution. donor_aliases(alias_id, donor_id, source, alias_text) is the audit table you will build along the way — every time you decide two records are the same person, you log the prior identifier here so the merge is reversible. Reversibility is the single feature that distinguishes a serious dedup project from a destructive one.
Why donor records duplicate
The duplicates are not random. They cluster around three intake patterns. Multi-channel intake is the biggest source — the same person gives once on the website (autocompleted "Jennifer M Lopez", lowercased email), then mails a check next quarter (data-entry tech types "Jen Lopez", uppercase email), then attends a gala where the badge says "Jenn Lopez". Three records, one human. Household conflation is the second — spouses share an address and sometimes a phone, and a careless import treats the household as a single donor or splits a couple into duplicates depending on which way the schema bends. Imports from acquired lists are the third — when a chapter merger or a lapsed-donor reactivation campaign brings in a third-party list, the new rows almost always overlap the existing master.
Notice that all three of these produce duplicates whose fields do not match exactly. Nicknames, casing drift, missing middle initials, mistyped ZIPs, gendered last-name changes. A naive GROUP BY first_name, last_name will miss most of them. That is why the playbook is layered.
The cost of dupes
Three concrete costs justify the work. They are also the three numbers a dev director will quote in a board meeting if asked why a data analyst is spending a week on this.
- Donor count overstatement. If 8% of donor records are duplicates, the public-facing "donors served" number is inflated by 8%. Foundations notice when the numbers in the grant report do not square with the data warehouse.
- Solicitation friction. A donor who receives the spring appeal addressed to three slightly different versions of their own name has a much shorter fuse. The complaint goes to the board chair, not the data team.
- Receipt and tax-letter risk. Year-end IRS receipt totals must match the donor of record. If a single human has gifts split across three records, the receipt under any one of them undercounts. The donor calls. The CPA calls. The auditor calls.
Layer 1 — Exact match (the false floor)
Every dedup pass starts with the trivially obvious matches: rows where every important field is byte-identical. They almost never exist in raw data, but the query is the floor — a sanity check, and the place to confirm the schema before layering anything more interesting.
-- Layer 1: rows that are byte-identical on the high-signal fields.
-- In practice this finds maybe 1-2% of true duplicates. The point is to
-- shake out CRM ingestion bugs (the same record imported twice).
SELECT
LOWER(email) AS norm_email,
LOWER(first_name) || ' ' || LOWER(last_name) AS norm_name,
COUNT(*) AS rec_count,
STRING_AGG(donor_id::text, ', ') AS donor_ids
FROM donors
WHERE email IS NOT NULL
GROUP BY 1, 2
HAVING COUNT(*) > 1
ORDER BY rec_count DESC;

Two design choices in that query are worth calling out. The first: even at "exact match" we are already lowercasing, because the email pipeline almost certainly mixes cases. Treating Jen@example.com and jen@example.com as different rows is a CRM bug, not a feature. The second: we group by the pair (email, name) rather than just email — because a shared family email (smiths@example.com for the parents) is a real-world pattern and we do not want to merge a household into a single donor by accident.
Layer 2 — Normalized matching
Most "exact" dupes are actually exact-after-normalization. Strip whitespace, lowercase everything, drop punctuation, and a surprising fraction of the duplicate set falls out. The trick is to compute the normalized fields once, in a CTE, and then group on those.
WITH normalized AS (
SELECT
donor_id,
LOWER(TRIM(email)) AS n_email,
LOWER(REGEXP_REPLACE(TRIM(first_name), '[^a-z]', '', 'gi')) AS n_first,
LOWER(REGEXP_REPLACE(TRIM(last_name), '[^a-z]', '', 'gi')) AS n_last,
REGEXP_REPLACE(COALESCE(zip, ''), '\D', '', 'g') AS n_zip5
FROM donors
)
SELECT
n_email,
n_first,
n_last,
LEFT(n_zip5, 5) AS n_zip5,
COUNT(*) AS rec_count,
ARRAY_AGG(donor_id ORDER BY donor_id) AS donor_ids
FROM normalized
GROUP BY 1, 2, 3, 4
HAVING COUNT(*) > 1
ORDER BY rec_count DESC;

Two things change here. We strip every non-alpha character from names — apostrophes, hyphens, periods after middle initials, accent marks if your collation passes them through — because every one of those is a vector for two genuine duplicates to look unequal. We also normalize the ZIP to a five-digit prefix; ZIP+4 variants are a common reason a household pair fails to match on a literal =.
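The same normalization is easy to prototype outside the warehouse before committing it to SQL. A minimal Python sketch (the function name and sample values are illustrative, not part of the schema above):

```python
import re

def normalize_donor(first_name, last_name, email, zip_code):
    """Mirror the Layer 2 SQL: trim, lowercase, strip non-alpha
    characters from names, and reduce the ZIP to its 5-digit prefix."""
    n_first = re.sub(r"[^a-z]", "", (first_name or "").strip().lower())
    n_last = re.sub(r"[^a-z]", "", (last_name or "").strip().lower())
    n_email = (email or "").strip().lower()
    n_zip5 = re.sub(r"\D", "", zip_code or "")[:5]
    return (n_email, n_first, n_last, n_zip5)

# "J. O'Brien-Smith" with a ZIP+4 and " j obriensmith " with a bare ZIP
# normalize to the same tuple, so the GROUP BY collapses them:
a = normalize_donor(" J.", "O'Brien-Smith", "JOB@Example.com", "02134-1187")
b = normalize_donor("j", "obriensmith", "job@example.com ", "02134")
```

Having the rule in one small function also gives you a place to unit-test edge cases (empty fields, ZIP+4, punctuation) before they hit the candidate query.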
Exercise: extend the normalized query above to also flag rows where the email matches but the normalized name does not — those are candidates for household merges (one email shared between spouses) rather than donor merges. Return one row per shared email with a count of distinct normalized names sharing it.

Schema: donors(donor_id, first_name, last_name, email, zip). Use the same normalization as the example.

Expected output: one row per email that has 2+ distinct normalized names. Columns: n_email, distinct_name_count, donor_ids. Order by distinct_name_count desc.
Solution:
WITH normalized AS (
SELECT
donor_id,
LOWER(TRIM(email)) AS n_email,
LOWER(REGEXP_REPLACE(TRIM(first_name), '[^a-z]', '', 'gi'))
|| ' '
|| LOWER(REGEXP_REPLACE(TRIM(last_name), '[^a-z]', '', 'gi')) AS n_full
FROM donors
WHERE email IS NOT NULL
)
SELECT
n_email,
COUNT(DISTINCT n_full) AS distinct_name_count,
ARRAY_AGG(donor_id ORDER BY donor_id) AS donor_ids
FROM normalized
GROUP BY n_email
HAVING COUNT(DISTINCT n_full) > 1
ORDER BY distinct_name_count DESC;

Layer 3 — Phonetic codes
Normalization gets you the duplicates that are typographically close. It does not get you the ones where the name was typed by ear — a phone-bank volunteer transcribing a pledge, an event check-in writing what they hear. "Katherine" and "Kathryn" are three character edits apart but hash to the same Soundex code (K365). Phonetic indexing is the cheapest way to catch that class of dupe.
-- Layer 3: phonetic match on last name within the same ZIP5.
-- The ZIP5 anchor is essential — Soundex collisions are common
-- (Smith/Smyth, but also Lee/Lay), so we need a second axis to keep
-- the candidate set tractable.
SELECT
SOUNDEX(last_name) AS last_soundex,
LEFT(REGEXP_REPLACE(zip, '\D', '', 'g'), 5) AS zip5,
COUNT(*) AS rec_count,
ARRAY_AGG(donor_id || ' (' || first_name || ' ' || last_name || ')'
ORDER BY donor_id) AS candidates
FROM donors
WHERE last_name IS NOT NULL
AND zip IS NOT NULL
GROUP BY 1, 2
HAVING COUNT(*) > 1
ORDER BY rec_count DESC;

Where Soundex falls short — and it does, on names of non-English origin in particular — Metaphone and Double Metaphone are the next step up. Postgres ships both via fuzzystrmatch. Snowflake users typically wrap a JavaScript UDF; BigQuery users live with Soundex. The shape of the query is identical: hash the name, group by the hash within a ZIP or city, surface clusters with more than one row.
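For intuition about what the hash actually does, classic American Soundex fits in a few lines of Python. This is an illustrative reimplementation, not the warehouse's exact function (vendors differ slightly on H/W edge cases), but it reproduces the collisions discussed above:

```python
def soundex(name):
    """Classic American Soundex: keep the first letter, encode the
    rest as digits, collapse adjacent duplicates, pad/truncate to 4."""
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    first = name[0]
    digits = []
    prev = codes.get(first, "")
    for ch in name[1:]:
        if ch in "HW":                 # H and W do not reset the previous code
            continue
        code = codes.get(ch, "")       # vowels map to "" and reset prev
        if code and code != prev:
            digits.append(code)
        prev = code
    return (first + "".join(digits) + "000")[:4]
```

Running it shows both the catch (Katherine/Kathryn collide on K365) and the limitation the comment in the SQL warns about (Smith/Smyth and Lee/Lay also collide, which is why the ZIP5 anchor matters).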
Layer 4 — Edit distance with Levenshtein
Phonetic codes catch sound-alikes; edit distance catches typo-alikes. Levenshtein returns the minimum number of single-character edits — insert, delete, substitute — that turn one string into another. "Marshall" vs "Marshal" is distance 1. "Stephen" vs "Steven" is distance 2. The metric is well-understood and ships in every modern warehouse, so the SQL is the easy part. Choosing the threshold and the candidate gate is where the work lives.
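The metric itself is simple enough to verify by hand. A throwaway sketch of the textbook dynamic program (illustrative only; in the warehouse you call the built-in):

```python
def levenshtein(s, t):
    """Edit distance via the classic DP: after processing s[:i],
    prev[j] holds the distance between s[:i] and t[:j]."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]
```

It confirms the examples in the text: "marshall" vs "marshal" is 1, "stephen" vs "steven" is 2.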
-- Layer 4: Levenshtein within a ZIP5 cohort.
-- A self-join on donors is O(n^2) without a gating predicate, so we
-- anchor on ZIP5 first to keep candidate pairs tractable.
SELECT
a.donor_id AS donor_a,
b.donor_id AS donor_b,
a.first_name || ' ' || a.last_name AS name_a,
b.first_name || ' ' || b.last_name AS name_b,
LEVENSHTEIN(LOWER(a.last_name), LOWER(b.last_name)) AS last_dist,
LEVENSHTEIN(LOWER(a.first_name), LOWER(b.first_name)) AS first_dist
FROM donors a
JOIN donors b
ON a.donor_id < b.donor_id
AND LEFT(REGEXP_REPLACE(a.zip, '\D', '', 'g'), 5)
= LEFT(REGEXP_REPLACE(b.zip, '\D', '', 'g'), 5)
WHERE LEVENSHTEIN(LOWER(a.last_name), LOWER(b.last_name)) <= 2
AND LEVENSHTEIN(LOWER(a.first_name), LOWER(b.first_name)) <= 2
ORDER BY last_dist + first_dist, name_a;

The query above casts a deliberately wide net at (2, 2) to build the candidate set. For the merge decision itself, a more conservative default is distance ≤ 1 on last name plus distance ≤ 2 on first name, which catches the majority of single-keystroke typos and short nicknames without a flood of false positives. The right threshold for your data is empirical — generate the candidate set, sample 50 pairs, count how many are real duplicates, and tune. Most analysts settle around (1, 2) for last + first; teams with very dirty intake data go to (2, 3) and pay for it with a longer review queue.
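That tuning loop can be made concrete. Once you have hand-labeled a sample of candidate pairs, precision at each threshold is a few lines. A sketch with invented labels (the tuples below are illustrative data, not from any real list):

```python
# Each tuple: (last_dist, first_dist, is_real_duplicate) from a
# hand-labeled sample of candidate pairs. Values are invented.
labeled = [(0, 0, True), (0, 1, True), (1, 0, True), (1, 2, True),
           (2, 0, False), (2, 2, False), (1, 1, True), (2, 3, False)]

def precision_at(max_last, max_first):
    """Share of pairs passing the gate that are true duplicates."""
    hits = [dup for last, first, dup in labeled
            if last <= max_last and first <= max_first]
    return sum(hits) / len(hits) if hits else None

p12 = precision_at(1, 2)   # the conservative (1, 2) default
p23 = precision_at(2, 3)   # looser gate, longer review queue
```

On this toy sample the tight gate is perfectly precise while the loose gate admits false positives, which is exactly the trade the text describes: looser thresholds buy recall at the cost of reviewer time.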
Exercise: build a candidate query that finds donor pairs where the email local-part (everything before the @) has Levenshtein distance ≤ 1 and the last name is identical (case-insensitive). This catches the "jlopez@gmail.com vs jlopez1@gmail.com" pattern that the ZIP5 gate misses when the two records carry different ZIPs (a donor who moved between gifts, say).

Schema: donors(donor_id, first_name, last_name, email). Use SPLIT_PART(email, '@', 1) for the local-part (Postgres) — Snowflake/BigQuery have analogous string-split helpers.

Expected output: one row per donor pair (a.donor_id < b.donor_id) meeting the criteria. Columns: donor_a, donor_b, email_a, email_b, last_name, local_dist.
Solution:
WITH d AS (
SELECT
donor_id,
last_name,
email,
SPLIT_PART(email, '@', 1) AS email_local
FROM donors
WHERE email IS NOT NULL
)
SELECT
a.donor_id AS donor_a,
b.donor_id AS donor_b,
a.email AS email_a,
b.email AS email_b,
a.last_name AS last_name,
LEVENSHTEIN(a.email_local, b.email_local) AS local_dist
FROM d a
JOIN d b
ON a.donor_id < b.donor_id
AND LOWER(a.last_name) = LOWER(b.last_name)
WHERE LEVENSHTEIN(a.email_local, b.email_local) <= 1
ORDER BY local_dist, a.donor_id;

Layer 5 — Composite scoring
No single layer is decisive. The serious version of donor dedup is a composite score across four to six signals — name similarity, email similarity, address similarity, phone match, household-anchor match — each weighted by how informative it is, with a threshold above which a pair is auto-merged and a band below which a pair goes to a review queue.
-- Layer 5: a composite score in [0, 100]. Each signal contributes
-- a fixed weight; weights are tuned empirically and version-controlled
-- (think dbt model with an annotated yaml of weights).
WITH pairs AS (
SELECT
a.donor_id AS donor_a, b.donor_id AS donor_b,
LEVENSHTEIN(LOWER(a.last_name), LOWER(b.last_name)) AS last_dist,
LEVENSHTEIN(LOWER(a.first_name), LOWER(b.first_name)) AS first_dist,
CASE WHEN LOWER(a.email) = LOWER(b.email) THEN 1 ELSE 0 END AS email_eq,
CASE WHEN REGEXP_REPLACE(a.phone, '\D', '', 'g')
= REGEXP_REPLACE(b.phone, '\D', '', 'g') THEN 1 ELSE 0 END AS phone_eq,
CASE WHEN LEFT(REGEXP_REPLACE(a.zip, '\D', '', 'g'), 5)
= LEFT(REGEXP_REPLACE(b.zip, '\D', '', 'g'), 5) THEN 1 ELSE 0 END AS zip_eq,
CASE WHEN SOUNDEX(a.last_name) = SOUNDEX(b.last_name) THEN 1 ELSE 0 END AS soundex_eq
FROM donors a
JOIN donors b ON a.donor_id < b.donor_id
AND LEFT(REGEXP_REPLACE(a.zip, '\D', '', 'g'), 5)
= LEFT(REGEXP_REPLACE(b.zip, '\D', '', 'g'), 5)
)
SELECT
donor_a, donor_b,
-- Each signal contributes up to its weight; max possible = 100.
(CASE WHEN last_dist = 0 THEN 30 WHEN last_dist <= 2 THEN 20 ELSE 0 END) +
(CASE WHEN first_dist = 0 THEN 20 WHEN first_dist <= 2 THEN 12 ELSE 0 END) +
(email_eq * 25) +
(phone_eq * 15) +
(zip_eq * 5) +
(soundex_eq * 5) AS match_score
FROM pairs
WHERE
-- Pre-filter: at least one strong signal must fire, or we waste cycles.
(last_dist <= 2 OR email_eq = 1 OR phone_eq = 1)
ORDER BY match_score DESC;A score above ~70 with no contradicting field (different birthdate, different known-good phone) is a safe auto-merge. A score in the 40–70 band goes to a review queue — a human looks at the candidate pair, picks merge or no-merge, and the decision feeds back into the audit log. Below 40, drop. Calibrate the bands against a hand-labeled validation set of a few hundred pairs; the weights and thresholds drift over time as the underlying intake patterns change.
Picking the surviving record
Once two records are flagged as the same person, you have to choose which one survives. The survivorship rule is a deterministic policy that picks one row per cluster as the master and folds the others into it. The rule should be written down once and applied uniformly — that is the difference between a defensible dedup pass and one a board member can rip apart in a meeting.
-- A practical survivorship policy: keep the row with the most non-null
-- high-signal fields, breaking ties by lifetime giving, then by oldest
-- created_at (preserves the historical donor_id whenever possible).
WITH cluster_members AS (
SELECT
c.cluster_id, -- assigned upstream: donor_clusters(cluster_id, donor_id) maps each donor to its match cluster
d.donor_id,
d.created_at,
(CASE WHEN d.email IS NOT NULL THEN 1 ELSE 0 END
+ CASE WHEN d.phone IS NOT NULL THEN 1 ELSE 0 END
+ CASE WHEN d.address1 IS NOT NULL THEN 1 ELSE 0 END
+ CASE WHEN d.zip IS NOT NULL THEN 1 ELSE 0 END) AS field_completeness,
COALESCE((SELECT SUM(amount) FROM gifts g WHERE g.donor_id = d.donor_id), 0)
AS lifetime_giving
FROM donor_clusters c
JOIN donors d ON d.donor_id = c.donor_id
)
SELECT DISTINCT ON (cluster_id)
cluster_id,
donor_id AS surviving_donor_id
FROM cluster_members
ORDER BY
cluster_id,
field_completeness DESC,
lifetime_giving DESC,
created_at ASC; -- oldest record wins ties

Two notes. First, DISTINCT ON is Postgres-specific; in Snowflake or BigQuery you would use a ROW_NUMBER() OVER (PARTITION BY cluster_id ORDER BY ...) window with a WHERE rn = 1 filter. Same shape, different syntax. Second, never delete the merged-away rows. Soft-delete by setting a merged_into foreign key on each non-surviving record pointing at the surviving one. Future imports will hit those tombstone rows when they collide on legacy donor_id, and the system can resolve them automatically. Reversibility is the lever that lets you sleep at night.
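Because the survivorship rule is just an ordering, it is cheap to unit-test as a sort key before trusting it to a warehouse job. A sketch with invented cluster rows (tuples and values are illustrative):

```python
from datetime import date

# (donor_id, field_completeness, lifetime_giving, created_at) — invented.
cluster = [
    (101, 3, 250.0, date(2019, 4, 2)),
    (102, 4, 250.0, date(2021, 8, 9)),
    (103, 4, 250.0, date(2016, 1, 15)),
]

def survivor(members):
    """Most complete record wins; ties broken by highest lifetime
    giving, then oldest created_at (preserves the historical donor_id).
    min() with negated numeric keys mirrors the SQL's DESC/DESC/ASC."""
    return min(members, key=lambda m: (-m[1], -m[2], m[3]))[0]
```

Here 102 and 103 tie on completeness and giving, so the oldest record (103) survives, exactly what the ORDER BY in the SQL would pick.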
Operationalizing the dedup workflow
A one-off SQL pass is a science project. The dedup workflow becomes infrastructure when three pieces are in place:
- Staging table. Every candidate pair lands in donor_dedup_candidates(pair_id, donor_a, donor_b, match_score, signals, surfaced_at) with the per-signal scores. The composite query writes here. The review UI reads from here. The auto-merge job reads from here above the threshold.
- Review queue. A simple two-button UI (merge / not the same person) over the rows in the 40–70 band. Each decision writes to donor_dedup_decisions(pair_id, decision, decided_by, decided_at, notes). The decisions table is the training set for tuning thresholds — once you have a few hundred labeled pairs, you can compute precision and recall per band and adjust.
- Audit log. The donor_aliases(alias_id, donor_id, source, alias_text, merged_at) table records every merge so a future analyst can trace why donor 4471 used to be three separate records. This table is also what powers the "did this person give before under a different name?" lookup that fundraisers ask about more than any other.
The composite query reruns weekly — on a schedule, not manually. New donor records flow in through the week; old ones get edits; the candidate set shifts. Treating the dedup pass as a recurring batch (a dbt build --select donor_dedup or a cron line) is what turns it from a project into a process.
Gotchas that bite every dedup project
These are the failure modes that show up in every nonprofit dedup retrospective. Knowing them in advance is most of the battle.
- Suffix collisions. "John Smith Jr" and "John Smith Sr" at the same address are identical once suffixes are stripped (Levenshtein distance 0 on the canonical name). If the schema does not capture the suffix, the father and son will merge. Strip suffixes only after preserving them in a separate column.
- Married-name changes. A donor who gives under their birth name in 2018 and their married name in 2024 has zero last-name overlap. The composite score with email and phone weight is what saves you here — the name layer alone will not.
- Shared family emails.
smiths@example.comcovers a married couple. Aggressive email-based merging will fold two donors into one; treat email as a strong but not decisive signal, and require a name signal in agreement before auto-merging. - Anonymizing imports. A pledge-card import sometimes comes through with
first_name = "Anonymous"and a generated email. These rows should be flagged and excluded from the candidate pool — they will collide with each other for the wrong reasons. - Data-entry placeholders.
999-999-9999for phone,00000for ZIP,noemail@example.com. Build a list of obvious placeholders and treat a placeholder match as zero signal, not a positive one. - Time-shifted addresses. A donor who moved between two gifts has a different address but is the same person. Address contributes weight only when it agrees; absence of agreement is not evidence of difference.
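The placeholder rule is easy to encode as a pre-scoring scrub that nulls out junk values so they never fire a match signal. A sketch (the placeholder lists are illustrative; build the real ones by mining your own data for suspiciously frequent values):

```python
import re

# Illustrative placeholder lists — mine your own data for the real ones.
PLACEHOLDER_PHONES = {"9999999999", "0000000000", "1234567890"}
PLACEHOLDER_ZIPS = {"00000", "99999"}
PLACEHOLDER_EMAILS = {"noemail@example.com", "none@none.com"}

def scrub(phone, zip5, email):
    """Null out placeholders so a placeholder-to-placeholder match
    contributes zero signal instead of a false positive."""
    digits = re.sub(r"\D", "", phone or "")
    n_email = (email or "").strip().lower()
    return (
        digits if digits and digits not in PLACEHOLDER_PHONES else None,
        zip5 if zip5 and zip5 not in PLACEHOLDER_ZIPS else None,
        n_email if n_email and n_email not in PLACEHOLDER_EMAILS else None,
    )
```

Running the scrub before the pair-generation join means the composite query's equality checks see NULL for placeholders, and NULL = NULL is not true in SQL, which is exactly the zero-signal behavior you want.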
The single best instinct to develop is bias toward the review queue. When in doubt, surface the pair to a human. The cost of an extra item in the review queue is a few seconds of reviewer time. The cost of an incorrect auto-merge — collapsing two real donors into one and losing the giving history of one of them — is a phone call to the development director and a multi-hour reconstruction job. Set the auto-merge threshold high. Trust the queue.
If you want hands-on practice with the underlying patterns — Levenshtein, Soundex, candidate-set gating, composite scoring — the AML / Fraud Analytics path on caseSQL uses the same techniques against a sanctions-screening problem. Different domain, identical SQL. Once you are fluent in the matching layer, swapping the data is a thirty-minute job.
Exercise: combine layers 4 and 5 in a single query that surfaces donor pairs with a composite match_score ≥ 50, ordered descending. Anchor on ZIP5; use the same weights as the example (last 30/20, first 20/12, email 25, phone 15, ZIP 5, Soundex 5). Return donor_a, donor_b, name_a, name_b, match_score.

Schema: donors(donor_id, first_name, last_name, email, phone, zip). Use LEVENSHTEIN, SOUNDEX, and REGEXP_REPLACE as in the prior examples.

Expected output: one row per qualifying pair. Columns: donor_a, donor_b, name_a, name_b, match_score. Order by match_score desc, then donor_a.
Solution:
WITH pairs AS (
SELECT
a.donor_id AS donor_a, b.donor_id AS donor_b,
a.first_name || ' ' || a.last_name AS name_a,
b.first_name || ' ' || b.last_name AS name_b,
LEVENSHTEIN(LOWER(a.last_name), LOWER(b.last_name)) AS last_dist,
LEVENSHTEIN(LOWER(a.first_name), LOWER(b.first_name)) AS first_dist,
CASE WHEN LOWER(a.email) = LOWER(b.email) THEN 1 ELSE 0 END AS email_eq,
CASE WHEN REGEXP_REPLACE(a.phone, '\D', '', 'g')
= REGEXP_REPLACE(b.phone, '\D', '', 'g') THEN 1 ELSE 0 END AS phone_eq,
CASE WHEN SOUNDEX(a.last_name) = SOUNDEX(b.last_name) THEN 1 ELSE 0 END AS soundex_eq
FROM donors a
JOIN donors b ON a.donor_id < b.donor_id
AND LEFT(REGEXP_REPLACE(a.zip, '\D', '', 'g'), 5)
= LEFT(REGEXP_REPLACE(b.zip, '\D', '', 'g'), 5)
)
SELECT
donor_a, donor_b, name_a, name_b,
(CASE WHEN last_dist = 0 THEN 30 WHEN last_dist <= 2 THEN 20 ELSE 0 END) +
(CASE WHEN first_dist = 0 THEN 20 WHEN first_dist <= 2 THEN 12 ELSE 0 END) +
(email_eq * 25) +
(phone_eq * 15) +
(5) + -- ZIP5 already enforced by the join
(soundex_eq * 5) AS match_score
FROM pairs
WHERE
(CASE WHEN last_dist = 0 THEN 30 WHEN last_dist <= 2 THEN 20 ELSE 0 END) +
(CASE WHEN first_dist = 0 THEN 20 WHEN first_dist <= 2 THEN 12 ELSE 0 END) +
(email_eq * 25) + (phone_eq * 15) + 5 + (soundex_eq * 5)
>= 50
ORDER BY match_score DESC, donor_a;