Deterministic Record Linking

Deterministic Record Linking University of North Carolina, Chapel Hill Hye-Chung Kum


Transcript of Deterministic Record Linking

Page 1: Deterministic Record Linking

Deterministic Record Linking

University of North Carolina, Chapel Hill

Hye-Chung Kum

Page 2: Deterministic Record Linking

Example

EISID  ssn           first name  last name  MI  DOB
E1     085-66-9980   Sally       Hill       L   3/4/1999
E2     143-25-9304   Emily       Brown      K   6/2/2004
E3     354-563-2343  Mary        Johnson    G   5/13/1983
E4     532-34-9183   David       Ford       J   10/25/1990

SISID  ssn           first name  last name  MI  DOB
S1     085-66-9980   Sally       Hill       L   3/4/1999
S2     143-52-9304   Emily       Brown      K   6/2/2004
S3     354-563-2343  Mary        Hawkins    J   5/13/1983
S4     532-34-9183   David       Ford       J   10/23/1990

Page 3: Deterministic Record Linking

Exact Match

EISID : E1, ssn : 085-66-9980, first name : Sally, last name : Hill, MI : L, DOB : 3/4/1999

SISID : S1, ssn : 085-66-9980, first name : Sally, last name : Hill, MI : L, DOB : 3/4/1999

Page 4: Deterministic Record Linking

Approximate Matching I : SSN

EISID : E2, ssn : 143-25-9304, first name : Emily, last name : Brown, MI : K, DOB : 6/2/2004

SISID : S2, ssn : 143-52-9304, first name : Emily, last name : Brown, MI : K, DOB : 6/2/2004

Difference: ssn digits 25 vs 52 (transposed)

Page 5: Deterministic Record Linking

Approximate Matching II : DOB

EISID : E4, ssn : 532-34-9183, first name : David, last name : Ford, MI : J, DOB : 10/25/1990

SISID : S4, ssn : 532-34-9183, first name : David, last name : Ford, MI : J, DOB : 10/23/1990

Difference: DOB (10/25/1990 vs 10/23/1990)

Page 6: Deterministic Record Linking

Approximate Matching III : Name

EISID : E3, ssn : 354-563-2343, first name : Mary, last name : Johnson, MI : G, DOB : 5/13/1983

SISID : S3, ssn : 354-563-2343, first name : Mary, last name : Hawkins, MI : J, DOB : 5/13/1983

Difference: last name (Johnson vs Hawkins) and MI (G vs J)

Page 7: Deterministic Record Linking

Deterministic Record Linking

Allow for approximate matching
Use explicit approximate rules
Pros : can control the linkage process
Con : difficult to implement
Alternative : Probabilistic record linking
– Also approximate matching
– However, uses general rules specified by users
– Based on total probability
– Con : cannot control exactly what to consider a match or not
– Pros : can use specialized software

Page 8: Deterministic Record Linking

Approximate Matching : DOB

element to element match : date, month, year
Allow for one element difference
Allow for month and day transposed

DOB : one element
dob1 : 10/25/1990
dob2 : 10/23/1990

DOB : transpose
dob1 : 11/7/1995
dob2 : 7/11/1995
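The two DOB rules on this slide can be sketched in a few lines. This is only an illustration, not the author's implementation: the function name `dob_match` and the labels it returns are assumptions made for the example.

```python
from datetime import date


def dob_match(dob1: date, dob2: date) -> str:
    """Classify two dates of birth (hypothetical labels, not from the slides)."""
    a = (dob1.month, dob1.day, dob1.year)
    b = (dob2.month, dob2.day, dob2.year)
    if a == b:
        return "equal"
    # Month and day swapped within the same year, e.g. 11/7/1995 vs 7/11/1995.
    if dob1.year == dob2.year and dob1.month == dob2.day and dob1.day == dob2.month:
        return "transpose"
    # Exactly one of month, day, year differs, e.g. 10/25/1990 vs 10/23/1990.
    if sum(x != y for x, y in zip(a, b)) == 1:
        return "one_element"
    return "diff"


print(dob_match(date(1990, 10, 25), date(1990, 10, 23)))
print(dob_match(date(1995, 11, 7), date(1995, 7, 11)))
```

The transpose test runs before the one-element test because a swap changes two elements at once.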

Page 9: Deterministic Record Linking

Approximate Matching : Name

First name soundex match
First name is approx
– one letter different insert or replace
– and/or substr
lsound equal or lname approx
– MI=FI
– FI equal
Fsound & Lsound swapped

obs  fname      kfname     mi  kmi
1    RUDOLPH    RULDOLPH   A   A
2    ALIJAH     ALIYAH     M
3    CAROL      CAROLYN    J
4    ANGELIQUE  ANGIE      D
5    JOHNNY     JOHNNY JR  L
6    ZACHARY    ZACK       L
7    J          MICHAEL    M
8    ANTON      COUDRAY    C   A
9    ARTHUR     AUTHOR     R   R
10   EDWIN      EDDIE
11   GOLDY      OWENS      A   A

Page 10: Deterministic Record Linking

Approximate Matching : Name

obs  fname      kfname     mi  kmi  lname      klname
1    RUDOLPH    RULDOLPH   A   A    SIMARD     SIMARD
2    ALIJAH     ALIYAH     M        FOSS       FOSS
3    CAROL      CAROLYN    J        YOUNG      YOUNG
4    ANGELIQUE  ANGIE      D        OUELLETTE  OUELLETTE
5    JOHNNY     JOHNNY JR  L        MAYO       MAYO
6    ZACHARY    ZACK       L        ROGERS     ROGERS
7    J          MICHAEL    M        GALLAGHER  GALLAGHER
8    ANTON      COUDRAY    C   A    CYPRESS    CYPRESS
9    ARTHUR     AUTHOR     R   R    DAVIS      DAVIS
10   EDWIN      EDDIE               KAHKONE    KAHKONE
11   GOLDY      OWENS      A   A    OWENS      GOLDY
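The name comparisons above (soundex match, one letter inserted or replaced, substring) can be illustrated with a short sketch. The slides do not say which soundex variant was used, so the code below assumes standard American Soundex. The helper names `soundex`, `one_letter_diff` and `fname_approx` are hypothetical.

```python
def soundex(name: str) -> str:
    """Standard American Soundex (an assumption; the slides name no variant)."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = codes.get(ch, "")  # vowels get "" and reset the previous code
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]


def one_letter_diff(a: str, b: str) -> bool:
    """True if a and b differ by exactly one inserted or replaced letter."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = sorted((a, b), key=len)
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def fname_approx(a: str, b: str) -> bool:
    """One-letter difference and/or one name being a substring of the other."""
    return one_letter_diff(a, b) or a in b or b in a


print(one_letter_diff("RUDOLPH", "RULDOLPH"))  # inserted L
print(one_letter_diff("ALIJAH", "ALIYAH"))     # replaced J with Y
print(fname_approx("CAROL", "CAROLYN"))        # substring
```

A swapped first and last name, as in the GOLDY / OWENS row, could then be tested by comparing `soundex(fname)` with `soundex(klname)` and `soundex(lname)` with `soundex(kfname)`.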

Page 11: Deterministic Record Linking

Match on ssn (ssn equal)

1 : dob, fsound equal
dob approx
– 2 : dob approx, fsound equal
– 3 : dob approx, fname approx
– 4 : dob approx, lsound equal, & fsound diff, but MI=FI
– 5 : dob approx, lsound equal, & fsound diff, but FI equal
– 6 : dob approx, lsound and fsound swapped
– 7 : dob approx, lname approx & fsound diff, but MI=FI (4 with lname approx rather than equal) or FI equal (5 with lname approx rather than equal)
dob mismatch
– 8 : fname approx, lsound equal, and dob diff
– 9 : fname approx, lsound approx, and dob diff
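Read as an ordered cascade, the rule list for records with equal ssn could look like the sketch below. It takes already-computed comparison flags rather than raw fields. The `Comparison` class, its field names, and the choice to let the "dob approx" rules also accept an equal dob are assumptions made for illustration. "lsound approx" in rule 9 is treated as `lname_approx`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Comparison:
    """Precomputed field comparisons for one candidate pair (hypothetical)."""
    dob: str = "diff"             # "equal", "approx" or "diff"
    fsound_equal: bool = False
    fname_approx: bool = False
    lsound_equal: bool = False
    lname_approx: bool = False
    mi_eq_fi: bool = False        # middle initial of one = first initial of other
    fi_equal: bool = False
    sounds_swapped: bool = False  # fsound and lsound swapped


def ssn_match_type(c: Comparison) -> Optional[int]:
    """Return the link type 1-9 for a pair with equal ssn, or None."""
    if c.dob == "equal" and c.fsound_equal:
        return 1
    if c.dob in ("equal", "approx"):  # assumption: equal dob also qualifies
        if c.fsound_equal:
            return 2
        if c.fname_approx:
            return 3
        if c.lsound_equal and c.mi_eq_fi:
            return 4
        if c.lsound_equal and c.fi_equal:
            return 5
        if c.sounds_swapped:
            return 6
        if c.lname_approx and (c.mi_eq_fi or c.fi_equal):
            return 7
        return None
    # dob mismatch
    if c.fname_approx and c.lsound_equal:
        return 8
    if c.fname_approx and c.lname_approx:
        return 9
    return None


print(ssn_match_type(Comparison(dob="approx", fsound_equal=True)))
```

Because the rules are tried in order, a pair gets the lowest-numbered type it qualifies for.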

Page 13: Deterministic Record Linking

Approximate Matching : SSN

Digit to digit match
Allow for one digit difference
Allow for two digit difference if transposed

SSN : one digit
ssn1 : 532-34-9183
ssn2 : 532-34-8183

SSN : transpose
ssn1 : 143-25-9304
ssn2 : 143-52-9304
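A minimal sketch of the digit-to-digit SSN comparison follows. The function name `ssn_match` and its return labels are hypothetical. It also assumes "transposed" means two adjacent digits swapped, which is what the example on this slide shows.

```python
import re
from typing import Optional


def ssn_match(ssn1: Optional[str], ssn2: Optional[str]) -> str:
    """Compare two SSNs digit by digit (labels are illustrative only)."""
    a = re.sub(r"\D", "", ssn1 or "")
    b = re.sub(r"\D", "", ssn2 or "")
    if not a or not b:
        return "missing"
    if len(a) != len(b):
        return "diff"
    pos = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if not pos:
        return "equal"
    if len(pos) == 1:
        return "one_digit"
    # Two adjacent positions whose digits are swapped (assumed adjacency).
    if (len(pos) == 2 and pos[1] == pos[0] + 1
            and a[pos[0]] == b[pos[1]] and a[pos[1]] == b[pos[0]]):
        return "transpose"
    return "diff"


print(ssn_match("532-34-9183", "532-34-8183"))
print(ssn_match("143-25-9304", "143-52-9304"))
```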

Page 14: Deterministic Record Linking

Match on ndob (dob+fsound)

ssn missing
– 1: lname equal
– 2: lname approx
ssn approx
– 3: lname equal
– 4: lname approx
– 5: lname diff but fname equal
ssn different
– 11 : lname equal
– 12 : lname approx
lname different
– 51: ssn approx
– 52: ssn missing
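For pairs that already agree on ndob (dob+fsound), the type numbers above amount to a lookup on the ssn status and the last-name status. The sketch below is one possible reading, not the author's code. It assumes that type 5 takes precedence over type 51 when the first names are exactly equal. The function name and status labels are hypothetical.

```python
from typing import Optional


def ndob_match_type(ssn: str, lname: str, fname_equal: bool = False) -> Optional[int]:
    """ssn is "missing", "approx" or "diff"; lname is "equal", "approx" or "diff"."""
    if lname == "diff":
        if ssn == "approx":
            # Assumption: 5 when fname is exactly equal, otherwise 51.
            return 5 if fname_equal else 51
        if ssn == "missing":
            return 52
        return None  # ssn and lname both different: no link
    table = {
        ("missing", "equal"): 1,
        ("missing", "approx"): 2,
        ("approx", "equal"): 3,
        ("approx", "approx"): 4,
        ("diff", "equal"): 11,
        ("diff", "approx"): 12,
    }
    return table.get((ssn, lname))


print(ndob_match_type("approx", "equal"))
print(ndob_match_type("approx", "diff", fname_equal=True))
```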

Page 23: Deterministic Record Linking

obs  SSN        kSSN       fname      kfname     lname            klname
1    244572812  .          APPOLONIA  APPOLONIA  GAVINS           GAVINS
2    .          .          ABEL       ABEL       LOMELIGARCIA     LOMELI
3    248511181  .          JOSH       JOSHUA     PHIPPS           PHIPPS
4    243352045  243352054  LENA       LENA       COOPER           COOPER
5    239565518  239565519  MILES      MILES      KNIGHT JR.       KNIGHT
6    245193584  245493584  MARTHA     MARTHA     LYDA             HOPKINS
7    244779182  244778182  AUSTIN     AUSTYN     TERWILLIGER      OMEARA
8    489987513  489987573  ALISIA     ALICE      GRAVES           WATSON
9    239966568  239966578  ANNA       ANAYA      MONTAGUE         BOLDING
10   227691655  227691633  BRITTNEY   BRITTNEY   REVELS           REVELS
11   242339913  239524402  DANIEL     DANIEL     ROBINSON         ROBINSON
12   221864852  225206017  HELEN      HELEN      HALL             HOLLER
13   240212489  222565604  DEBORAH    DEBRA      LEE              LEACH
14   238995019  .          ABIGAHIL   ABIGAHIL   GARCIATREJO      TREJO
15   .          .          APSLEY     APSLEY     CARLYLE          KARYLE
16   .          .          ABIGAIL    ABIGAIL    GENTRY           KING
17   237999685  .          ABIGAIL    ABIGAIL    RODRIGUEZRINCON  HERNANDEZ
18   237998504  .          ABIGAYLE   ABIGAIL    FITZGERALD       HERNANDEZ

Page 25: Deterministic Record Linking

Match on name (fname+lname)

ssn missing & dob approx
– 1: MI equal
– 7: MI missing
– 8: MI not equal
ssn approx
– 3: dob equal
– dob approx
  4: one element
  5: transpose

obs  ssn        kssn       dob       kdob
1    362201047  326201047  09/06/09  09/06/08
2    313416906  313146906  12/09/75  12/09/76
3    246381056  216381056  07/15/20  07/07/20
4    238013803  238013830  11/12/14  12/11/14
5    241191104  241191103  12/08/94  08/12/94

Page 26: Deterministic Record Linking

Match on name (fname+lname)

obs  Type  ssn        kssn       dob       kdob      fname     lname
1    4     362201047  326201047  09/06/09  09/06/08  MARION    MONTAGUE
2    4     313416906  313146906  12/09/75  12/09/76  WILLIAM   JOHNSON
3    4     246381056  216381056  07/15/20  07/07/20  WILLIE    GRANT
4    5     238013803  238013830  11/12/14  12/11/14  GLADYS    SOUTHARD
5    5     241191104  241191103  12/08/94  08/12/94  TAYLOR    FORD
6    52    272318863  272318860  09/11/77  .         NICOLE    PARKER
7    52    578111173  578111113  07/07/88  .         ASAJAH    ROSS
8    100   120688146  120688142  01/31/05  10/31/99  PATRICIA  BANEGAS
9    100   133680780  133680798  01/12/88  02/12/88  DANIEL    ANDRONIC
10   100   132769052  132759052  02/27/89  11/15/89  VICTORIA  HORN

Page 28: Deterministic Record Linking

link

Put together all links found
Identify indirect duplicates (type2>10000)
– i.e. both EISID1 & EISID2 link to identical SISID1
– Consider indirect duplicates on both EIS & SIS
Create unique link and indirect duplicate files
– Keep only the first id in data file link
– Create indirect duplicates files dupeis2 & dupsis2
TODO : explore indirect duplicates
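One way to picture this step: walk through the combined list of links, keep the first link seen for each id, and divert later conflicting links into the indirect-duplicate lists. The sketch below is an illustration under that assumption. The function name and the tuple layout are hypothetical, and the slides do not say how "first" is decided.

```python
from typing import List, Tuple

Link = Tuple[str, str]  # (eisid, sisid)


def split_indirect_duplicates(links: List[Link]):
    """Return (unique_links, dupeis2, dupsis2) from a list of (eisid, sisid)."""
    eis_for_sis = {}  # sisid -> first eisid linked to it
    sis_for_eis = {}  # eisid -> first sisid linked to it
    unique_links, dupeis2, dupsis2 = [], [], []
    for eisid, sisid in links:
        kept_eis = eis_for_sis.get(sisid)
        kept_sis = sis_for_eis.get(eisid)
        if kept_eis is None and kept_sis is None:
            eis_for_sis[sisid] = eisid
            sis_for_eis[eisid] = sisid
            unique_links.append((eisid, sisid))
        elif kept_eis is not None and kept_eis != eisid:
            # Two EIS ids link to the same SIS id: indirect EIS duplicate.
            dupeis2.append((eisid, kept_eis))
        elif kept_sis is not None and kept_sis != sisid:
            # Two SIS ids link to the same EIS id: indirect SIS duplicate.
            dupsis2.append((sisid, kept_sis))
        # Otherwise the identical link was already recorded.
    return unique_links, dupeis2, dupsis2


links = [("E1", "S1"), ("E2", "S1"), ("E3", "S3"), ("E3", "S4")]
print(split_indirect_duplicates(links))
```

Each entry in the duplicate lists pairs the diverted id with the id that was kept, so the relationship can be explored later.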

Page 29: Deterministic Record Linking

Create unique list of EIS & SIS

Generate unique full list of each set of ids
– use linkage info
– Link in the duplicates (dupeis & dupsis)
– TODO : link in the indirect duplicates
– eis & sis

Page 30: Deterministic Record Linking

Data flow

[Data flow diagram: Link eis to sis]
eisid.sas7bdat (4,308,863): ueis.sas7bdat (unduplicated unique records: 4,277,402, 99%) + dupeis.sas7bdat (duplicates: 31,461)
sisid.sas7bdat (1,888,747): usis.sas7bdat (unduplicated unique records: 1,638,112, 87%) + dupsis.sas7bdat (duplicates: 250,635)
link.sas7bdat (1,173,404; eis: 27%, sis: 72%), dupeis2.sas7bdat (1270), dupsis2.sas7bdat (493)
eis.sas7bdat: 4,308,863 (28%)
sis.sas7bdat: 1,888,747 (74%)

Page 31: Deterministic Record Linking

Type of links

Exact match         Approx match (miss)   Freq     %        cum%
ssn, dob, fsound                          781094   66.57%   66.57%
ssn, fsound         dob                   52173    4.45%    71.01%
ssn                 dob, fsound           10959    0.93%    71.95%
ssn, lsound         fname (dob mismatch)  9320     0.79%    72.74%
ssn                 other                 7095     0.60%    73.35%
dob, fsound, lname  (ssn=.)               251124   21.40%   94.75%
dob, fsound         lname                 16189    1.38%    96.13%
dob, fsound, lname  ssn                   23653    2.02%    98.14%
dob, fsound, lname  (ssn mismatch)        15544    1.32%    99.47%
dob, fsound         other                 4398     0.37%    99.84%
fname, lname        other                 1855     0.16%    100.00%
TOTAL                                     1173404  100.00%

Page 32: Deterministic Record Linking

Type of duplicates and links

Type  EIS freq  EIS %    EIS cum %  SIS freq  SIS %    SIS cum %
DLD   3270      0.08%    0.08%      4345      0.23%    0.23%
DLX   8790      0.20%    0.28%      228039    12.07%   12.30%
DXX   19401     0.45%    0.73%      18251     0.97%    13.27%
PLD   3221      0.07%    0.80%      3221      0.17%    13.44%
PLX   8706      0.20%    1.01%      185066    9.80%    23.24%
PXX   19198     0.45%    1.45%      16929     0.90%    24.14%
XLD   185066    4.30%    5.75%      8706      0.46%    24.60%
XLX   976411    22.66%   28.41%     976411    51.70%   76.29%
XXX   3084800   71.59%   100.00%    447779    23.71%   100.00%
TOT   4308863   100.00%             1888747   100.00%

Page 33: Deterministic Record Linking

Number of Duplicates

EIS
dups  freq     sets     %        cum %
1     4246277  4246277  98.55%   98.55%
2     61600    30800    1.43%    99.98%
3     942      314      0.02%    100.00%
4     44       11       0.00%    100.00%
TOT   4308863  4277402  100.00%

SIS
dups  freq     sets     %        cum %
1     1432896  1432896  75.86%   75.86%
2     338251   169125   17.91%   93.77%
3     86379    28793    4.57%    98.35%
4     22928    5732     1.21%    99.56%
5     6020     1204     0.32%    99.88%
6     1662     277      0.09%    99.97%
7     497      71       0.03%    99.99%
8     96       12       0.01%    100.00%
9     18       2        0.00%    100.00%
TOT   1888747  1638112  100.00%

Page 34: Deterministic Record Linking

Implementation details

Ndob & name must be looped
– multiple matches
Too many matches on name
– use half of ssn
– Overlap for transpose

Page 35: Deterministic Record Linking

Basic Process

Unduplicate EIS (dupeis)
Unduplicate SIS (dupsis)
Link unduplicated EIS & SIS (link)
Generate unique full list of each set of ids (list)
– use linkage info
– Link in the duplicates
– eis & sis

Page 36: Deterministic Record Linking

Unduplication

Same as matching between different systems
Except, match the database to itself
– i.e. EIS to EIS, SIS to SIS
Randomly select one as Primary
– TODO: for those not linked using primary ID, try with duplicate ID
TODO: explore indirect duplicate links

Page 37: Deterministic Record Linking

Conclusion

Future work :
– indirect duplicates
– Link using duplicates

SSNs have been changed from real data

Page 38: Deterministic Record Linking

Thank You !

Page 39: Deterministic Record Linking

Type of id

first letter:
– P : primary id with duplicates
– D : duplicates (primary info given with prefix ‘l’)
– X : no duplicates
second letter: link status
– L: linked
– X: no linked id
third letter: duplicates status of the linked id
– D: duplicates exist for the linked id
– X: no duplicates for the linked id
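The three-letter code can be assembled from four yes/no facts about an id. The sketch below only illustrates the coding scheme. The function name `id_type` and its parameters are hypothetical.

```python
def id_type(has_duplicates: bool, is_primary: bool,
            linked: bool, linked_has_duplicates: bool) -> str:
    """Build the three-letter type code, e.g. "XLX" or "PLD"."""
    # First letter: P = primary id with duplicates, D = duplicate, X = none.
    first = ("P" if is_primary else "D") if has_duplicates else "X"
    # Second letter: L = linked, X = no linked id.
    second = "L" if linked else "X"
    # Third letter: D only when a linked id exists and it has duplicates.
    third = "D" if linked and linked_has_duplicates else "X"
    return first + second + third


print(id_type(False, False, True, False))
print(id_type(True, True, True, True))
```

Under this reading an unlinked id always ends in "XX", which gives the nine codes DLD, DLX, DXX, PLD, PLX, PXX, XLD, XLX and XXX.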

Page 40: Deterministic Record Linking

EIS & SIS Table

Unique full list of EIS (or SIS) ids
Type : type of id (XXX) – see the Type of id slide
All eis info have no prefix
All sis info have prefix ‘k’
Prefix ‘l’ is the link id info
freqeis & freqsis : # of duplicate ids
pindid (eis) & pkindid (sis) is the primary id
indid1-indid3 (eis) & kindid1-kindid8 (sis) : duplicate ids

Page 41: Deterministic Record Linking

Link type

sdiff : # digits different in ssn
– -1 : one or both ssn is missing
– 2 : two digits are transposed
– 10 : two digits are different but not transposed
ddiff : diff in dob
– -1 : one or both dob is missing
– 2 : date and month is transposed
– 3 : date, month and year are different
– 4 : date and month are different
fdiff (ldiff) : difference in first (last) name
– -1 : one or both are missing
– 1 : one letter difference (INDEL or REPL)
– 100 : one is a substring of the other
– 101 : one letter diff & substring

Page 42: Deterministic Record Linking

Duplicate type

If duplicate id
– Primary id info is given with prefix “l”
– Duplicate type : lsdiff, lddiff, lfdiff, & lldiff
If primary id
– # of duplicates : freqeis & freqsis
– Duplicate ids : indid1-indid3 (eis) & kindid1-kindid8 (sis)

Page 43: Deterministic Record Linking

Other tables

Link
– Linkage between the primary eis & sis ids
dupeis & dupsis
– List of duplicates with primary id

Page 44: Deterministic Record Linking

Data flow

eisid: 4,308,863
– ueis (4,277,402) + dupeis (31,461) : 99%
sisid: 1,888,747
– usis (1,638,112) + dupsis (250,635) : 87%
Link : 1,173,404 (eis: 27%, sis: 72%)
– dupeis2 (1,270) + dupsis2 (493)
EIS: 4,308,863 (28%)
SIS: 1,888,747 (74%)
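The counts and the first four percentages on this slide can be cross-checked with simple arithmetic. The denominators used below are an interpretation rather than something the slide states: the 99% and 87% are taken as unduplicated records over all ids, and the link percentages as links over unduplicated records. The variable names are chosen for the example.

```python
eisid, ueis, dupeis = 4_308_863, 4_277_402, 31_461
sisid, usis, dupsis = 1_888_747, 1_638_112, 250_635
link = 1_173_404


def pct(part: int, whole: int) -> int:
    """Percentage rounded to a whole number."""
    return round(100 * part / whole)


# Unduplicated records plus duplicates add up to the full id lists.
assert ueis + dupeis == eisid
assert usis + dupsis == sisid

print(pct(ueis, eisid))  # share of EIS ids that are unique records
print(pct(usis, sisid))  # share of SIS ids that are unique records
print(pct(link, ueis))   # share of unduplicated EIS records that are linked
print(pct(link, usis))   # share of unduplicated SIS records that are linked
```

With these denominators the four values come out as 99, 87, 27 and 72, which agrees with the slide.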