Method and system for processing data records

72
US007403942B1 (12) United States Patent (10) Patent N0.: US 7,403,942 B1 Bayliss (45) Date of Patent: Jul. 22, 2008 (54) METHOD AND SYSTEM FOR PROCESSING OTHER PUBLICATIONS DATA RECORDS Eike Schallehn et al., “Advanced Grouping and Aggregation for Data (75) Inventor; David Bayliss, Delray Beach, FL (U S) Integration,” Department of Computer Science, Paper ID: 222, pp. 1-16. (73) Assignee: Seisint, Inc., Boca Raton, FL (US) (Continued) ( * ) Notice: Subject to any disclaimer, the term of this _ _ patent is extended or adjusted under 35 Prlmary ExammericamYT Truong U_S_C_ 154(1)) by 375 days_ (74) Attorney, Agent, or FirmiHunton & Williams LLP (21) Appl. No.: 10/357,447 (57) ABSTRACT (22) Filed: Feb. 4, 2003 Disclosed herein are Various exemplary systems and methods (51) 7/30 2006 01 for linking entity references to entities and identifying asso ( ' ) ciations between entities. In particular, a method for identi (52) US. Cl. ................. .. 707/5; 707/3; 707/6; 707/100; . . . . 707/102_ 707 / 4 fy1ng an entrty from a plurahty of ent1ty references, each (58) Field of Classi?cation Search ................ .. 707/3e6 enmydrjerjilce belllnihnked W1th; separatefghost enP‘Y’ 15 See application ?le for complete search history. prof/1 e ' T e met 0 Comprises? e Steps 0 @mpanng an ent1ty reference of a ?rst ghost ent1ty W1th an entrty reference (56) References Cited of a second ghost entity to determine a match probability U.S. PATENT DOCUMENTS 4,543,630 A 9/1985 Neches 4,860,201 A 8/1989 Stolfo et a1. 4,870,568 A 9/1989 Kahle etal. 4,925,311 A 5/1990 Neches et a1. 5,006,978 A 4/1991 Neches 5,276,899 A 1/1994 Neches 5,303,383 A 4/1994 Neches et a1. 5,423,037 A 6/1995 Hvasshovd 5,471,622 A 11/1995 Eadline 5,495,606 A 2/1996 Borden et a1. 5,551,027 A 8/1996 Choyetal. 5,555,404 A 9/1996 Torbjyamsen et al. 5,655,080 A 8/1997 Dias et a1. 5,732,400 A 3/1998 Mandler et a1. 5,745,746 A 4/1998 Jhingran et a1. 5,878,408 A 3/1999 Van Huben et al. (Continued) betWeen the entity reference of the ?rst ghost entity and the entity reference of the second ghost entity, linking the entity reference of the ?rst ghost entity additionally With the second ghost entity and the entity reference of the second ghost entity additionally With the ?rst ghost entity When the match prob ability is greater than or equal to a match threshold and repeating the steps of comparing and linking for one or more ghost entity pairings possible from the ghost entities. The method further comprises determining, for one or more entity references linked to a ghost entity, a score for the entity reference based at least in part on a match probability betWeen the entity reference and a Value representing the one or more entity references linked to the ghost entity and identifying the ghost entity as an actual entity based at least in part on one or more scores for the one or more entity references linked to the ghost entity. 13 Claims, 31 Drawing Sheets For each pem'enlev ?eld army 1,. delermine Iolsl number (Cowl!) 0! same ?eld amrlas in master me 592 count Table .522 For each particular ?eld envy f", determine neniexi wsIgM we" EM. l WE i : ——. Cmmr + CHMIIDILYMSI Context weigh: Table 525 celeuieie probability (P) of match bewveen Entry Relerenues using cumext weiemcsn 5.05 Asmgn DlDs tn Enmy Relelenoes beenn en pmhablhty (P) am

Transcript of Method and system for processing data records

Page 1: Method and system for processing data records

US007403942B1

(12) United States Patent (10) Patent N0.: US 7,403,942 B1 Bayliss (45) Date of Patent: Jul. 22, 2008

(54) METHOD AND SYSTEM FOR PROCESSING OTHER PUBLICATIONS DATA RECORDS

Eike Schallehn et al., “Advanced Grouping and Aggregation for Data (75) Inventor; David Bayliss, Delray Beach, FL (U S) Integration,” Department of Computer Science, Paper ID: 222, pp.

1-16.

(73) Assignee: Seisint, Inc., Boca Raton, FL (US) (Continued)

( * ) Notice: Subject to any disclaimer, the term of this _ _ patent is extended or adjusted under 35 Prlmary ExammericamYT Truong U_S_C_ 154(1)) by 375 days_ (74) Attorney, Agent, or FirmiHunton & Williams LLP

(21) Appl. No.: 10/357,447 (57) ABSTRACT

(22) Filed: Feb. 4, 2003 Disclosed herein are Various exemplary systems and methods

(51) 7/30 2006 01 for linking entity references to entities and identifying asso ( ' ) ciations between entities. In particular, a method for identi

(52) US. Cl. ................. .. 707/5; 707/3; 707/6; 707/100; . . . . 707/102_ 707 / 4 fy1ng an entrty from a plurahty of ent1ty references, each

(58) Field of Classi?cation Search ................ .. 707/3e6 enmydrjerjilce belllnihnked W1th; separatefghost enP‘Y’ 15 See application ?le for complete search history. prof/1 e ' T e met 0 Comprises? e Steps 0 @mpanng an

ent1ty reference of a ?rst ghost ent1ty W1th an entrty reference (56) References Cited of a second ghost entity to determine a match probability

U.S. PATENT DOCUMENTS

4,543,630 A 9/1985 Neches 4,860,201 A 8/1989 Stolfo et a1. 4,870,568 A 9/1989 Kahle etal. 4,925,311 A 5/1990 Neches et a1. 5,006,978 A 4/1991 Neches 5,276,899 A 1/1994 Neches 5,303,383 A 4/1994 Neches et a1. 5,423,037 A 6/1995 Hvasshovd 5,471,622 A 11/1995 Eadline 5,495,606 A 2/1996 Borden et a1. 5,551,027 A 8/1996 Choyetal. 5,555,404 A 9/1996 Torbjyamsen et al. 5,655,080 A 8/1997 Dias et a1. 5,732,400 A 3/1998 Mandler et a1. 5,745,746 A 4/1998 Jhingran et a1. 5,878,408 A 3/1999 Van Huben et al.

(Continued)

betWeen the entity reference of the ?rst ghost entity and the entity reference of the second ghost entity, linking the entity reference of the ?rst ghost entity additionally With the second ghost entity and the entity reference of the second ghost entity additionally With the ?rst ghost entity When the match prob ability is greater than or equal to a match threshold and repeating the steps of comparing and linking for one or more ghost entity pairings possible from the ghost entities. The method further comprises determining, for one or more entity references linked to a ghost entity, a score for the entity reference based at least in part on a match probability betWeen the entity reference and a Value representing the one or more

entity references linked to the ghost entity and identifying the ghost entity as an actual entity based at least in part on one or more scores for the one or more entity references linked to the

ghost entity.

13 Claims, 31 Drawing Sheets

For each pem'enlev ?eld army 1,. delermine Iolsl number (Cowl!) 0! same ?eld amrlas in master me

592

count Table .522

For each particular ?eld envy f", determine neniexi wsIgM we"

EM.

l WE i : ——.

Cmmr + CHMIIDILYMSI

Context weigh: Table 525

celeuieie probability (P) of match bewveen Entry Relerenues using

cumext weiemcsn 5.05

Asmgn DlDs tn Enmy Relelenoes beenn en pmhablhty (P)

am

Page 2: Method and system for processing data records

US 7,403,942 B1 Page 2

US. PATENT DOCUMENTS

5,884,299 A 3/1999 Ramesh et al. 5,897,638 A 4/1999 Lasser et a1. 5,983,228 A 11/1999 Kobayashi et al. 6,006,249 A 12/1999 Leong 6,026,394 A 2/2000 Tsuchida et a1. 6,081,801 A 6/2000 Cochrane et al. 6,266,804 B1 7/2001 Isman 6,311,169 B2 10/2001 Duhon 6,427,148 B1 7/2002 Cossock 6,658,412 B1* 12/2003 Jenkins et a1. ............... .. 707/5

OTHER PUBLICATIONS

Vincent Coppola, “Killer APP,” Men’s Journal, vol. 12, No. 3, Apr. 2003, pp. 86-90. Eike Schallehn et al., “Extensible and Similarity-based Grouping for Data Integration,” Department of Computer Science, pp. 1-17. Rohit Ananthakrishna et al., “Eliminating Fuzzy Duplicates in Data Warehouses,” 12 pages. Peter Christen et al., “Parallel Computing Techniques for High-Per formance Probabilistic Record Linkage,” Data Mining Group, Aus tralian National University, Epidemiology and Surveillance Branch, Project web page: http://datamining.anu.edu.au/linkagehtml, 2002, pp. 1-1 1 .

Peter Christen et al., “Parallel Techniques for High-Performance Record Linkage (Data Matching),” Data Mining Group, Australian National University, Epidemiology and Surveillance Branch, Project web page: http://datamining.anu.edu.au/linkage.html, 2002, pp. 1-27. Peter Christen et al., “High-Performance Computing Techniques for Record Linkage,” Data Mining Group, Australian National Univer sity, Epidemiology and Surveillance Branch, Project web page: http://datamining.anu.edu.au/linkagehtml, 2002, pp. 1-14. William E. Winkler, “Matching And Record Linkage,” US. Bureau ofthe Census, pp. 1-38. Peter Christen et al., “High-Performance Computing Techniques for Record Linkage,” ANU Data Mining Group, Australian National University, Epidemiology and Surveillance Branch, Project web page: http://datamining.anu.edu.au/linkage.html, pp. 1-11. William E. Winkler, “The State of Record Linkage and Current Research Problems,” US. Bureau of the Census, 15 pages. William E. Winkler, “Advanced Methods For Record Linkage,” Bureau ofthe Census, pp. 1-21. William E. Winkler, Frequency-Based Matching in Fellegi-Sunter Model of Record Linkage, Bureau Of The Census Statistical Research Division, Oct. 4, 2000, 14 pages.

William E. Winkler, “State of Statistical Data Editing And Current Research Problems,” Bureau Of The Census Statistical Research Division, 10 pages. The First Open ETL/EAI Software For The Real-Time Enterprise, Sunopsis, A New Generation ETL Tool, “SunopsisTM v3 expedites integration between heterogeneous systems for Data Warehouse, Data Mining, Business Intelligence, and OLAP projects,” <www. suopsis.com>, 6 pages. Alan Dumas, “The ETL Market and SunopsisTM v3 Business Intel ligence, Data Warehouse & Datamart Projects,” 2002, Sunopsis, pp. 1-7. Teradata Warehouse Solutions, “Teradata Database Technical Over view,” 2002, pp. 1-7. WhiteCross White Paper, May 25, 2000, “wx/des-Technical Infor mation,” pp. 1-36. Teradata Alliance Solutions, “Teradata and Ab Initio,” pp. 1-2. Peter Christen et al., The Australian National University, “FebrliFreely extensible biomedical record linkage,” Oct. 2002, pp. 1-67. William E. Winkler, “Using the EM Algorithim for Weight Compu tation in the Fellegi-Sunter Model of Record Linkage,” Bureau Of The Census Statistical Research Division, Oct. 4, 2000, 12 pages. William E. Winkler et al., “An Application Of The Fellegi-Sunter Model Of Record Linkage To The 1990 US. Decennial Census,” US. Bureau of the Census, pp. 1-22. William E. Winkler, “Improved Decision Rules In The Fellegi-Sunter Model Of Record Linkage,” Bureau of the Census, pp. 1-13. Fritz Scheuren et al., “Recursive Merging and Analysis of Adminis trative Lists and Data,” US. Bureau of the Census, 9 pages. William E. Winkler, “Record Linkage Software and Methods for Merging Administrative Lists,” US. Bureau of the Census, Jul. 7, 2001, 11 pages. Enterprises, Publishing and Broadcasting Limited, Acxiom-Abilitec, pp. 44-45. TransUnion, Credit Reporting System, Oct. 9, 2002, 4 pages, <http:// www.transunion.com/content/pagejsp?id:/transunion/general/ data/business/BusCre...>. TransUnion, ID Veri?cation & Fraud Detection, Account Acquisi tion, Account Management, Collection & Location Services, Employment Screening, Risk Management, Automotive, Banking Savings & Loan, Credit Card Providers, Credit Unions, Energy & Utilities, Healthcare, Insurance, Investment, Real Estate, Telecom munications, Oct. 9, 2002, 46 pages, <http://www.transunion.com>. White Paper An Introduction to OLAP Multidimensional Terminol ogy and Technology, 20 pages.

* cited by examiner

Page 3: Method and system for processing data records

US. Patent Jul. 22, 2008 Sheet 1 0f 31 US 7,403,942 B1

Page 4: Method and system for processing data records

US. Patent Jul. 22, 2008 Sheet 2 0f 31 US 7,403,942 B1

Fig. 1B

146 144

Page 5: Method and system for processing data records

US. Patent Jul. 22, 2008 Sheet 3 0f 31 US 7,403,942 B1

Fig. 2

Prepare Raw Data ‘ (Preparation Phase)

102

M O O

V

Repeat for Iteration N Translate Data to Entity References lncomlng Data —> Ll (Link Phase)

.225

v

Determine Inter-Relationships Between Entities

(Association Phase) 2_0§

Perform One or More Queries Using Master

File

Page 6: Method and system for processing data records

US. Patent Jul. 22, 2008 Sheet 4 0f 31 US 7,403,942 B1

Format Raw Data into ‘——-—> Entity References

QQZ

- Join Entity References Preparatlon Phase (Master File)

-—-2O2 304

l Remove Duplicate Entity

References Repeat for Iteration N §Q_6_

& l

Fill In Null Field Values £18.

l Remove Junk Field

Values/Entries 31_0

Incoming Data ——-—-—>

Page 7: Method and system for processing data records

US. Patent Jul. 22, 2008 Sheet 5 0f 31 US 7,403,942 B1

Select Relevant Fields ——-—-—->

Q

Y

Measure Field Variance and Reset DlDs If

Necessary M

Link Phase l 2_04_ Fill In Null Field Values

4_0§

l . Repeat for iteration N Generate Ghost Entity

Incoming Data —> m References

m

Link Entity References M

l Transition Links

£2

l Append/Modify DlDs in

Master File 41_4

Page 8: Method and system for processing data records

US. Patent Jul. 22, 2008 Sheet 6 0f 31 US 7,403,942 B1

Fig. 5

Probability-Based Matching 5_0A

Content Weighting Field Weighting

Entity Reference A ‘y

Indication of a Compare Entity References ‘ Link Between

Q _

Entity Reference B Entity References

\

Context @

Ethnicity

Familial Nicknames/ Location Relationships Synonyms

Page 9: Method and system for processing data records

US. Patent Jul. 22, 2008 Sheet 7 0f 31 US 7,403,942 B1

For each particular ?eld entry f“, determine total number (Count) of

I same ?eld entries in master file

F l g . 6 502. Count I i (f, = f") then [,else 0]

Count Table 522

For each particular ?eld entry fn, determine context weight we‘i

.60_4

l w : m

C .

Count + Cautrousness

Context Weight ‘Fable

Calculate probability (P) of match between Entry References using

context weight(s) egg

P(erl : erZ) : ZFUWCJ * fi

i Assign DlDs to Entity References

based on probability (P) ?tl?i

Page 10: Method and system for processing data records

US. Patent

Select subset N of Entity Reference fields

Q2

l Next Entity Ref. A and

Entity Ref. B M

Jul. 22, 2008 Sheet 8 0f 31 US 7,403,942 B1

‘1 O

For each ?eld (X) of the subset: E A

Compare No .

Match A.fX with Bfx am

Match

l Add (A,B) to Match Table

M

Match Table m

Common DID transition using Match Table m

V

Adjust DlD of Affected Entity References in Master File

_7_1_4

Page 11: Method and system for processing data records

US. Patent

806

Jul. 22, 2008 Sheet 9 0f 31 US 7,403,942 B1

F | g. 8

808

u

804 802

Page 12: Method and system for processing data records

US. Patent Jul. 22, 2008 Sheet 10 0f 31 US 7,403,942 B1

V

Page 13: Method and system for processing data records

US. Patent Jul. 22, 2008 Sheet 11 0f 31

Fig. 10

Match Table m9.

lnner Join of Match Table with itself by left DID

1002

Expanded Match Table

1022

Inner Join of Expanded Match Table with itself from

right DID to left DID 1M

Transitive Closure Table 1024

Transition DlDs to lowest possible DID value

1006

US 7,403,942 B1

Page 14: Method and system for processing data records

US. Patent Jul. 22, 2008 Sheet 12 0f 31

Fig. 11

US 7,403,942 B1

Select subset N of data ?elds

1 102

i For each ?eld (X) of the subset, generate Field Unique Value Table

11_0i

l Cross-Produce Field

Unique Value Tables to generate Ghost Table w

Ghost Table 1128

Update Master File to Include Ghost Entity

References 1125i

Page 15: Method and system for processing data records

US. Patent Jul. 22, 2008 Sheet 13 0f 31 US 7,403,942 B1

Page 16: Method and system for processing data records

US. Patent Jul. 22, 2008 Sheet 14 0f 31 US 7,403,942 B1

Measure variance along each ?eld ‘axis’

1302 Fig. 13

Variance >

Threshold? 1304

Yes V

Reset DID of Each Entity Reference to its RID

1%

1 End 1i

Mark Entity References as having been ‘Broken’

1308

O ne or more

Entity Ref. Broken

Mark Entity References as suspect

1314

Page 17: Method and system for processing data records

US. Patent Jul. 22, 2008 Sheet 15 0f 31 US 7,403,942 B1

Fig. 14 Determine Degree of

Commonality ———> (Association) Between

Entities HQZ

Association Phase M

Mark Highly Associated Entities as Related

1404

Incoming Data Repeat fogltgration N

‘ + Generate Ghost Entity

References from Relations HE

Transitive Closure For Additional Associations

Between Entities M

Page 18: Method and system for processing data records

US. Patent Jul. 22,2008

Select subset N of Entity Reference ?elds

1502

i

Sheet 16 0f 31

Fig. 15

ls Entity C not Associated

Next Entity Ref. A and i For each ?eld (X) of the Entity Ref. B ‘ subset:

1504 I 1506

A

Entity Ref, B

Entity Ref. A

Compare Afx with Bfx

1505

No Match

Match

i Increase score of (CD)

pair in Score Table 1510

Score Table @2.

Score(C,D) with Entity o, 4—N° >=Threshold

Entity D not Associated 1512 with Entity C

Yes

Mark Entity C as Associate of Entity D,

Entity C 1514

Entity D as Associate of

US 7,403,942 B1

Page 19: Method and system for processing data records

US. Patent Jul. 22, 2008 Sheet 17 0f 31

Fig. 16

Relatives File 1620

Filter 1602

l Duplicate Records

1604

1 Inner Join by left DID

1606

i Set weight, separation,

and dedup values 1608

US 7,403,942 B1

Page 20: Method and system for processing data records

US. Patent

Match Table 1730

Jul. 22, 2008 Sheet 18 0f 31

Filter 1702 Fig. 17

l Duplicate Records

1704

Duplicate Match Table

1722

Inner Join duplicate match table with master

?le 1M

Outlier Reference Table 1724

Score DlDs using grading criteria

1708 <— Grading Criteria

V

Sum DID scores 1710

DID Score Table 1726

US 7,403,942 B1

Filter DID Score Table m

Obtain entity references of selected DlDs from

Outlier Reference Table w

Page 21: Method and system for processing data records
Page 22: Method and system for processing data records
Page 23: Method and system for processing data records
Page 24: Method and system for processing data records
Page 25: Method and system for processing data records
Page 26: Method and system for processing data records
Page 27: Method and system for processing data records
Page 28: Method and system for processing data records
Page 29: Method and system for processing data records
Page 30: Method and system for processing data records
Page 31: Method and system for processing data records
Page 32: Method and system for processing data records
Page 33: Method and system for processing data records
Page 34: Method and system for processing data records
Page 35: Method and system for processing data records
Page 36: Method and system for processing data records
Page 37: Method and system for processing data records
Page 38: Method and system for processing data records
Page 39: Method and system for processing data records
Page 40: Method and system for processing data records
Page 41: Method and system for processing data records
Page 42: Method and system for processing data records
Page 43: Method and system for processing data records
Page 44: Method and system for processing data records
Page 45: Method and system for processing data records
Page 46: Method and system for processing data records
Page 47: Method and system for processing data records
Page 48: Method and system for processing data records
Page 49: Method and system for processing data records
Page 50: Method and system for processing data records
Page 51: Method and system for processing data records
Page 52: Method and system for processing data records
Page 53: Method and system for processing data records
Page 54: Method and system for processing data records
Page 55: Method and system for processing data records
Page 56: Method and system for processing data records
Page 57: Method and system for processing data records
Page 58: Method and system for processing data records
Page 59: Method and system for processing data records
Page 60: Method and system for processing data records
Page 61: Method and system for processing data records
Page 62: Method and system for processing data records
Page 63: Method and system for processing data records
Page 64: Method and system for processing data records
Page 65: Method and system for processing data records
Page 66: Method and system for processing data records
Page 67: Method and system for processing data records
Page 68: Method and system for processing data records
Page 69: Method and system for processing data records
Page 70: Method and system for processing data records
Page 71: Method and system for processing data records
Page 72: Method and system for processing data records