
Entity Resolution

David Inouye
Georgia Institute of Technology

2011 DIMACS REU Intern at Rutgers University
William M. Pottenger, Ph.D., Mentor

* The content of this presentation has been adapted from a presentation given by Nir Grinberg.

06/07/2011 1

Introduction to Entity Resolution

Entity resolution is the problem of deciding whether two sets of data elements refer to the same real-world entity.

[Figure: Elements from Source 1 and Elements from Source 2, with question marks between them]


Objective/Approach

Standardize and Encode
Calculate Similarity Scores
Classify Using Ground Truth Data

Incidents in GTD*
Month: 6, Day: 28, Year: 2005, City: Dardsun, Kupwara, Type: Arson

Incidents in WITS*
Date: 06/27/2005, City: Kupwara, Type: Fire attack

* WITS - https://wits.nctc.gov/; GTD - http://www.start.umd.edu/gtd/

Phase 1: Standardize and Encode

WITS
Incident_ID  Date       City      State_Prov         Country
40426        12/3/06    Udhampur  Jammu and Kashmir  India
15649        6/27/2005  Kupwara   Jammu and Kashmir  India

GTD
Eventid       Iyear  Imonth  Iday  City             Provstate                country
200404140003  2004   4       14    Patna            Bihar                    India
200506280004  2005   6       28    Dardsun Kupwara  Jammu & Kashmir (State)  India

Phase 1: Standardize and Encode

Standardize dates
Map WITS weapon types to GTD weapon types
GeoCode location to latitude and longitude
Extract topic model distribution using LDA
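The date standardization step could look roughly like the sketch below. It assumes WITS dates arrive as month/day/year strings with either a two-digit or a four-digit year (as in the sample values "12/3/06" and "6/27/2005") and that GTD dates arrive as separate year, month, and day integers. The function names are hypothetical and are not taken from the original project.

```python
from datetime import date, datetime

# Assumed WITS layouts, inferred from sample values "6/27/2005" and "12/3/06".
WITS_FORMATS = ("%m/%d/%Y", "%m/%d/%y")


def parse_wits_date(text: str) -> date:
    """Parse a WITS date string, trying the four-digit-year layout first."""
    for fmt in WITS_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized WITS date: {text!r}")


def parse_gtd_date(iyear: int, imonth: int, iday: int) -> date:
    """Combine the GTD Iyear, Imonth, and Iday columns into one date."""
    return date(iyear, imonth, iday)


wits = parse_wits_date("6/27/2005")
gtd = parse_gtd_date(2005, 6, 28)
# Once both sources share one representation, the gap in days is easy to compute.
print(wits.isoformat(), gtd.isoformat(), abs((gtd - wits).days))
# prints: 2005-06-27 2005-06-28 1
```

Converting both sources to the same date type before comparison keeps the later similarity step independent of each database's column layout.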

Phase 1: Latent Dirichlet Allocation

Generative probabilistic model
Assumes topics are probability distributions over words
Assumes documents are probability distributions over topics

[Figure: word probabilities (Money, Loan, Bank, River, Stream) for Topic 1 and Topic 2]
[Figure: topic proportions (Topic 1, Topic 2) for Doc 1, Doc 2, and Doc 3]
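The two assumptions above can be illustrated with a small sketch of the generative process: draw a topic mixture for the document, then for each word draw a topic from that mixture and a word from that topic. The word probabilities and the Dirichlet parameter below are made-up illustrative values, not the ones plotted on the slide, and the sketch covers only generation, not the inference step that recovers topics from real text.

```python
import random

# Illustrative topics: each is a probability distribution over words (assumed values).
TOPICS = {
    "Topic 1": {"money": 0.4, "loan": 0.3, "bank": 0.3},
    "Topic 2": {"river": 0.4, "stream": 0.3, "bank": 0.3},
}


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha, rng):
    """Generate one document: a topic mixture plus n_words sampled words."""
    names = list(TOPICS)
    theta = sample_dirichlet(alpha, rng)  # the document's distribution over topics
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=theta)[0]
        dist = TOPICS[topic]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return theta, words


rng = random.Random(0)
theta, words = generate_document(10, alpha=[1.0, 1.0], rng=rng)
print([round(t, 2) for t in theta], words)
```

Note that "bank" appears in both topics, so the same word can be generated by either one; which topic it came from depends on the document's mixture.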

Phase 1: LDA Example

* Example from “Probabilistic Topic Models” by Mark Steyvers. http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf

Phase 1: Latent “Topics” (most probable words in topic)

killed, kashmir, attack, injured, militants, suspected, blast, kill, bomb
fired, upon, armed, killed, manipur, civilian, imphal, member, former
civilian, kashmir, jammu, night, residence, kidnapped, one, village, doda
police, one, killing, wounding, officers, two, officer, others, injuring
jammu, kashmir, baramula, security, one, armed, anantnag, hizbul, mujahedin
assam, explosive, front, improvised, device, liberation, united, ied, ulfa
widely, two, civilians, national, tripura, kidnapped, three, village, karbi
causing, injuries, damage, damaging, fire, station, set, detonated, train
maoist, party, communist, cpi, widely, pradesh, andhra, chhattisgarh, village
grenade, threw, civilians, wounding, srinagar, vehicle, two, kashmir, jammu


Phase 2: Compute Similarity

Dates
    05/23/2001 vs. 05/22/2001
Nominal strings such as country or city
    “Jammu” vs. “Jammuu”
GeoLocation
    Lat 32.8/Long 74.7 vs. Lat 32.27/Long 75.6
Topic distribution
    [Figure: two bar charts over Topic 1 through Topic 4, vertical axis from 0 to 0.4]
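The slides do not say which similarity measures were used, so the sketch below picks one common option per field type purely as an assumption: a linear day-gap score for dates, normalized Levenshtein distance for nominal strings, great-circle (haversine) distance for locations, and cosine similarity for topic distributions. The `max_days` cutoff is likewise a hypothetical parameter.

```python
import math
from datetime import date


def date_similarity(a: date, b: date, max_days: int = 30) -> float:
    """1.0 for the same day, falling linearly to 0.0 at max_days apart."""
    return max(0.0, 1.0 - abs((a - b).days) / max_days)


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def string_similarity(s: str, t: str) -> float:
    """Edit distance rescaled to [0, 1], ignoring case."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(s.lower(), t.lower()) / longest


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def cosine_similarity(p, q):
    """Cosine of the angle between two topic-distribution vectors."""
    dot = sum(x * y for x, y in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0


# The example pairs from the slide.
print(date_similarity(date(2001, 5, 23), date(2001, 5, 22)))
print(string_similarity("Jammu", "Jammuu"))
print(haversine_km(32.8, 74.7, 32.27, 75.6))
```

Each function returns a single number per field, so a candidate pair of incidents can be summarized as a short vector of scores for the classification phase.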

Phase 3: Classify as Match/Non-match

[Diagram: Similarity Scores -> Classifier (Model Based on Ground Truth*) -> Match or Non-match]

* The Center for the Study of Terrorism and Responses to Terrorism (START) at the University of Maryland provided the human-annotated ground truth data.
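The flow from similarity scores to a match decision can be sketched with any supervised classifier. The slides do not name the one used, so the example below substitutes a small logistic regression trained by stochastic gradient descent. The training pairs, labels, learning rate, and epoch count are all invented for illustration and are not the START ground truth data.

```python
import math


def train_logistic(features, labels, lr=0.5, epochs=500):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of a match
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict(w, b, x):
    """Return True (match) when the predicted probability is at least 0.5."""
    return b + sum(wi * xi for wi, xi in zip(w, x)) >= 0.0


# Hypothetical labelled pairs: [date similarity, city-name similarity].
X = [[0.97, 0.83], [0.90, 1.00], [1.00, 0.90],
     [0.10, 0.20], [0.30, 0.10], [0.20, 0.40]]
y = [1, 1, 1, 0, 0, 0]

w, b = train_logistic(X, y)
print(predict(w, b, [0.95, 0.90]), predict(w, b, [0.15, 0.20]))
```

In practice the feature vector would hold one score per compared field, and the labels would come from the human-annotated pairs.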

Phase 3: Classifier Results

                   Classified
                   Non-match  Match
Class  Non-match   9875       511
       Match       116        246

Accuracy: 0.94
Precision: 0.32
Recall: 0.68
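The three reported metrics follow directly from the confusion matrix, as the short check below shows. It treats "Match" as the positive class; the function and variable names are illustrative.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Return (accuracy, precision, recall) with Match as the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total   # share of all pairs classified correctly
    precision = tp / (tp + fp)     # share of predicted matches that are true matches
    recall = tp / (tp + fn)        # share of true matches that were found
    return accuracy, precision, recall


# Counts from the results table: rows are the true class, columns the classified class.
accuracy, precision, recall = confusion_metrics(tn=9875, fp=511, fn=116, tp=246)
print(f"{accuracy:.2f} {precision:.2f} {recall:.2f}")
# prints: 0.94 0.32 0.68
```

Because non-matches outnumber matches by a wide margin, accuracy stays high even though only about a third of the pairs classified as matches are true matches.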

My research possibilities

Clean up the ground truth data
Improve upon the HO-LDA algorithm
Consider how to compute different similarity scores

Q&A

Thank you!
