A framework of identity resolution: evaluating identity ... · rule-based and machine learning...

12
RESEARCH Open Access A framework of identity resolution: evaluating identity attributes and matching algorithms Jiexun Li 1 and Alan G. Wang 2* Abstract Duplicate and false identity records are quite common in identity management systems due to unintentional errors or intentional deceptions. Identity resolution is to uncover identity records that are co-referent to the same real-world individual. In this paper we introduce a framework of identity resolution that covers different identity attributes and matching algorithms. Guided by social identity theories, we define three types of identity cues, namely personal identity attributes, social behavior attributes, and social relationship attributes. We also compare three matching algorithms: pair-wise comparison, transitive closure, and collective clustering. Our experiments using synthetic and real-world data demonstrate the importance of social behavior and relationship attributes for identity resolution. In particular, a collective identity resolution technique, which captures all three types of identity attributes and makes matching decisions on identities collectively, is shown to achieve the best performance among all approaches. Keywords: Identity resolution; Social behavior; Social relationship; Collective clustering Introduction The world is moving away from paper-based documents to electronic records. Due to the ease of generating iden- tity records and lack of sufficient verification or valid- ation during data entry processes, duplicate and false identity records become quite common in electronic sys- tems. In many practices of identity management, espe- cially those that require integrating multiple data sources, it is often inevitable and tedious to deal with the problem of identity duplication. Particularly, finding an effective solution to this problem is extremely critical in fighting crime and terrorism to enforce national se- curity. Criminals and terrorists often assume fake iden- tities using either fraudulent or legitimate means so as to hide their true identity. In a number of cases docu- mented by government reports, terrorists in different countries are known to commit identity crimes, such as falsifying passports and baptismal certificates, to facili- tate their financial operations and execution of attacks, either in the real world or in the cyber space [1, 2]. The problem of an individual having multiple identities can easily mislead intelligence and law enforcement investigators [3]. Therefore, to determine whether an individual is who they claim to be is essential in the mission of counterterrorism to identify potential ter- rorists and prevent terrorism acts from occurring [4]. Identity resolution is a process of semantic reconcili- ation that determines whether a single identity is the same when being described differently [5]. The goal of resolution is to detect duplicate identity records that refer to the same individual in the real world. Over the years, researchers in the areas of database and statistics have proposed many different techniques to tackle this problem. Traditional resolution techniques rely on key attributes such as identification numbers, names and date-of-birth to detect matches. These attributes are commonly used for they are simple describers of an in- dividual and often available in most record management systems [6, 7]. However, such personal identity attributes also vary in terms of availability and reliability across dif- ferent systems. These attributes are not always accurate due to various reasons such as unintentional entry errors and intentional deception [6]. In the context of cyberse- cruity, it is much easier and more common for criminals to fake identities to cover their traces. In order to im- prove resolution accuracy, several recently proposed resolution techniques have taken into account social * Correspondence: [email protected] 2 Pamplin College of Business, Virginia Tech, Blacksburg, VA 24061, USA Full list of author information is available at the end of the article © 2015 Li and Wang. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. Li and Wang Security Informatics (2015) 4:6 DOI 10.1186/s13388-015-0021-0

Transcript of A framework of identity resolution: evaluating identity ... · rule-based and machine learning...

Page 1: A framework of identity resolution: evaluating identity ... · rule-based and machine learning approaches. There have been several rule-based identity resolution approaches based

Li and Wang Security Informatics (2015) 4:6 DOI 10.1186/s13388-015-0021-0

RESEARCH Open Access

A framework of identity resolution:evaluating identity attributes and matchingalgorithms

Jiexun Li1 and Alan G. Wang2*

Abstract

Duplicate and false identity records are quite common in identity management systems due to unintentional errors orintentional deceptions. Identity resolution is to uncover identity records that are co-referent to the same real-worldindividual. In this paper we introduce a framework of identity resolution that covers different identity attributes andmatching algorithms. Guided by social identity theories, we define three types of identity cues, namely personalidentity attributes, social behavior attributes, and social relationship attributes. We also compare three matchingalgorithms: pair-wise comparison, transitive closure, and collective clustering. Our experiments using synthetic andreal-world data demonstrate the importance of social behavior and relationship attributes for identity resolution. Inparticular, a collective identity resolution technique, which captures all three types of identity attributes and makesmatching decisions on identities collectively, is shown to achieve the best performance among all approaches.

Keywords: Identity resolution; Social behavior; Social relationship; Collective clustering

IntroductionThe world is moving away from paper-based documentsto electronic records. Due to the ease of generating iden-tity records and lack of sufficient verification or valid-ation during data entry processes, duplicate and falseidentity records become quite common in electronic sys-tems. In many practices of identity management, espe-cially those that require integrating multiple datasources, it is often inevitable and tedious to deal withthe problem of identity duplication. Particularly, findingan effective solution to this problem is extremely criticalin fighting crime and terrorism to enforce national se-curity. Criminals and terrorists often assume fake iden-tities using either fraudulent or legitimate means so asto hide their true identity. In a number of cases docu-mented by government reports, terrorists in differentcountries are known to commit identity crimes, such asfalsifying passports and baptismal certificates, to facili-tate their financial operations and execution of attacks,either in the real world or in the cyber space [1, 2]. Theproblem of an individual having multiple identitiescan easily mislead intelligence and law enforcement

* Correspondence: [email protected] College of Business, Virginia Tech, Blacksburg, VA 24061, USAFull list of author information is available at the end of the article

© 2015 Li and Wang. This is an Open Access a(http://creativecommons.org/licenses/by/4.0), wprovided the original work is properly credited

investigators [3]. Therefore, to determine whether anindividual is who they claim to be is essential in themission of counterterrorism to identify potential ter-rorists and prevent terrorism acts from occurring [4].Identity resolution is a process of semantic reconcili-

ation that determines whether a single identity is thesame when being described differently [5]. The goal ofresolution is to detect duplicate identity records thatrefer to the same individual in the real world. Over theyears, researchers in the areas of database and statisticshave proposed many different techniques to tackle thisproblem. Traditional resolution techniques rely on keyattributes such as identification numbers, names anddate-of-birth to detect matches. These attributes arecommonly used for they are simple describers of an in-dividual and often available in most record managementsystems [6, 7]. However, such personal identity attributesalso vary in terms of availability and reliability across dif-ferent systems. These attributes are not always accuratedue to various reasons such as unintentional entry errorsand intentional deception [6]. In the context of cyberse-cruity, it is much easier and more common for criminalsto fake identities to cover their traces. In order to im-prove resolution accuracy, several recently proposedresolution techniques have taken into account social

rticle distributed under the terms of the Creative Commons Attribution Licensehich permits unrestricted use, distribution, and reproduction in any medium,.

Page 2: A framework of identity resolution: evaluating identity ... · rule-based and machine learning approaches. There have been several rule-based identity resolution approaches based

Li and Wang Security Informatics (2015) 4:6 Page 2 of 12

contextual information, in addition to traditionally useddescriptive identity attributes, as new evidence for iden-tity matching [5, 8, 9]. Social contextual information,such as employment history, credit history and friend-ship networks, has shown to be effective in identity reso-lution because it is very difficult to be manipulated [10].A review of related work leads us to believe that, al-

though a variety of identity resolution techniques havebeen developed, there is not a comprehensive frameworkthat covers various types of identity attributes andmatching algorithms. This leads to the main goal of ourstudy. In this paper, we provide an overview of variousidentity attributes that are useful to computational iden-tity resolution. We develop an identity resolution frame-work that comprehensively considers both profile-basedpersonal identity attributes and social identity attributes,especially with a focus on the latter. Our framework alsoconsiders different computational algorithms for match-ing identities, ranging from simple pairwise matching tocollective clustering. Compared to existing identity reso-lution research, our proposed framework has the follow-ing advantages. First, the framework is generic andapplicable to most record management system withoutintroducing an overhaul in the database schema. Second,the integration of both personal and social attributesprovides complementary and convincing evidence foridentity resolution. Third, the collective resolution tech-nique makes joint decisions on matching identities andtherefore leads to more accurate results.The remainder of the paper is organized as follows.

We first review identity theories and existing resolutiontechniques. Next, we introduce our proposed identityresolution framework including several matching tech-niques based on various attributes and algorithms. Wereport a comparative study of several resolution tech-niques in the framework using a synthetic dataset and areal-world dataset. Finally, we conclude the paper with asummary and a discussion on future research directions.

Literature reviewIn this section, we review existing identity resolution ap-proaches, with a focus on the identity attributes andmatching techniques adopted.

Entity resolution and identity resolutionIdentity resolution is a special type of entity reso-lution that specializes in identity management. Entityresolution is also known as record linkage and dedu-plication in the areas of statistics and database man-agement. Record linkage, originated in the statisticscommunity, is used to identify those records in one ormultiple datasets that refer to the same real-world en-tity [11]. The very same task is often called and stud-ied as record deduplication in database and artificial

intelligence communities [12–14]. Given a number ofrecords that are comprised of multiple fields, thesetechniques determine whether two records match bycomparing the values in corresponding fields. A goodsurvey on individual field matching models for recorddeduplication can be found in [15].Entity resolution techniques can be extended to differ-

ent contexts, e.g., identity resolution in identity manage-ment. However, as a special type of entity resolution,identity resolution can be very complex due to the spe-cial data characteristics of identity records. First, identityresolution, especially in the intelligence and law enforce-ment communities, often suffers greatly from the miss-ing data problem [7]. Missing values, if present in manyfields of a record, can present a big challenge for identityresolution techniques. Second, identity resolution needsto handle not only duplicates caused by entry errors ordata ambiguities but also intentional identity fraud anddeception, which tend to be hidden and concealed.Third, identity resolution techniques may need to be ad-justable to different evaluation criteria. For instance,false positives may be less tolerable than false negativesfor identity authentication that grants access to a criticalfacility. In contrast, a high false positive rate may not bea big concern when a detective searches for recordsrelated to a crime suspect with limited information.Therefore, accurate identity resolution requires a carefuldesign that considers the special characteristics of iden-tity records.

Identity attributesBased on the identity theories from the social science lit-erature, an individual’s identity is considered to have twobasic components, namely a personal identity and a so-cial identity. A personal identity is one’s self-perceptionas an individual, whereas a social identity is one’s bio-graphical history that builds up over time [16]. In par-ticular, one’s personal identity may include personalinformation given at birth (e.g., name, date and place ofbirth), personal identifiers (e.g., social security number),physical descriptions (e.g., height, weight), and biometricinformation (e.g., fingerprint, DNA). In contrast, a socialidentity is concerned with one’s existence in a socialcontext. Social identity theories consist of psychologicaland sociological views. The psychological view defines asocial identity as one’s self-perception as a member of cer-tain social groups such as nation, culture, gender identifi-cation, and employment [17, 18]. The sociological viewfocuses on “the relationships between social actors whoperform mutually complementary roles (e.g., employer-employee, doctor-patient)” [19]. While psychological viewdeals with large-scale groups, the sociological view em-phasizes the role-based interpersonal relationships amongpeople [20]. These two views combined together provide a

Page 3: A framework of identity resolution: evaluating identity ... · rule-based and machine learning approaches. There have been several rule-based identity resolution approaches based

Li and Wang Security Informatics (2015) 4:6 Page 3 of 12

more complete concept for understanding a social identityat levels of social context.Traditional identity resolution methods primarily rely

on personal identity attributes such as name, gender,date of birth, and identification numbers mostly becausethey are commonly available as identifiers in recordmanagement systems. These attributes, however, maysuffer from data quality problems such as uninten-tional errors [21], intentional deception [6], and miss-ing data [7]. Biometric features such as fingerprintsand DNA also belong to the category of personal at-tributes. Although they are considered as more reli-able, they are not available or accessible due to issuessuch as high costs and confidentiality in most sys-tems. A study conducted by the United KingdomHome office [22] suggests that identity crimes usuallyinvolve the illegal use or alteration of those personalidentity components. The low data quality in fact-based personal attributes can severely jeopardize theperformance of identity resolution [7].Individuals are not isolated but interconnected to each

another in a society. The social context associated withan individual can be clues that reveal his or her undeni-able identity. Recognizing the limitations of personal at-tributes, many recent studies have started exploitingsocial context information such as social behaviorsand relationships for identity resolution. For example,Ananthakrishna et al. [23] introduced a method thateliminates duplicates in data warehouses using a dimen-sional hierarchy (e.g., city-state-country) over the link rela-tions. This method scales up the ability of the matchingtechnique by only comparing attribute values that havethe same foreign key dependency. For example, the simi-larity of two identity records will be computed only whenthey both live in the same city. Kalashnikov et al. [24] in-corporated co-affiliation and co-authorship relationshipsinto an resolution model for reference disambiguation.Köpcke and Rahm [25] categorized entity resolutionapproaches into attribute value matchers and contextmatchers. While value matchers solely rely on descriptiveattributes, context matchers consider information inferredfrom social interactions represented as linkages in agraph.

Resolution techniquesWe can categorize existing resolution techniques intorule-based and machine learning approaches. There havebeen several rule-based identity resolution approachesbased on matching rules encoded by domain experts.For instance, to integrate cross-jurisdictional criminalrecords, a simple rule can be: two identity records matchonly if their first name, last name, and date-of-birth(DOB) values are all identical [26]. Such exact-matchheuristics tend to have high specificity but low sensitivity

in detecting true matches, especially when data qualityproblems such as missing values, entry errors and de-ceptions are present. Hence, a good resolution techniquemust support partial-match as well to reduce false nega-tives. IBM’s InfoSphere Identity Insight is a leading com-mercial software platform for entity resolution andanalysis. It provides an identity analytics solution with aset of rules predefined by human experts as well as so-phisticated algorithms. For example, given two identityrecords with identical dates of birth and last names, thesystem will resolve them into one if the matching scoreof their first names is above a threshold. The mostcritical asset and the biggest challenge of a rule-basedsystem is the creation of the rule set. Creating a com-pressive rule sets can be highly time-consuming and ex-pensive. Furthermore, rules could be domain-dependentand not portable across different contexts.As an alternative to manual rule coding, machine

learning can automatically extract patterns from anno-tated training data with annotated matching pairs andbuild resolution models for new identity records. Givena pair of identity records, distance/similarity measuresare defined for different descriptive attributes and thencombined into an overall score. If the overall distance(or similarity) score is below (or above) a pre-definedthreshold, then the pair should be regarded as a match.Brown and Hagen [27] proposed a data associationmethod for linking criminal records that possibly referto the same suspect. This method compares two recordsand calculates a total similarity measure as a weighted-sum of the similarity measures of all corresponding fea-ture values. Similarly, Wang et al. [6] proposed a recordlinkage algorithm for detecting deceptive identities bycomparing four personal attributes (name, DOB, socialsecurity number, and address) and combining them intoan overall similarity score. A supervised learning processdetermines a threshold for match decisions using a setof identity pairs labeled by an expert. However, thesemethods based on a limited number of descriptive attri-butes tend to fail if one or more of these attributes con-tains missing values [7]. Rather than simply labeling anidentity pair as match or non-match, probabilisticmethods for identity resolution root in the seminal workof [11]. By posing record linkage as a probabilistic classi-fication problem, they propose a formal framework topredict the likelihood of matching identity pairs basedon the agreement among attributes. Assuming condi-tional independence among features given the matchclass, the framework estimates the probabilistic parame-ters of the record linkage model in an unsupervisedfashion. Built upon this work, many later studies followand extend this framework by enriching the probabilis-tic model [7, 28–30]. These studies have shown thatthe probabilistic models achieve good performance for

Page 4: A framework of identity resolution: evaluating identity ... · rule-based and machine learning approaches. There have been several rule-based identity resolution approaches based

Li and Wang Security Informatics (2015) 4:6 Page 4 of 12

identity matching. However, the parameters of the prob-abilistic models may not be accurately estimated in theabsence of sufficient training data [31].More machine learning based techniques have been

proposed for the more general problem of entity reso-lution. Culotta and McCallum [32] constructed a condi-tional random field model (CRF) for record deduplicationthat captures inter-dependencies between different typesof entities. This method, however, fails to model the expli-cit links among the same type of entities. Pasula et al. [9]proposed a citation matching approach based on probabil-istic relational model (PRM), which is built upon depen-dences among entities in a relational database schemathrough foreign key relationships. Li et al. [3] developed asystematic approach to deriving social behavior and socialrelationship features for identity matching based on adatabase schema and PRMs. This approach can be used asa plug-and-play solution for most relational database-based record management systems. For less structuredcollections of record data (e.g., criminal reports, news arti-cles), a lot of efforts for information extraction and trans-formation will be needed. Bhattacharya and Getoor [33]proposed a graph-based method for entity resolution. Itdefines a similarity measure that combines corre-sponding attribute similarities with graph-based rela-tional similarity between each entity reference pair.Furthermore, Bhattacharya and Getoor [8] extendedtheir relational resolution approach and introduced acollective entity resolution algorithm. Rather thansimply making pair-wise entity comparisons, theirmethod can derive new social information from deci-sions already made and incorporate it into furtherresolution process iteratively. More recently, in orderto tackle the problem of one individual having severalprofiles on different social media platforms (e.g., Face-book and Twitter), researchers have developed tech-niques for matching user profiles. Bartunov et al. [34]developed a CRF-based approach that combines twouser graphs using both user profile attributes and so-cial linkages. All these studies have demonstratedthat social information, when incorporated in match-ing algorithms, can improve the performance foridentity resolution.

Research gapAccording to our literature review, many researchershave recognized the limitations of traditional identityresolution techniques that solely rely on personal iden-tity attributes such as name and DOB. Although theseprofile-based identity attributes are available in mostrecord management systems, they are subject to dataentry errors, deception, and fraud [6]. Hence, tech-niques based on such attributes would fail drastically inrecognizing the unconventional truth like the synonym

between Osama bin Laden and The Prince [35]. Severalrecent studies have demonstrated the benefits of util-izing social contextual information in identity reso-lution [3, 36]. However, most studies lack the guidanceof identity theories to construct and examine differenttypes of social attributes for identity resolution. Identitytheories suggest that personal and social identity maycomplement each other for the purpose of identityresolution. The social aspect of identity reflects one’spsychological and sociological perception of his/herown identity. Social identity attributes might be morereliable than personal profiles in that they cannot beeasily altered or falsified by an individual. Furthermore,existing resolution techniques mainly employ pair-wisecomparison [6, 27, 37, 38] when finding matching iden-tity records. Entity resolution studies have shown thatresolution accuracy can improve significantly if match-ing of related identity references are performed in a col-lective fashion [8]. The effectiveness of a collectiveapproach in the context of identity resolution is yet tobe examined. Built upon our previous work in [39], thisstudy is to develop a comprehensive identity resolutionframework by examining a variety of identity attributesand evaluating different matching strategies.

A framework of identity resolutionIn this study, we aim to contribute to the field of identityresolution from the following three aspects:

� Developing a comprehensive framework of identityresolution that covers both the usage of identityattributes and matching algorithms;

� Examining the predictive powers of personal identityattributes and social identity features for resolution;and

� Evaluating different matching algorithms for identityresolution.

Problem definitionWe define the identity resolution problem as follows.Given a set of identity references, R = {ri}, where eachreference r is described by a set of attributes r.A1, r.A2, …,r.Al. These references correspond to a set of unknown in-dividual E = {ei}. Due to certain reasons (e.g., entry errors,duplicates or deceptions), multiple references may be co-referent to the same underlying individual e. We use r.E torefer to the individual to which reference r refers to. Eachreference r may have participated in some incident(s) (e.g.,a financial transaction, a criminal misconduct, a terroristattack) and each incident may involve one or multiple ref-erences. We use a set of hyper-edges H = {hi} to repre-sent all incidents. Each incident is a hyper-edge thatconnects multiple references. Each hyper-edge h canalso be described by a set of attributes h.B1, h.B2, …,

Page 5: A framework of identity resolution: evaluating identity ... · rule-based and machine learning approaches. There have been several rule-based identity resolution approaches based

Li and Wang Security Informatics (2015) 4:6 Page 5 of 12

h.Bm. We use h.R to denote the set of references in-volved in h. We use r.H to denote the set of hyper-edges in which r participates. In addition, the partici-pation of r in h can be described by a set of attri-butes r.h.C1, r.h.C2, …, r.h.Cn. The objective is touncover the hidden set of individuals E = {ei} and theentity labels r.E for all references given the observa-tions of the references and their involved hyper-edges.Figure 1 illustrates the problem of identity reso-

lution in a graph of five reference nodes connectedby three hyper-edges. In this graph, two nodes (refer-ences) r1 and r2 are in fact co-referent to the sameperson; r4 and r5 are co-referent to another person.Our goal is to uncover the underlying matches fromthe graph. In addition to simply comparing the node-level attributes of references, we should also considerthe attributes related to their involved incidents, rep-resented by hyper-edges. For example, if h1 and h2share some similar patterns (e.g., incident type, timeof day), or if r1 and r2 have similar patterns (e.g., role,activeness) in their participation in h1 and h2 respect-ively, these should be considered as supporting evi-dence for r1 and r2 being co-referent to the sameindividual. Furthermore, Moreover, the commonneighbor r3 shared by r1 and r2 can also indicate po-tential linkage between the pair. Furthermore, once r1and r2 are determined to be co-referent, the joiningof these two can bring in new contextual informationfor inferring the matching relationship between otheridentity references (e.g., r4 and r5). We will discusshow to mathematically represent and compute such

Fig. 1 A graphic view of the identity resolution problem

evidence in the rest of Section “A framework of iden-tity resolution”.To solve this problem, we develop a framework of

identity resolution by considering two aspects: identityattributes and matching algorithms. Identity attributesare features that can be used to describe certain distinct-ive characteristics of identity references. A matchingalgorithm defines the computational procedure ofrecovering the matching entities from the given set ofreferences. In this section we first define various types ofidentity attributes that can be used in identity resolution.We also discuss how we calculate the similarity for eachtype of identity attributes. Lastly, we introduce severalmatching algorithms that match identity referencesbased on attribute similarities.

Identity attributesTo determine whether two references are co-referent,we need to compare their common attributes and calcu-late corresponding similarity scores for these attributes.Given our problem definition, we divide all identity attri-butes into three categories and describe their similaritymeasures as follows.

Personal identity attributesPersonal identity attributes are identifiers that arecommonly used in record management systems todistinguish one person from others. In our problemdefinition, each reference r is represented by a nodein the graph. Personal identity attributes are thenode-level attributes, r.A1, r.A2, …, r.Al, Examples

Page 6: A framework of identity resolution: evaluating identity ... · rule-based and machine learning approaches. There have been several rule-based identity resolution approaches based

Li and Wang Security Informatics (2015) 4:6 Page 6 of 12

include but not limited to name, date of birth, socialsecurity number, and passport number. For an attri-bute in text format, the similarity between two attri-bute values can be calculated by edit distancemeasures such as the Levenstein Distance [40]. Foran attribute in number or date format, absolutedifference or percentage difference can be used assimilarity measure. We use p to refer to resolutionapproaches based on personal identity attributesalone:

simp ri; rj� � ¼ 1

l

Xlk¼1

sim ri:Ak ; rj:Ak� �

:

If the similarity score simp(ri, rj) is above a threshold,references ri and rj are considered to be co-referent.These approaches tend to perform well for referenceswith duplicates and typos, but are likely to have low re-call for cases involving intentional deceptions.

Social behavior attributesBased on social identity theories, we consider an in-dividual’s behaviors in the society as reflections ofhis/her underlying identity. Social behavior attributesrepresent the common characteristics of the socialgroup(s) that one belongs to. They reflect the psy-chological view of personal identity. Most identitymanagement systems contains information about inci-dents (e.g., credit card transactions, crime miscon-ducts) related to each individuals. In our problemdefinition, each incident is denoted by a hyper-edgein the graph. Therefore, we use incident-based behav-ioral patterns to describe the characteristics of an in-dividual. Each hyper-edge h is associated with a set ofattributes h.B1, h.B2, …, h.Bm (e.g., transaction day/time/amount, crime type). In addition, the participa-tion of r in h has a set of attributes r.h.C1, r.h.C2, …,r.h.Cn (e.g., role, start/end time). Such attributes, cap-turing information about how individual behave incertain incidents, should also be considered as socialbehavior attributes for identity resolution. It is worthnoting that this type of participating attributes wasnot considered in previous work [8, 39], adding theseattributes has enriched the social behavior attributesand made this a more comprehensive framework ofidentity resolution. We denote the set of hyper-edgesinvolving reference r as r.H. Then, each hyper-edgeri.h involving ri is compared with each hyper-edgerj.h' of rj based on attributes {Bk} and {Ck}. The over-all behavior similarity between ri and rj, simb(ri, rj), isthe average of similarity scores between each hyper-edge pairs:

simb ri; rj� � ¼ simb ri:H ; rj:H

� � ¼ 1

ri:Hj j � rj:H�� ��X

h∈ri:H ;h0∈rj:H

simb ri:h; rj:h0� �;

where |r.H| denotes the number of hyper-edges involv-ing r, and simb(ri.h, rj.h') is defined as the average simi-larity score for each attribute of h.Bk and r.h.Ck:

simb ri:h;rj:h0� � ¼ 1

2mn

nXmk¼1

sim h:Bk ; h0:Bkð Þ þm

Xnk¼1

sim ri:h:Ck ; rj:h0:Ck

� � !:

Social relationship attributesThe sociological view is the other level of a social iden-tity that concerns with social relationships of an individ-ual. We can define social relationship attributes tocapture this aspect of social identities. If two referencesri and rj are both related to the same reference rk (e.g., riand rj co-occur with rk in different hyper-edges), this canbe regarded as evidence that these two references areco-referent. We denote the neighborhood of a referencer as Nbr(r). Then, the neighborhood similarity betweentwo references ri and rj, simn(ri, rj), can be defined as:

simn ri; rj� � ¼ simn Nbr rið Þ;Nbr rj

� �� �¼ simn ri:H ; rj:H

� �:

To compute the neighborhood similarity between ref-erences ri and rj, we define hyper-edge neighborhoodsimilarity simn(hi, hj) between two hyper-edges hi and hjas a pair-wise match between their references.Here, the relational similarity between two hyper-

edges hi and hj is computed as the maximum of thesimilarity score between a reference r ∈ hi.R and a refer-ence r' ∈ hj.R:

simn hi; hj� � ¼ maxr∈hi:R;r0∈hj:R simp r; r0ð Þ� �

:

Furthermore, the neighborhood similarity between tworeferences ri and rj is defined as:

simn ri; rj� � ¼ 1

ri:Hj j � rj:H�� �� X

h∈ri:H;h0∈rj:H

simn ri:h; rj:h0� �;

where |r.H| still denotes the number of hyper-edges in-volving r. It is worth noting that the neighborhood ofreference r contains r. Thus, if two references ri and rjdo not share any common neighbor, their neighborhoodsimilarity simn(ri, rj) is equal to their personal identity at-tribute similarity simp(ri, rj). In other words, two refer-ences having the same/similar neighbors can be regarded

Page 7: A framework of identity resolution: evaluating identity ... · rule-based and machine learning approaches. There have been several rule-based identity resolution approaches based

Li and Wang Security Informatics (2015) 4:6 Page 7 of 12

as evidence to support that they are more likely to be co-referent, whereas two references not sharing a commonneighbor will not be treated as evidence of them not beingco-referent.Besides, we can define a negative constraint based on

social relationship, i.e., two references that co-occur inthe same hyper-edge cannot refer to the same individual.It is unlikely that one real-world individual can assumetwo different identities in one unique incident.

Aggregated similarity scoreFor a pair of references ri and rj, we have defined threescores that measure their similarity based on personalidentity, social behavior and social relationship attri-butes. In order to take into account different attributesfor identity resolution, we need to aggregate these simi-larity scores to an overall score.Here we use a weighted average method to computing

the overall similarity. It is worth noting that, before be-ing combined, similarity scores should be normalized tothe same scale, e.g., in the range of [0, 1]. Thus, theoverall similarity score between two references ri and rjis a weighted average of all similarity scores: i.e., per-sonal identity similarity, social behavior similarity, andsocial neighborhood similarity:

AVG ri; rj� � ¼ α� simp ri; rj

� �þ β� simb ri; rj� �

þ 1−α−βð Þ � simn ri; rj� �

where weights 0 ≤ α, β ≤ α + β ≤ 1 and they can be ad-justed to control the importance of the three similarityscores.If α = 1, then only personal identity similarity is con-

sidered and AVG(ri, rj) is simply sima(ri, rj). We repre-sent this as simA(ri, rj).If α, β >0 and α + β = 1, then AVG (ri, rj) is the average

of personal identity similarity and social behavior simi-larity whereas social neighborhood similarity is not con-sidered. We represent this as simB(ri, rj).If 0 < α, β ≤ α + β < 1, then all three similarity scores

are considered in the average score. We represent this assimR(ri, rj).

Fig. 2 Pseudo-code of pairwise comparison

Matching algorithmsGiven a collection of references and similarity measuresdefined above, we still design an algorithmic process totraverse and compare all identity pairs and eventually re-veal all underlying individuals. Here, we describe threematching algorithms in existing resolution techniques,namely pairwise comparison, transitive closure, and col-lective clustering.

Pairwise comparisonPair-wise comparison is a basic and simple procedurefor entity resolution. For each pair of references ri and rj,we can compute the similarity score using one of theaforementioned functions. If the similarity score sim(ri, rj)is greater than a predefined threshold θ, we conclude thatri and rj are co-referent. Figure 2 shows the pseudo-code ofthis algorithm:

Transitive closureThe outcome of pairwise comparison only tells whethereach reference pair matches or not. Technically, this stillhas not yet uncovered the underlying individuals in thereference collection. Consider a simple example wherereferences ri and rj are a match, rj and rk are also amatch, while ri and rk are determined to be a non-match. In this case, the pair-wise comparison may pro-duce conflicting resolution outcomes. Hence, to uncoverthe underlying individuals, an extra step is to use transi-tive closure and merge all related matching results. Inthe previous example, even though ri and rk are not suf-ficiently similar, we still consider them as a co-referentpair because ri matches rj and rj matches rk. This processshould be performed iteratively until all transitive clo-sures are reached. Finally, each transitive closure is con-sidered as a distinct individual. Given the output frompairwise comparison, computing transitive closures isstraightforward and efficient. However, it is worth notingthat such a merging process lowers the threshold of thematching criteria. As a result, it tends to reduce falsenegatives but introduce more false positives. Figure 3shows the pseudo-code of this algorithm. Specifically, atStep 4, the results of pairwise comparison in 3.3.1 isused to judge if two references r and r' match. Thisprocess does require recalculating any similarity scores.The minimum distance between references from twoclusters is simply considered as the distance between thetwo clusters. In this sense, this algorithm is a single-linkage or nearest neighbor clustering method.

Collective clusteringCollective clustering is a different matching techniquefor identity resolution [8]. It adopts a greedy agglomera-tive clustering algorithm to find the most similar refer-ences (or clusters) and merge them. There are some

Page 8: A framework of identity resolution: evaluating identity ... · rule-based and machine learning approaches. There have been several rule-based identity resolution approaches based

Fig. 3 Pseudo-code of transitive closure

Li and Wang Security Informatics (2015) 4:6 Page 8 of 12

fundamental differences between this algorithm with thetwo introduced above.Collective clustering does requires defining a similarity

function on clusters of references. Each cluster formed isconsidered the same real-world individual. Unlike transi-tive closure that uses single-linkage clustering, collectiveclustering uses an average linkage approach instead. Itdefines the similarity between two clusters ci and cj asthe average similarity between each reference in ci andeach reference in cj:

sim ci;cj� � ¼ 1

ci:Rj j � cj:R�� �� X

r∈cj:R;r0∈cj:R

sim cj:r; cj:r0� �;

where |c.R| represents the number of references incluster c.Figure 4 shows the pseudo-code of this algorithm. Like

transitive closure, collective clustering begins withassigning each reference to a different cluster. Iteratively,this algorithm computes the between-cluster similarityand merges the most similar pair of clusters until thesimilarity drops below the predefined threshold. As clus-ters merge, one cluster can contain multiple referencesthat are determined to be co-referent. A critical step inthis algorithm is that, every time two clusters aremerged, all similarity scores that involve references fromthe merged clusters must be recomputed. Specifically,the merging of two clusters ci and cj into one, whichcould change the linkage structures in the graph, shouldbe regarded as new evidence for computing the similar-ity between other references that co-occur with those inci and cj. Such an iterative computational procedure,which takes into account resolution decisions on allnodes in the graph jointly, makes this clustering algo-rithm collective and distinguishes it with the other pair-wise comparison based algorithms.

Fig. 4 Pseudo-code of collective clustering

Complexity analysisFor a collection of n references, a complete pairwisecomparison approach considers all possible referencepairs as potential candidates for merging. Such a processhas a complexity of O(n2). The collective clustering algo-rithm involves an iterative procedure of recalculatingcluster similarities and therefore requires even morecomputation. Apart from the scaling issue, in realitymost pairs will be rejected while often times only lessthan 1 % of pairs are real matches. Hence, certain block-ing steps should be incorporated and performed beforematching so as to screen out highly unlikely candidatepairs for matching. There are various blocking tech-niques but they often employ one simple and computa-tionally cheap function to group references into anumber of buckets. Buckets can overlap so one refer-ence can belong to multiple buckets. Only referencepairs within the same bucket should be considered can-didate pairs for comparing and merging, whereas a pairof references from two different buckets is not consid-ered as candidate for further comparison. For collectiveclustering, two clusters must have all of their referencesbelonging to the same bucket to make them a candidatepair to be compared and possibly merged. Furthermore,in the agglomerative clustering process, it is not neces-sary to recalculate the similarity scores for every clusterpair. Only those involving references from the mergedclusters need to be recomputed. We have implementedthese aforementioned strategies for reducing complexityin our evaluation. More strategies for reducing the com-plexity of collective clustering can be found in [8].

Comparative evaluationIn this study, we conducted a comparative evaluation ofidentity resolution techniques based on different identityattributes and matching algorithms. We describe our ex-periments and discuss our findings in this section.

DatasetsOur evaluation used both a computer-generated syn-thetic dataset and a real-world dataset. The syntheticdataset was used to conduct a more comprehensive em-pirical evaluation on the proposed framework. The real-world dataset was used to test the usefulness and reli-ability of the proposed framework in a real setting.

A synthetic datasetSynthetic data allows us to embed ground truth in thedata generated for algorithm testing. Here, we adopted atwo-staged data generation method, as described inFig. 5.In the creation stage the algorithm first creates N en-

tities with a node attribute x, a hyper-edge attribute y,and a participating attribute z. We differentiate these

Page 9: A framework of identity resolution: evaluating identity ... · rule-based and machine learning approaches. There have been several rule-based identity resolution approaches based

Fig. 5 Pseudo-code of synthetic data generation

Li and Wang Security Informatics (2015) 4:6 Page 9 of 12

three attributes because personal identity attributes (x)can be easily modified or falsified whereas social behav-ior attributes (y and z) tend to be more consistent andless likely to change significantly over time. Next, wecreate M linkages among these N entities to representtheir underlying social relationships. When we createthese links, two entities with similar social behavior at-tribute y, are more likely to be connected to each other

with a probability of Pb ei; ej� � ¼ P

ei:y−ej:yj jb ; i.e., two indi-

viduals with similar interests or behavioral patterns aremore likely to be related. In the generation stage, we cre-ated R hyper-edges that included a set of related entities.An entity may join a hyper-edge of its neighbor with aprobability of Pc. Whenever an entity is to join a hyper-edge, we either choose an existing reference of this en-tity with a probability Pa or create a new reference withits attribute x value following a Gaussian distribution ofN(e.x, 1). Each reference r joins at least one hyper-edge.Each hyper-edge is also assigned an attribute y, which isequal to the average of all participating entities’ behaviorattribute y. This is to assure the hyper-edge’s attributevalue not quite distant from the participating entities’.Similarly, for each participating reference r in h, we as-sign its participating attribute as the average of theunderlying entity’s attribute r.e.z and the hyper-edge’s at-tribute h.y.It is worth noting that this synthetic data generator is

based on a method used in [8] but has two major differ-ences. First, for each entity, we define a separate attri-bute y to represent one’s behavior characteristics that

can be reflected and observed in its involved hyper-edges. Second, our synthetic data allow a reference tojoin more than one hyper-edge, while in [8] each refer-ence only joins a single hyper-edge. This is more realisticin real-world identity management scenarios and re-quires more cost in computing similarities betweenreferences.In our experiments, we chose the following parameter

values for generating the synthetic data: N = 50, M = 100,L = 200, Pa = 0.8, Pb = 0.9, Pc = 0.6, and the value rangesof attributes x, y and z were set to [0, 100). With suchparameter settings, we generated 50 sets of syntheticdata and compared different resolution approaches.Since we limit the attribute values to be a number be-tween 0 and 100 instead of actual identity attributevalues (e.g., DOB, name), we simplified the similarity cal-culation introduced in Section “Identity attributes” asfollows:The similarity between two hyper-edges is defined as:

sim hi; hj� � ¼ hi:y−hj:y

�� ��range yð Þ

The similarity between two participations of referencesin hyper-edges is defined as:

sim ri:h; rj:h0� � ¼ ri:h:z−rj:h0:z

�� ��range zð Þ

A real datasetIdeally, to empirically evaluate and compare these iden-tity resolution techniques, we need a real-world datasetin which duplicate identity records exist. In previous lit-erature of entity resolution, many researchers have usedcitation data as test bed [8]. In order to better differenti-ate the three types of identity attributes, in this study wechose to use a dataset manually extracted from a set ofpublic web pages and news stories related to terrorismand subjectively linked entities mentioned in the articles.These pages and articles were mostly hosted by web sitesof governments and news organizations. This data set wasoriginally collected and annotated by Kubica et al. [41]. Ithas been used in several studies for testing link analysisand alias detection [35, 42]. In total, this dataset consistsof 4088 entity names. Among them, there are 164 namesof terrorists. These 164 names have been manuallyreviewed and recognized as co-references of 20 unique in-dividuals including well-known names such as Osama BinLaden and Abdel Muaz. For instance, there are 12 differ-ent names and references for Osama Bin Laden. Inaddition, this dataset also consists of 5581 links amongthese 4088 names, representing their co-occurrence rela-tionships in articles.

Page 10: A framework of identity resolution: evaluating identity ... · rule-based and machine learning approaches. There have been several rule-based identity resolution approaches based

Table 2 Possible outcomes of matching decisions

Decision

Reality Match Non-match

Match True positive (TP) False negative (FN)

Li and Wang Security Informatics (2015) 4:6 Page 10 of 12

The terrorist names are only a small subset of the en-tire set of 4088 names. For each terrorist name, we re-gard all its associated non-terrorist names as its socialbehavior attributes and all its linked terrorist names asits social relationship attributes.

Non-match False positive (FP) True negative (TN)

Experimental designn the experiments, we compared different identity reso-lution techniques in our framework. For identity attri-butes, we considered the following three sets, whichcumulatively added one type of attributes on top ofexisting ones:

� personal identity attributes alone;� personal identity + social behavior attributes; and� personal identity + social behavior + social

relationship attributes

For each attribute set, we conducted identity reso-lution using one of the three matching algorithms: pair-wise comparison, transitive closure, and collective clus-tering. Since the collective clustering algorithm involvessocial linkages between individuals, our implementationof this algorithm captures all three types of attributes.Table 1 lists seven combinations of identity attributesand matching algorithms that correspond to sevenidentity resolution approaches we tested in our com-parative experiments. Each technique is given anacronym as shown.We evaluated the performance of each identity

resolution approach by checking the correctness ofthe matching decisions for each reference pair. Wefollowed most identity resolution studies and choseprecision, recall and F-measure as our evaluationmetrics [8]. As illustrated in Table 2, each matchingdecision on a pair of references is either a “match”(positive) or a “non-match” (negative). A decision canbe either true or false.Based on the decision outcomes, precision, recall, and

F-measure are defined as follows:

Table 1 Experiment design: identity attributes vs. matchingalgorithms

Attributes

Algorithms PersonalIdentity

Personal Identity Personal Identity

+ Social Behavior + Social Behavior

+ Social Relationship

Pair-wise Comparison A B R

Transitive Closure A* B* R*

Collective Resolution – – CR

precision ¼ TPTP þ FP

recall ¼ TPTP þ FN

F−measure ¼ 2� precision � recallprecisionþ recall

Results and discussionResults for synthetic dataIn our experiment on the synthetic dataset, we used theweighted average approach to aggregate all similarityscores. We experimented with different parameter set-tings so as to have a more comprehensive comparison ofdifferent identity resolution techniques. We also experi-mented with different weight parameter values for α(0.0 ~ 1.0) and β (0.0 ~ 0.5). Table 3 summarizes the bestperformance for each of the seven resolution techniques.Solely using personal identity attributes (represented byx in our synthetic data), method A tends to have high re-call (92.62 %) but very low precision (9.43 %). If twoidentity references have very similar personal identity at-tributes, A tends to consider them co-referent. MethodB takes into account social behavior attributes repre-sented by y and z in our synthetic data. It achieved sig-nificantly higher precision (85.66 %) at the cost of lowerrecall (37.71 %) but higher F-measure (44.37 %) than A.Furthermore, method R added social relationship attri-butes and achieved better performance than B in termsof recall (37.87 %) and F-measure (50.41 %). The threemethods using transitive closures (A*, B*, R*) did notseem to give significantly better results than their corre-sponding pairwise comparison methods. Notably, thecollective resolution algorithm CR achieved the highest

Table 3 Peformance of identity resolution for synthetic data

Method α β 1-α-β Precision Recall F-measure

A 1.0 0.0 0.0 9.43 % 92.62 % 16.97 %

A* 1.0 0.0 0.0 7.72 % 94.27 % 14.16 %

B 0.7 0.3 0.0 85.66 % 30.71 % 44.37 %

B* 0.7 0.3 0.0 85.02 % 31.21 % 44.71 %

R 0.5 0.25 0.25 79.22 % 37.87 % 50.41 %

R* 0.5 0.25 0.25 77.11 % 38.64 % 50.44 %

CR 0.5 0.25 0.25 87.95 % 44.15 % 58.17 %

Page 11: A framework of identity resolution: evaluating identity ... · rule-based and machine learning approaches. There have been several rule-based identity resolution approaches based

Li and Wang Security Informatics (2015) 4:6 Page 11 of 12

precision (87.95 %) and the highest F-measure (58.17 %),which were both significantly higher than those of theother six methods. Therefore, our experiments on thesynthetic data demonstrated that, by considering allthree types of identity attributes and matching refer-ences in a collective and iterative manner, CR is a moreeffective method for identity resolution.

Results for real-world dataIn the empirical evaluation on the real-world dataset, wealso experimented with aggregated similarity scoresunder different parameter settings so as to have a morecomprehensive comparison of the identity resolutiontechniques. Specifically, we compared different weightvalues for α (0.0 ~ 1.0) and β (0.0 ~ 0.5). Table 4shows the best performance of the identity resolutiontechniques achieved and their corresponding param-eter settings.Table 4 summaries the performance of identity reso-

lution methods using weighted average scores. Whenα = 0.6, β = 0.0 and 1-α-β = 0.4, the collective clusteringresolution (CR) achieved the highest precision (99.34 %),recall (54.30 %), and F-measures (70.22 %).

DiscussionOur experiment results for the synthetic data show that,with the same matching algorithm, using personal iden-tity attributes alone may achieve higher recall but at thecost of low precision and F-measure. As social behaviorattributes being taken into account, we observed a sig-nificant boost in both precision and F-measure. Thisshows that social behavior attributes do contribute tothe performance improvement of identity resolution byreducing false positives. Furthermore, when social rela-tionship attributes were also considered, there weresome minor drops in precision but with generally higherF-measure. As confirmed by results for both syntheticand real data, the approaches using all three types ofidentity attributes (R, R* and CR) outperformed its base-lines in terms of F-measure. In particular, the collectiveclustering algorithm (CR), which not only considersall types of personal and social attributes but makes

Table 4 Peformance of identity resolution for real-world data

Method α β 1-α-β Precision Recall F-measure

A 1.0 0.0 0.0 48.59 % 18.50 % 26.79 %

A* 1.0 0.0 0.0 48.59 % 18.50 % 26.79 %

B 0.9 0.1 0.0 57.98 % 16.47 % 25.65 %

B* 0.9 0.1 0.0 57.98 % 16.47 % 25.65 %

R 0.1 0.0 0.9 97.29 % 38.54 % 55.21 %

R* 0.1 0.0 0.9 97.29 % 38.54 % 55.21 %

CR 0.6 0.0 0.4 99.34 % 54.30 % 70.22 %

matching decisions on identities in a collective fashion,was demonstrated to be the optimal methods that out-perform all others.Notably, the optimal parameter values (i.e., α and β)

varied in our experiments across the two datasets anddifferent algorithms. In general, α was greater than β, in-dicating personal identity attributes being more import-ant than social relationship attributes. Particularly forthe real-world dataset, β values were very small for algo-rithms B and B* and zero for R, R*, and CR. The resultsshow that social behavior attributes did not help muchin finding matching terrorists. In our study, we consid-ered a terrorist’s associated non-terrorist names ashis/her social behavior attributes. The low β valuesmay be caused by the nature of terrorist incidents,which each time involve different non-terrorist peoplesuch as victims. The weight for social relationship attri-butes, 1-α -β, was high in R, R*, and CR, which showed theimportant of social relationship attributes in finding match-ing terrorists.

Conclusions and future directionsIn this paper we introduced a framework of identityresolution techniques that utilizes different identity at-tributes and matching algorithms. Guided by existingidentity theories, we defined and examined three typesof identity cues, namely personal identity attributes,social behavior attributes, and social relationship attri-butes, for identity resolution. We also evaluated threematching algorithms: pair-wise comparison, transitiveclosure, and collective clustering. Our experimentalresults showed that both social behavior and relation-ship attributes improved the performance of identityresolution as compared to using personal identity at-tributes alone. In particular, the collective identityresolution algorithm, which considers all three typesof identity attributes in a collective clustering process,was shown to achieve the best performance among allapproaches.Since this study mainly focuses on comparing the

three types of identity attributes with three identity reso-lution algorithms. These three algorithms are allsimilarity-based unsupervised learning methods. In ourfuture research, we plan to compare them with otherexisting identity resolution algorithms (e.g., supervisedlearning algorithms). Another limitation of this studyis the type and size of datasets employed in experi-ments. To validate, improve, and operationalize theseidentity resolution techniques, we plan to test themon real-world identity datasets with larger scales andmore attributes. Furthermore, another direction ofour future research is to further optimize the com-plexity of identity matching algorithms for more effi-cient processing.

Page 12: A framework of identity resolution: evaluating identity ... · rule-based and machine learning approaches. There have been several rule-based identity resolution approaches based

Li and Wang Security Informatics (2015) 4:6 Page 12 of 12

AbbreviationsCRF: Conditional random field model; PRM: Probabilistic relational model;CR: Collective resolution.

Competing interestsThe authors declare that they have no competing interests.

Authors’ contributionsJL conceived of the study, carried out the research design, conducted theexperiments and drafted the manuscript. AGW co-designed the proposedapproach, participated in the experimental study and drafted the manuscript.Both authors read and approved the final manuscript.

Authors’ informationJL is an Assistant Professor in Business Information Systems, College ofBusiness, at Oregon State University. He earned his Ph.D. in ManagementInformation Systems from the University of Arizona. His research interestsinclude data mining, business analytics, social media analytics, and healthinformatics. He has published in Journal of Management InformationSystems, Decision Support Systems, IEEE Transactions, Journal of theAmerican Society for Information Science and Technology, Journal of theAssociation for Information Systems, Bioinformatics, Communications of theACM, Expert Systems with Applications, and Information Systems Frontiers.GAW is an Associate Professor in the Department of Business InformationTechnology, Pamplin College of Business, at Virginia Tech. He received hisPh.D. in Management Information Systems from the University of Arizona.His research interests include heterogeneous data management, datacleansing, data mining and knowledge discovery, and decision supportsystems. He has published in Production and Operations Management,Communications of the ACM, IEEE Transactions of Systems, Man andCybernetics (Part A), IEEE Computer, Group Decision and Negotiation, andJournal of the American Society for Information Science and Technology. Dr.Wang is a member of the Association for Information Systems (AIS), theDecision Sciences Institute (DSI), and the Institute of Electrical and ElectronicsEngineers (IEEE).

Author details1College of Business, Oregon State University, Corvallis, OR 97331, USA.2Pamplin College of Business, Virginia Tech, Blacksburg, VA 24061, USA.

Received: 23 April 2015 Accepted: 13 July 2015

References1. TH Kean, CA Kojm, P Zelikow, JR Thompson, S Gorton, TJ Roemer, JS Gorelick,

JF Lehman, FF Fielding, B Kerrey, The 9/11 Commission Report. 2004. URL:http://govinfo.library.unt.edu/911/report/index.htm

2. U.S. Department of State: Country Reports on Terrorism 2006. 2007. URL:http://www.state.gov/j/ct/rls/crt/2006/

3. J Li, GA Wang, H Chen, Identity matching using personal and social identityfeatures. Inf. Syst. Front. 13, 101–113 (2010)

4. JS Pistole, Fraudulent Identification Documents and the Implications forHomeland Security. Statement Rec Before House Sel Comm Homel Secur.2003. URL: https://www.fbi.gov/news/testimony/fraudulent-identification-documents-and-the-implications-for-homeland-security

5. J Jonas, Identity resolution: 23 years of practical experience andobservations at scale, in Proc 2006 ACM SIGMOD Int Conf Managdata - SIGMOD’06 (ACM Press, New York, NY, USA, 2006), p. 718 [SIGMOD’06]

6. GA Wang, H Chen, H Atabakhsh, Automatically detecting deceptive criminalidentities. Commun. ACM 47, 70–76 (2004)

7. GA Wang, HC Chen, JJ Xu, H Atabakhsh, Automatically detecting criminalidentity deception: an adaptive detection algorithm. IEEE Trans. Syst. Man.Cybern. Part a-Systems Humans 36, 988–999 (2006)

8. I Bhattacharya, L Getoor, Collective entity resolution in relational data. ACMTrans. Knowl. Discov. from Data 1, 5 (2007)

9. H Pasula, B Marthi, B Milch, S Russell, I Shpitser, Identity uncertainty andcitation matching. Adv Neural Inf Process Syst 2003:1425–1432.

10. T Donaldson, A position paper on collaborative deceit, in AAAI-94 Work PlanInteragent Commun, 1994

11. IP Fellegi, AB Sunter, A theory for record linkage. J. Am. Stat. Assoc. 64,1183–1210 (1969)

12. M Bilenko, R Mooney, W Cohen, P Ravikumar, S Fienberg, Adaptive namematching in information integration. IEEE Intell. Syst. 18, 16–23 (2003)

13. Hernandez MA, Stolfo SJ: The Merge/purge Problem for Large Databases. InProc 1995 ACM SIGMOD Int Conf Manag data (SIGMOD ’95) Edited by CareyMJ, Schneider DA. San Jose, CA; 1995:127–138

14. AE Monge, Matching algorithms within a duplicate detection system. IEEE.Data Eng. Bull 23, 14–20 (2000)

15. AK Elmagarmid, PG Ipeirotis, VS Verykios, Duplicate record detection: asurvey. IEEE Trans. Knowl. Data Eng. 19, 1–16 (2007)

16. JM Cheek, SR Briggs, Self-consciousness and aspects of identity. J. Res. Pers.16, 401–408 (1982)

17. H Tajfel, JC Turner, The Social Identity Theory of Inter-Group Behavior(Nelson-Hall, Chicago, 1986)

18. JC Turner, Some Current Issues in Research on Social Identity andSelf-Categorization Theories (Blackwell, Oxford, 1999)

19. K Deaux, D Martin, Interpersonal networks and social categories: specifyinglevels of context in identity processes. Soc. Psychol. Q. 66, 101–117 (2003)

20. S Stryker, RT Serpe, Commitment, Identity Salience, and Role Behavior: Theoryand Research Example (Springer-Verlap, New York, 1982)

21. TC Redman, The impact of poor data quality on the typical enterprises.Commun. ACM 41, 79–82 (1998)

22. United Kingdom Home Office: Identity Fraud: A Study. 2002. URL: http://www.homeoffice.gov.uk/cpd/id_fraud-report.pdf

23. R Ananthakrishna, S Chaudhuri, V Ganti, Eliminating Fuzzy Duplicates inData Warehouses. In Proc 28th Int Conf Very Large Data Bases. Hong Kong,China; 2002:586–597.

24. D V Kalashnikov, S Mehrotra, Z Chen, Exploiting relationships for domain-independent data cleaning. In Proc 2005 SIAM Int Conf Data Min. NewportBeach, CA; 2005:262-273

25. H Köpcke, E Rahm, Frameworks for entity matching: a comparison. DataKnowl. Eng. 69, 197–210 (2010)

26. B Marshall, S Kaza, J Xu, H Atabakhsh, T Petersen, C Violette, H Chen,Cross-Jurisdictional Criminal Activity Networks to Support Border andTransportation Security. In Proceedings 7th Int IEEE Conf Intell Transp Syst.Washington, D.C.; 2004:100–105.

27. DE Brown, SC Hagen, Data association methods with applications to lawenforcement. Decis. Support Syst. 34, 369–378 (2003)

28. D Dey, S Sarkar, P De, A distance-based approach to entity reconciliation inheterogeneous databases. IEEE Trans. Knowl. Data. Eng. 14, 567–582 (2002)

29. P Ravikumar, WW Cohen, A Hierarchical Graphical Model for Record Linkage, inProceeding 20th Conf Uncertain Artif Intell (AUAI Press, Arlington, VA, 2004)

30. W. E. Winkler. Methods for record linkage and Bayesian networks. TechnicalReport, Statistical Research Division, U.S. Census Bureau, Washington, DC, 2002.

31. K Nigam, AK McCallum, S Thrun, T Mitchell, Text Classification from Labeledand Unlabeled Documents using EM. Mach. Learn 39, 103–134 (2000)

32. A Culotta, A McCallum, Joint deduplication of multiple record types inrelational data, in Proc 14th ACM Int Conf Inf Knowl Manag (ACM, Bremen,Germany, 2005), pp. 257–258

33. I Bhattacharya, L Getoor, Entity resolution in graphs, in Min graph data(Wiley-Blackwell, Hoboken, 2006), p. 311

34. S Bartunov, A Korshunov, S Park, W Ryu, H Lee, Joint Link-Attribute UserIdentity Resolution in Online Social Networks. In Proc 6th Work Soc NetwMin Anal (SNA-KDD ’12). Beijing, China; 2012

35. P Hsiung, A Moore, D Neill, J Schneider, Alias detection in link data sets. InProc Int Conf Intell Anal. McLean, VA; 2005.

36. J Xu, GA Wang, J Li, M Chau, Complex problem solving: identity matchingbased on social contextual information. J. Assoc. Inf. Syst. 8, 525–545 (2007)

37. MA Jaro, UNIMATCH: A Record Linkage System: User’s Manual. Washington:The Bureau, 1978

38. HB Newcombe, JM Kennendy, SJ Axford, AP James, Automatic linkage ofvital records. Science 130, 954–959 (1959)

39. J Li, GA Wang, Criminal identity resolution using social behavior and relationshipattributes. In Proc 2011 IEEE Int Conf Intell Secur Informatics, ISI 2011; 2011:173–175.

40. VI Levenshtein, Binary codes capable of correcting deletions, insertions, andreversals. Soviet Physics Doklady 10, 707–710 (1966)

41. J Kubica, A Moore, D Cohn, J Schneider, Finding underlying connections: afast graph-based method for link analysis and collaboration queries. Proc.Int. Conf. Mach. Learn. 20, 392–399 (2003)

42. J Kubica, AW Moore, D Cohn, J Schneider, cGraph: A fast graph-basedmethod for link analysis and queries. In Proc 2003 IJCAI Text-MiningLink-Analysis Work; 2003:22–31