1_5_esteva_slidesi


1/17

A Case Study on Entity Resolution for Processing of Big Humanities Data

Xu, Esteva, Trelogan, Swinson

Big Data in the Humanities Workshop, IEEE Big Data

Santa Clara, October 8, 2013


2/17

Big and Chaotic Humanities Data

Collections formation process

Large teams and collaborative work

Idiosyncratic record-keeping practices

Creators are overwhelmed by large data and its management requirements

Redundancy and repetition


3/17

Managing Humanities Data

Big humanities datasets: traditional data management methods do not scale

Data management in the Humanities context

Archival spin: identify and preserve data provenance and relationships

Continuum of data management decisions: access and reuse goals

Distant and close processing: both are required


4/17

Background

ER originated in database research for duplicate record detection

It has broadened into social network analysis and web data mining

It has not been used for data management

Dearth of work on archival processing using computational methods

We focus on data organization, considering duplicate and redundant information


5/17

ER in Data Management

Proposed as a framework to resolve cases of data redundancy, duplication and disorganization

Entity is not defined; it could be a theme lost in the data


6/17

Case Study Collection


7/17

Workflow


8/17

Data Modeling: Three Features

Raw input (per directory):

/Austin/Photos to ArtNews_JA2008/Excav_0705-2.psd

/Austin/Photos to ArtNews_JA2008/CHBeach.psd

/Austin/Photos to ArtNews_JA2008/CH_EastWall.psd

/Austin/Photos to ArtNews_JA2008/1930s Basilica2.psd

/Austin/Photos to ArtNews_JA2008/Chers.aerial'02.psd

/Austin/Photos to ArtNews_JA2008/Flower&cols.psd

/Austin/Photos to ArtNews_JA2008/1930s Basilica-closeup.psd

/Austin/Photos to ArtNews_JA2008/Excav_0705.psd

Series: /Austin/Photos to ArtNews_JA2008/Excav_0705.psd

(a1, a2, a3, ..., an), where n > 3

Differences between consecutive numerals are within a threshold

Devised algorithm for series detection
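
The slides do not give the series-detection algorithm itself, so the following is only a minimal sketch of the stated definition: a run of numerals (a1, ..., an) with n > 3 whose consecutive differences stay within a threshold. The function name `find_series`, the default threshold of 2, and the sample file names are assumptions made for illustration; pooling every numeral found in a directory's file names is a simplification as well.

```python
import re


def find_series(names, threshold=2, min_len=4):
    """Group numerals found in file names into series.

    A series is a run of sorted numerals (a1, ..., an) with n > 3
    (min_len=4) in which each consecutive difference is at most
    `threshold`. Both defaults are illustrative assumptions.
    """
    # Collect every distinct integer that appears in any file name.
    numerals = sorted(
        {int(m) for name in names for m in re.findall(r"\d+", name)}
    )
    series, run = [], []
    for value in numerals:
        # A gap larger than the threshold closes the current run.
        if run and value - run[-1] > threshold:
            if len(run) >= min_len:
                series.append(run)
            run = []
        run.append(value)
    if len(run) >= min_len:
        series.append(run)
    return series


# Hypothetical file names, not taken from the case study collection.
names = ["IMG_0101.jpg", "IMG_0102.jpg", "IMG_0104.jpg",
         "IMG_0105.jpg", "IMG_0300.jpg"]
print(find_series(names))
```

With these sample names the numerals 101, 102, 104 and 105 form one series, while 300 is too far away to join it.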

"


9/17

Data Modeling: Three Features

Tags: /Austin/Photos to ArtNews_JA2008/CH_EastWall.psd

Five tags per directory

Selection based on rules

List of series identified

Structural information (pairwise comparison):

/Austin/Photos to ArtNews_JA2008/Excav_0705.psd

/Austin/Photos to ArtNews_JA2008/CH_EastWall.psd

Suffix and prefix scores between the paths of each pair of directories
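
How the prefix and suffix scores are computed is not stated on the slide. One plausible reading, sketched below as an assumption, counts the path components two directory paths share from the start (prefix) and from the end (suffix). The function names and the sample paths are hypothetical, and the mapping from these raw counts to the final scores is left open.

```python
def shared_leading(a, b):
    """Count how many leading items two sequences have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def path_overlap(path_a, path_b):
    """Return (shared prefix components, shared suffix components).

    This is an assumed interpretation of the slide's prefix and
    suffix scores, not the authors' documented formula.
    """
    parts_a = path_a.strip("/").split("/")
    parts_b = path_b.strip("/").split("/")
    prefix = shared_leading(parts_a, parts_b)
    # Reversing both lists turns the suffix comparison into a prefix one.
    suffix = shared_leading(parts_a[::-1], parts_b[::-1])
    return prefix, suffix


# Hypothetical directory paths, for illustration only.
print(path_overlap("/Austin/Photos/2008", "/Austin/Photos/2009"))
print(path_overlap("/Austin/Photos/2008", "/Backup/Austin/Photos/2008"))
```

A long shared suffix under different roots, as in the second call, is the kind of pattern that could point to a nested copy or back-up of the same directory tree.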


10/17

Similarity Scoring Model

Pairwise directory comparison

Score for top 5 tags: 0 to 5

Scores for degrees of series overlap: 0 to 2

Overall similarity score: [tag score, series score, prefix score, suffix score]

0 is the best score
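
The slide gives the score ranges and states that 0 is best, but not the formulas. The sketch below assembles the four-part score vector under assumed conventions: the tag score is 5 minus the number of shared top-five tags, and the series score is 0 for identical series sets, 1 for partial overlap and 2 for none. The prefix and suffix scores are passed in as already-computed numbers. All names and sample values are hypothetical.

```python
def tag_score(tags_a, tags_b):
    """0 (all five tags shared) to 5 (none shared); assumed convention."""
    return 5 - len(set(tags_a[:5]) & set(tags_b[:5]))


def series_score(series_a, series_b):
    """0 = same series, 1 = partial overlap, 2 = no overlap (assumed)."""
    a, b = set(series_a), set(series_b)
    if a and a == b:
        return 0
    if a & b:
        return 1
    return 2


def similarity(dir_a, dir_b, prefix_score, suffix_score):
    """Build [tag score, series score, prefix score, suffix score]."""
    return [
        tag_score(dir_a["tags"], dir_b["tags"]),
        series_score(dir_a["series"], dir_b["series"]),
        prefix_score,
        suffix_score,
    ]


# Hypothetical directories; each series is a (first, last) numeral pair.
dir_a = {"tags": ["excav", "basilica", "beach", "wall", "aerial"],
         "series": [(101, 105)]}
dir_b = {"tags": ["excav", "basilica", "beach", "flower", "cols"],
         "series": [(101, 105)]}
print(similarity(dir_a, dir_b, prefix_score=0, suffix_score=1))
```

Keeping the four components as a vector rather than summing them lets a later clustering step weigh each feature separately.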


11/17

Clustering Analysis & Evaluation

K-means clustering

K values: 21, 50, 125

5 runs

Lowest sum of squared errors

GUI for user evaluation
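
As a rough illustration of running K-means several times and keeping the run with the lowest sum of squared errors, here is a small standard-library sketch. It is not the authors' implementation: the function names, the fixed seed, the iteration cap and the toy four-dimensional vectors are all assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans_once(points, k, rng, iterations=50):
    """One run of Lloyd's algorithm; returns (labels, sum of squared errors)."""
    centers = rng.sample(points, k)  # random initial centers
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its members;
        # an empty cluster keeps its previous center.
        new_centers = [
            [sum(col) / len(members) for col in zip(*members)]
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    labels = [min(range(k), key=lambda i: sq_dist(p, centers[i]))
              for p in points]
    sse = sum(sq_dist(p, centers[label]) for p, label in zip(points, labels))
    return labels, sse


def best_of_runs(points, k, runs=5, seed=0):
    """Repeat K-means `runs` times and keep the lowest-SSE result."""
    rng = random.Random(seed)
    results = [kmeans_once(points, k, rng) for _ in range(runs)]
    return min(results, key=lambda result: result[1])


# Toy score vectors [tag, series, prefix, suffix]; values are made up.
points = [[0, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0],
          [5, 2, 3, 3], [5, 2, 3, 4], [4, 2, 3, 3]]
labels, sse = best_of_runs(points, k=2)
print(labels, round(sse, 3))
```

Repeating the run guards against a poor random initialization, since a single K-means run only reaches a local optimum.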


12/17

Tags and Series Distributions

[Chart: numbers (log scale, 1 to 10,000,000) by score (0 to 5); legend: Tag feature, Series feature]


13/17

Clustering Evaluation

Clustering classification

5 types of clusters

Representing different kinds of redundancy


14/17

Comparing K Values and Clustering Types

[Chart: ratio (0 to 2.5) by K value (21k, 50k, 125k); legend: type-1 through type-5]


15/17

K Values & Action Scoring

Action scoring: confidence of making a data management decision

Action / no action

0 = no action, 1 = needs close processing, 2 = complete confidence in decision

K = 21: all actions with low confidence (multiple cluster types within one cluster)

K = 50: highest no action and lowest confidence

K = 125: provided highest confidence in data management actions

[Chart: action versus no action by K value (21k, 50k, 125k)]


16/17

Qualitative Evaluation

Establish priorities for close processing

List of action items: discard, merge, leave as is, reorganize

Found useful relationships and reconstruction of provenance (3)

Resolved duplication

Identified large sections of back-ups in deeply nested sections for deletion (1, 4)

Identified useful duplicates to leave as is (2)


17/17

Conclusions

ER as an approach

Heuristics for a specific case

Need a benchmark

Combine other data attributes for decision making

Interactive model improvement

Visual analytics

Funding: National Archives and Records Administration & Packard Humanities Institute
