1_5_esteva_slidesi
7/23/2019 1_5_esteva_slidesi
1/17
A Case Study on Entity Resolution for
Processing of Big Humanities Data
Xu, Esteva, Trelogan, Swinson
Big Data in the Humanities Workshop
IEEE Big Data
Santa Clara, October 8 2013
-
Big and Chaotic Humanities Data
Collections formation process
Large teams and collaborative work
Idiosyncratic record-keeping practices
Creators are overwhelmed by large data and its management requirements
Redundancy and repetition
-
Managing Humanities Data
Big humanities datasets: traditional data management methods do not scale
Data management in the Humanities context
Archival spin: identify and preserve data provenance and relationships
Continuum of data management decisions: access and reuse goals
Distant and close processing: both are required
-
Background
Entity resolution (ER) originated in database research for duplicate record detection
Has broadened into social network analysis and web data mining
Has not been used for data management
Dearth of work on archival processing using computational methods
We focus on data organization, considering duplicate and redundant information
-
ER in Data Management
Proposed as a framework to resolve cases of data redundancy, duplication and disorganization
Entity is not defined; it could be a theme lost in the data
-
Case Study Collection
-
Workflow
-
Data Modeling: Three Features
Raw input (per directory)
/Austin/Photos to ArtNews_JA2008/Excav_0705-2.psd
/Austin/Photos to ArtNews_JA2008/CHBeach.psd
/Austin/Photos to ArtNews_JA2008/CH_EastWall.psd
/Austin/Photos to ArtNews_JA2008/1930s Basilica2.psd
/Austin/Photos to ArtNews_JA2008/Chers.aerial'02.psd
/Austin/Photos to ArtNews_JA2008/Flower&cols.psd
/Austin/Photos to ArtNews_JA2008/1930s Basilica-closeup.psd
/Austin/Photos to ArtNews_JA2008/Excav_0705.psd
Series
/Austin/Photos to ArtNews_JA2008/Excav_0705.psd
(a1, a2, a3, ..., an), where n > 3
Differences between consecutive numerals are within a threshold
Devised algorithm for series detection
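The slides do not show the series detection algorithm itself. The following is only a minimal sketch of the stated idea (more than three numerals whose consecutive differences stay within a threshold). The function names, the choice of the last digit run in a file name, and the default threshold of 2 are assumptions made for illustration, not the authors' method.

```python
import re
from pathlib import PurePosixPath


def extract_number(path):
    """Return the last run of digits in the file stem as an int, or None.

    Using the last digit run is an assumption; the slides do not say
    which numeral in a file name is used.
    """
    stem = PurePosixPath(path).stem
    digit_runs = re.findall(r"\d+", stem)
    return int(digit_runs[-1]) if digit_runs else None


def find_series(numbers, threshold=2, min_length=4):
    """Group numerals into series.

    A series is a run of sorted, distinct numerals in which each
    consecutive difference is <= threshold and which has at least
    min_length members (n > 3 on the slide means at least 4).
    """
    series, current = [], []
    for value in sorted(set(numbers)):
        # A gap larger than the threshold closes the current run.
        if current and value - current[-1] > threshold:
            if len(current) >= min_length:
                series.append(current)
            current = []
        current.append(value)
    if len(current) >= min_length:
        series.append(current)
    return series
```

For example, `find_series([1, 2, 3, 4, 10, 20, 21, 22, 23, 25])` keeps two runs and drops the isolated value 10.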
"
-
Data Modeling: Three Features
Tags
/Austin/Photos to ArtNews_JA2008/CH_EastWall.psd
Five tags per directory
Selection based on rules
List of series identified
Structural information (pairwise comparison)
/Austin/Photos to ArtNews_JA2008/Excav_0705.psd
/Austin/Photos to ArtNews_JA2008/CH_EastWall.psd
Suffix and prefix scores between the paths of each pair of directories
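How the prefix and suffix scores are computed is not stated on the slide. One plausible reading, sketched below as an assumption, counts the path components two directory paths share from the start and from the end. The function names and the example directory names are hypothetical, and the slides may scale or invert these counts differently.

```python
def split_path(path):
    """Split a POSIX-style path into its non-empty components."""
    return [part for part in path.split("/") if part]


def shared_prefix_length(path_a, path_b):
    """Count leading path components that both paths have in common."""
    count = 0
    for a, b in zip(split_path(path_a), split_path(path_b)):
        if a != b:
            break
        count += 1
    return count


def shared_suffix_length(path_a, path_b):
    """Count trailing path components that both paths have in common."""
    count = 0
    for a, b in zip(reversed(split_path(path_a)), reversed(split_path(path_b))):
        if a != b:
            break
        count += 1
    return count


# Hypothetical directory pair, for illustration only.
dir_a = "/Austin/Photos/2008/raw"
dir_b = "/Austin/Backup/2008/raw"
prefix = shared_prefix_length(dir_a, dir_b)  # shares "Austin"
suffix = shared_suffix_length(dir_a, dir_b)  # shares "2008" and "raw"
```

A long shared suffix under a different prefix is the kind of pattern a copied or backed-up subtree would produce, which is why both ends of the path are compared.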
-
Similarity Scoring Model
Pairwise directory comparison
Score for top 5 tags: 0 to 5
Scores for degrees of series overlap: 0 to 2
Overall similarity score: [tag score, series score, prefix score, suffix score]
0 is the best score
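The slide gives the score ranges but not the scoring rules. The sketch below shows one way a four-part score vector could be assembled so that 0 is the best value. The rules are assumptions: the tag score counts top-five tags that are not shared, and the series score uses 0 for identical series sets, 1 for partial overlap and 2 for none. All names are hypothetical.

```python
def tag_score(tags_a, tags_b):
    """Count how many of A's top five tags are missing from B's top five.

    Assumed rule: 0 means all five tags are shared, 5 means none are.
    """
    top_b = set(tags_b[:5])
    return sum(1 for tag in tags_a[:5] if tag not in top_b)


def series_score(series_a, series_b):
    """Assumed rule: 0 = same series, 1 = partial overlap, 2 = no overlap."""
    a, b = set(series_a), set(series_b)
    if a == b:
        return 0
    if a & b:
        return 1
    return 2


def similarity_vector(tags_a, tags_b, series_a, series_b, prefix_score, suffix_score):
    """Return [tag score, series score, prefix score, suffix score].

    The prefix and suffix scores are passed in precomputed, because the
    slides do not define how they are scaled.
    """
    return [
        tag_score(tags_a, tags_b),
        series_score(series_a, series_b),
        prefix_score,
        suffix_score,
    ]
```

Each series is represented here as a tuple of numerals so that sets of series can be compared directly.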
-
Clustering Analysis & Evaluation
K-means clustering
K values: 21, 50, 125
Run 5 times, keeping the lowest sum of squared errors
GUI for user evaluation
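The repeated-run selection described on this slide can be sketched as follows. This is a plain, standard-library K-means written for illustration, not the implementation used in the case study. The function names, the random initialisation from sampled points, and the seed are assumptions.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, max_iterations=100):
    """Run one K-means pass and return (centroids, labels, sse)."""
    # Initialise centroids from k randomly sampled points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iterations):
        # Assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the run has converged
        labels = new_labels
        # Move each centroid to the mean of its members; an empty
        # cluster keeps its previous centroid.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))
    return centroids, labels, sse


def best_of_runs(points, k, runs=5, seed=0):
    """Repeat K-means `runs` times and keep the run with the lowest SSE."""
    rng = random.Random(seed)
    results = [kmeans(points, k, rng) for _ in range(runs)]
    return min(results, key=lambda result: result[2])
```

Calling `best_of_runs(vectors, k)` once per K value would mirror the comparison across K on the slide, where `vectors` stands for whatever feature rows are being clustered.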
-
TAGS and Series Distributions
Figure: numbers (log scale, 1 to 10,000,000) by score (0 to 5) for the tag feature and the series feature
-
Clustering Evaluation
Clustering classification
5 types of clusters
Representing different kinds of redundancy
-
Comparing K Values and Clustering Types
Figure: ratio (axis 0 to 2.5) of cluster types 1 to 5 for K = 21, 50 and 125
-
K Values & Action Scoring
Action scoring
Confidence of making a data management decision
Action or no action
0 no action, 1 needs close processing, 2 complete confidence in decision
K 21: all actions with low confidence (multiple cluster types within one cluster)
K 50: highest no action and lowest confidence
K 125: provided highest confidence in data management actions
Figure: chart comparing action scoring for K = 21, 50 and 125 (labels not recoverable)
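To make the 0/1/2 action scale concrete, the sketch below tallies the share of clusters at each score for one K value. The function name and the idea of summarising by share are assumptions for illustration. The slides do not describe how the scores were aggregated, and no real results are reproduced here.

```python
from collections import Counter

# Score meanings as listed on the slide.
ACTION_LABELS = {
    0: "no action",
    1: "needs close processing",
    2: "complete confidence in decision",
}


def summarize_action_scores(scores):
    """Return the share of clusters at each action score.

    `scores` is a list of 0/1/2 values, one per evaluated cluster.
    """
    if not scores:
        raise ValueError("at least one score is required")
    counts = Counter(scores)
    total = len(scores)
    return {label: counts.get(score, 0) / total for score, label in ACTION_LABELS.items()}
```

Running this once per K value on a reviewer's scores would give the kind of per-K comparison the slide reports.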
-
Qualitative Evaluation
Establish priorities for close processing
List of action items: discard, merge, leave as is, reorganize
Found useful relationships and reconstruction of provenance (3)
Resolved duplication
Identified large sections of back-ups in deeply nested sections for deletion (1, 4)
Identified useful duplicates to leave as is (2)
-
Conclusions
ER as an approach
Heuristics for a specific case
Need a benchmark
Combine other data attributes for decision making
Interactive model improvement
Visual analytics
Funding: National Archives and Records Administration & Packard Humanities Institute