1_5_esteva_slidesi


  • 7/23/2019 1_5_esteva_slidesi

    1/17

    A Case Study on Entity Resolution for

    Processing of Big Humanities Data

    Xu, Esteva, Trelogan, Swinson

    Big Data in the Humanities Workshop

    IEEE Big Data

    Santa Clara, October 8 2013

  • 7/23/2019 1_5_esteva_slidesi

    2/17

    Big and Chaotic Humanities Data

    Collections formation process

    Large teams and collaborative work

    Idiosyncratic record-keeping practices

    Creators are overwhelmed by large data and its management requirements

    Redundancy and repetition

  • 7/23/2019 1_5_esteva_slidesi

    3/17

    Managing Humanities Data

    Big humanities datasets

    Traditional data management methods do not scale

    Data management in the Humanities context

    Archival spin

    Identify and preserve data provenance and relationships

    Continuum of data management decisions

    Access and reuse goals

    Distant and close processing

    Both are required

  • 7/23/2019 1_5_esteva_slidesi

    4/17

    Background

    Entity resolution (ER) originated in database research for duplicate record detection

    It has broadened into social network analysis and web data mining

    It has not been used for data management

    Dearth of work on archival processing using computational methods

    We focus on data organization, considering duplicate and redundant information

  • 7/23/2019 1_5_esteva_slidesi

    5/17

    ER in Data Management

    Proposed as a framework to resolve cases of data redundancy, duplication and disorganization

    Entity is not defined; it could be a theme lost in the data

  • 7/23/2019 1_5_esteva_slidesi

    6/17

    Case Study Collection

  • 7/23/2019 1_5_esteva_slidesi

    7/17

    Workflow

  • 7/23/2019 1_5_esteva_slidesi

    8/17

    Data Modeling: Three Features

    Raw input (per directory):

    /Austin/Photos to ArtNews_JA2008/Excav_0705-2.psd

    /Austin/Photos to ArtNews_JA2008/CHBeach.psd

    /Austin/Photos to ArtNews_JA2008/CH_EastWall.psd

    /Austin/Photos to ArtNews_JA2008/1930s Basilica2.psd

    /Austin/Photos to ArtNews_JA2008/Chers.aerial'02.psd

    /Austin/Photos to ArtNews_JA2008/Flower&cols.psd

    /Austin/Photos to ArtNews_JA2008/1930s Basilica-closeup.psd

    /Austin/Photos to ArtNews_JA2008/Excav_0705.psd

    Series: /Austin/Photos to ArtNews_JA2008/Excav_0705.psd

    (a1, a2, a3, ..., an), where n > 3

    Differences between consecutive numerals are within a threshold

    Devised algorithm for series detection
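The slides do not show the series detection algorithm itself. The following is only a minimal sketch of the stated rule (more than three numerals whose consecutive differences stay within a threshold). The function names, the threshold value of 2, and the sample file names are assumptions made for illustration, not taken from the case study.

```python
import re


def extract_numerals(filenames):
    """Collect every run of digits found in the given file names as integers."""
    numerals = []
    for name in filenames:
        numerals.extend(int(m) for m in re.findall(r"\d+", name))
    return numerals


def detect_series(numerals, threshold=2, min_length=4):
    """Group sorted, de-duplicated numerals into runs whose consecutive
    differences stay within `threshold`, keeping only runs with at least
    `min_length` members (n > 3 on the slide)."""
    values = sorted(set(numerals))
    series, current = [], []
    for value in values:
        # A gap larger than the threshold closes the current run.
        if current and value - current[-1] > threshold:
            if len(current) >= min_length:
                series.append(current)
            current = []
        current.append(value)
    if len(current) >= min_length:
        series.append(current)
    return series


# Hypothetical file names, used only to exercise the sketch.
files = ["IMG_0101.jpg", "IMG_0102.jpg", "IMG_0104.jpg",
         "IMG_0105.jpg", "IMG_0300.jpg"]
found = detect_series(extract_numerals(files))
print(found)
```

With these inputs the four numerals 101, 102, 104 and 105 form one series, and 300 is left out because the gap to it exceeds the threshold.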


  • 7/23/2019 1_5_esteva_slidesi

    9/17

    Data Modeling: Three Features

    Tags: /Austin/Photos to ArtNews_JA2008/CH_EastWall.psd

    Five tags per directory

    Selection based on rules

    List of series identified

    Structural information (pairwise comparison)

    /Austin/Photos to ArtNews_JA2008/Excav_0705.psd

    /Austin/Photos to ArtNews_JA2008/CH_EastWall.psd

    Suffix and prefix scores between the paths of each pair of directories
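How the prefix and suffix scores are computed is not given on the slide. One possible reading, sketched below, counts the path components two directory paths share from the start and from the end, and turns each count into a distance so that 0 means identical. The normalisation, the function names, and the second example path are assumptions.

```python
def _common_run(a, b):
    """Length of the shared leading run of two sequences."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def path_scores(path_a, path_b):
    """Return (prefix_score, suffix_score) for two directory paths.

    Each score is 1 minus the fraction of shared leading (or trailing)
    path components, so 0.0 means the paths agree completely and 1.0
    means they share nothing at that end. This scaling is an assumption.
    """
    parts_a = [p for p in path_a.split("/") if p]
    parts_b = [p for p in path_b.split("/") if p]
    longest = max(len(parts_a), len(parts_b))
    if longest == 0:
        return 0.0, 0.0
    prefix = _common_run(parts_a, parts_b)
    suffix = _common_run(parts_a[::-1], parts_b[::-1])
    return 1 - prefix / longest, 1 - suffix / longest


# The second path is hypothetical: a copy of the directory nested in a backup.
scores = path_scores("/Austin/Photos to ArtNews_JA2008",
                     "/Austin/Backup/Photos to ArtNews_JA2008")
print(scores)
```

Here both paths share one leading and one trailing component out of three, so both scores are equal and sit between 0 and 1.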

  • 7/23/2019 1_5_esteva_slidesi

    10/17

    Similarity Scoring Model

    Pairwise directory comparison

    Score for top 5 tags: 0 to 5

    Score for degrees of series overlap: 0 to 2

    Overall similarity score: [tag score, series score, prefix score, suffix score]

    0 is the best score
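The slide gives the ranges of the scores but not their definitions. The sketch below shows one way such a four-part vector could be assembled, under two loudly stated assumptions: the tag score counts the top-five tags that are not shared, and the series score is 0 for identical series, 1 for partial overlap and 2 for none. The prefix and suffix scores are passed in as already computed values. All names and sample data are hypothetical.

```python
def tag_score(tags_a, tags_b, top_n=5):
    """Number of top-n tags the two directories do not share (0 to top_n).
    Assumed definition; 0 means all top tags match."""
    shared = len(set(tags_a[:top_n]) & set(tags_b[:top_n]))
    return top_n - shared


def series_score(series_a, series_b):
    """Assumed three-level overlap score: 0 identical, 1 partial, 2 none.
    Two directories without any series are scored 2 (no evidence of overlap)."""
    a, b = set(series_a), set(series_b)
    if a and a == b:
        return 0
    if a & b:
        return 1
    return 2


def similarity_vector(dir_a, dir_b, prefix_score, suffix_score):
    """Build [tag score, series score, prefix score, suffix score]."""
    return [
        tag_score(dir_a["tags"], dir_b["tags"]),
        series_score(dir_a["series"], dir_b["series"]),
        prefix_score,
        suffix_score,
    ]


# Hypothetical directory summaries, for illustration only.
dir_a = {"tags": ["excav", "basilica", "aerial", "beach", "wall"],
         "series": [705, 706, 707, 708]}
dir_b = {"tags": ["excav", "basilica", "aerial", "beach", "cols"],
         "series": [707, 708, 709, 710]}
vector = similarity_vector(dir_a, dir_b, prefix_score=0.5, suffix_score=0.0)
print(vector)
```

Keeping the four components as a vector, rather than summing them, leaves the weighting to the clustering step.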

  • 7/23/2019 1_5_esteva_slidesi

    11/17

    Clustering Analysis & Evaluation

    K-means clustering

    K values: 21, 50, 125

    5 times

    Lowest sum of squared errors

    GUI for user evaluation
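As an illustration of running K-means several times and keeping the run with the lowest sum of squared errors, here is a small standard-library sketch. It is not the tooling used in the case study; the function names, the seed, the iteration cap, and the toy vectors are assumptions, and K is set to 2 only because the toy data is tiny.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, rng, iterations=50):
    """One Lloyd-style K-means run; returns (centroids, labels, sse)."""
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, sse


def best_of_runs(points, k, runs=5, seed=0):
    """Repeat K-means `runs` times and keep the run with the lowest SSE."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(runs)),
               key=lambda result: result[2])


# Toy similarity vectors [tag, series, prefix, suffix]; values are made up.
points = [[0, 0, 0.0, 0.0], [1, 0, 0.1, 0.0], [0, 1, 0.0, 0.1],
          [5, 2, 1.0, 1.0], [4, 2, 0.9, 1.0], [5, 1, 1.0, 0.9]]
centroids, labels, sse = best_of_runs(points, k=2)
print(labels, round(sse, 3))
```

Repeating the run matters because K-means depends on its random starting centroids, and the sum of squared errors gives a simple way to compare the results.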

  • 7/23/2019 1_5_esteva_slidesi

    12/17

    TAGS and Series Distributions

    [Chart: distribution of tag feature and series feature scores; x-axis: scores, 0 to 5; y-axis: numbers (log scale), 1 to 10,000,000]

  • 7/23/2019 1_5_esteva_slidesi

    13/17

    Clustering Evaluation

    Clustering classification

    5 types of clusters

    Representing different kinds of redundancy

  • 7/23/2019 1_5_esteva_slidesi

    14/17

    Comparing K Values and Clustering Types

    [Chart: x-axis: 21k, 50k, 125k; y-axis: ratio, 0 to 2.5; series: type-1 to type-5]

  • 7/23/2019 1_5_esteva_slidesi

    15/17

    K Values & Action Scoring

    Action scoring

    Confidence of making a data management decision

    Action / no action

    0 no action, 1 needs close processing, 2 complete confidence in decision

    K 21: all actions with low confidence (multiple cluster types within one cluster)

    K 50: highest no action and lowest confidence

    K 125: provided highest confidence in data management actions

    [Chart: action and no action for 21k, 50k and 125k; title and axis labels not recoverable]

  • 7/23/2019 1_5_esteva_slidesi

    16/17

    Qualitative Evaluation

    Establish priorities for close processing

    List of action items: discard, merge, leave as is, reorganize

    Found useful relationships and reconstruction of provenance (3)

    Resolved duplication

    Identified large sections of back-ups in deeply nested sections for deletion (1, 4)

    Identified useful duplicates to leave as is (2)

  • 7/23/2019 1_5_esteva_slidesi

    17/17

    Conclusions

    ER as an approach

    Heuristics for a specific case

    Need a benchmark

    Combine other data attributes for decision making

    Interactive model improvement

    Visual analytics

    Funding: National Archives and Records Administration & Packard Humanities Institute