Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley...

45
Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF Supported by NSF
  • date post

    21-Dec-2015
  • Category

    Documents

  • view

    216
  • download

    0

Transcript of Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley...

Page 1: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Recognizing Recordsfrom the Extracted Cells

of Microfilm Tables

Kenneth M. TubbsDavid W. Embley

Brigham Young University

Supported by NSFSupported by NSF

Page 2: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

MotivationMotivation

Page 3: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

MotivationMotivation

• Millions want microfilm informationMillions want microfilm information– 1880 census on-line, end of October1880 census on-line, end of October– 3 million hits per hour on familysearch.org3 million hits per hour on familysearch.org

• Acquiring information from microfilmAcquiring information from microfilm– Expensive and time consumingExpensive and time consuming– 2.5 million rolls, 20,000 extractors, 100 hours per 2.5 million rolls, 20,000 extractors, 100 hours per

year: requires 104 yearsyear: requires 104 years• Finding a way to automate: big win!Finding a way to automate: big win!

Page 4: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

DifficultiesDifficulties

• DDifferent layouts and styles ifferent layouts and styles

• Different types of dataDifferent types of data

• Sometimes ambiguousSometimes ambiguous

• Type-written labels (OCR)Type-written labels (OCR)

• Hand-written data (?)Hand-written data (?)

Page 5: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Objective: Identify RecordsObjective: Identify Records

• Ontological as well as geometric constraintsOntological as well as geometric constraints• Layout of handwritten valuesLayout of handwritten values• Layout of empty cellsLayout of empty cells

Given a zoned image of a microfilm table, exploit:Given a zoned image of a microfilm table, exploit:

Output field coordinates (labeled with respect to Output field coordinates (labeled with respect to the ontology) and organized into recordsthe ontology) and organized into records

Page 6: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

AlgorithmAlgorithm

SQL Insert Statements

SQL Insert Statements

XML Input File(Preprocessed Microfilm Image)

Genealogical Ontology

InputInput OutputOutputMethodMethod

Generate ConfidenceGenerate

Confidence

EnforceConstraints

EnforceConstraints

VerifyResultsVerifyResults

Page 7: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

““Training” SetTraining” Set

• 25 Tables from 5 different microfilm rolls25 Tables from 5 different microfilm rolls• Used to:Used to:

– Identify relationships between table cells Identify relationships between table cells

– Create genealogical ontologyCreate genealogical ontology

– Define features to extractDefine features to extract

– Generate rules (constraints)Generate rules (constraints)

Page 8: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Input: Microfilm TableInput: Microfilm Table

Page 9: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Input: Microfilm TableInput: Microfilm Table

Page 10: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Input: Microfilm TableInput: Microfilm Table

Input FeaturesInput Features

1.1. Coordinates of each cellCoordinates of each cell

2.2. Printed text for label cellsPrinted text for label cells

3.3. Cell empty or notCell empty or not

Page 11: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Input: Microfilm TableInput: Microfilm Table

<<index index sourcesource="="0444770/0444770_2.gif0444770/0444770_2.gif"" ontologyontology="="ontology.xmlontology.xml">">  

<<cellcell rectrect="="7,131,62,2617,131,62,261"" printed_textprinted_text="="Dwelling-houses number in the order Dwelling-houses number in the order of visitation.of visitation."" emptyempty="="00" />" />   

<<cellcell rectrect="="61,132,118,26061,132,118,260"" printed_textprinted_text="="Families number in order of Families number in order of visitation.visitation."" emptyempty="="00" />" />   

<<cellcell rectrect="="119,132,436,261119,132,436,261"" printed_textprinted_text="="The Name of every Person whose The Name of every Person whose usual place of abode on the first day of June, 1840, was in this usual place of abode on the first day of June, 1840, was in this family.family."" emptyempty="="00" />" />      

<<cellcell rectrect="="62,260,120,29562,260,120,295"" printed_textprinted_text="="22"" emptyempty="="00" />" />   

<<cellcell rectrect="="118,260,436,298118,260,436,298"" printed_textprinted_text="="33"" emptyempty="="00" />" />   

<<cellcell rectrect="="7,458,62,4977,458,62,497"" printed_textprinted_text=""="" emptyempty="="11" />" />

. . .. . .

Page 12: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Genealogical OntologyGenealogical Ontology

Page 13: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Genealogical OntologyGenealogical Ontology

Page 14: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Genealogical OntologyGenealogical Ontology <<OntologyOntology>>

<<ObjectSetObjectSet id id="="00"" name name="="PersonPerson"" syn syn=""="" lex lex="="00"/>"/>

<<ObjectSetObjectSet id id="="11"" name name="="FamilyFamily"" syn syn="="familiesfamilies"" lex lex="="00"/>"/>

<<ObjectSetObjectSet id id="="22"" name name="="EventEvent"" syn syn=""="" lex lex="="00"/>"/>

<<ObjectSetObjectSet id id="="33"" name name="="AgeAge"" syn syn="="age birthdayage birthday"" lex lex="="11"/>"/>

<<ObjectSetObjectSet id id="="44"" name name="="RelationshipRelationship"" syn syn="="relationship relationrelationship relation"" lex lex="="11"/>"/>

<<ObjectSetObjectSet id id="="55"" name name="="Full NameFull Name"" syn syn="="full name whom whofull name whom who"" lex lex="="11"/>"/>

<<ObjectSetObjectSet id id="="66"" name name="="First NameFirst Name"" syn syn="="first given christianfirst given christian"" lex lex="="11"/>"/>

<<ObjectSetObjectSet id id="="77"" name name="="Middle Name(s)Middle Name(s)"" syn syn="="middle initialmiddle initial"" lex lex="="11"/>"/>

<<ObjectSetObjectSet id id="="88"" name name="="Last NameLast Name"" syn syn="="last surnamelast surname"" lex lex="="11"/>"/>

<<ObjectSetObjectSet id id="="99"" name name="="Title(s)Title(s)"" syn syn="="titletitle"" lex lex="="11"/>"/>

. . .. . .

Page 15: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Generate Confidence Generate Confidence MatricesMatrices

• Relationships between pairs of cellsRelationships between pairs of cells

• Confidence values between 0 and 1Confidence values between 0 and 1

Generate Confidence

Generate Confidence

Page 16: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

RelationshipsRelationshipsGenerate Confidence

Generate Confidence

• Label cell describes value cellsLabel cell describes value cells

• Value cells in same row or columnValue cells in same row or column

• Label cells form a multi-level label Label cells form a multi-level label

• Label cells correspond to object setsLabel cells correspond to object sets

• Value factoring and nested valuesValue factoring and nested values

Page 17: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Label Cell and Value CellLabel Cell and Value Cell

A continuous path between a label A continuous path between a label cell and a value cellcell and a value cell

Generate Confidence

Generate Confidence

Label Label

Confidence =Confidence =

1 If a path exists1 If a path exists

0 If no path exists0 If no path exists

Page 18: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Label Cell and Value CellLabel Cell and Value Cell

Preferences for label – value Preferences for label – value orientationsorientations

Generate Confidence

Generate Confidence

Label Orientation Confidence

Above 1

Left .75

Right .5

Below .25

Label

Page 19: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Label Cell and Value CellLabel Cell and Value Cell

Compare the height or width of each Compare the height or width of each label cell with each value celllabel cell with each value cell

Generate Confidence

Generate Confidence

LabelLabelOROR

1100Not SimilarNot Similar SimilarSimilar

Page 20: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Value Cell and Value CellValue Cell and Value Cell(Same Row)(Same Row)

A continuous, A continuous, horizontalhorizontal path exists path exists between a pair of value cellsbetween a pair of value cells

Generate Confidence

Generate Confidence

Confidence =Confidence =

1 If a path exists1 If a path exists

0 If no path exists0 If no path exists

Page 21: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Value Cell and Value Cell Value Cell and Value Cell (Same Column)(Same Column)

A continuous, A continuous, verticalvertical path exists path exists between a label cell and a value cellbetween a label cell and a value cell

Generate Confidence

Generate Confidence

Confidence =Confidence =

1 If a path exists1 If a path exists

0 If no path exists0 If no path exists

Page 22: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Value Cell and Value CellValue Cell and Value Cell(Geometrically Similar )(Geometrically Similar )

Compare height and widthCompare height and width

Generate Confidence

Generate Confidence

1100Not SimilarNot Similar SimilarSimilar

Page 23: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Multi-level LabelsMulti-level Labels

• Distance between the midpoints Distance between the midpoints

• A line through the midpointsA line through the midpoints

• Share a common borderShare a common border

Generate Confidence

Generate Confidence

Page 24: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Match Label Cells to Object SetsMatch Label Cells to Object Sets

• Location of matched wordsLocation of matched words

• Order of matched wordsOrder of matched words

Generate Confidence

Generate Confidence

Full NameFull Name

LocationLocation

DayDay

FamilyFamily

Object SetsObject Sets

Page 25: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Enforce ConstraintsEnforce Constraints

• Rules for geometric and ontological constraintsRules for geometric and ontological constraints

• Examples:Examples:– Same-type value cells have the same dimensions.Same-type value cells have the same dimensions.

– A family can’t have 100 members.A family can’t have 100 members.

• Iterate over the rules, seeking convergenceIterate over the rules, seeking convergence

Generate Confidence

Generate Confidence

EnforceConstraints

EnforceConstraints

Page 26: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Similar Value CellsSimilar Value CellsGenerate Confidence

Generate Confidence

EnforceConstraints

EnforceConstraints

Page 27: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Similar Value CellsSimilar Value CellsGenerate Confidence

Generate Confidence

EnforceConstraints

EnforceConstraints

LowerLowerConfidenceConfidence

Page 28: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Similar Value CellsSimilar Value CellsGenerate Confidence

Generate Confidence

EnforceConstraints

EnforceConstraints

Page 29: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Combine AggregationsCombine AggregationsGenerate Confidence

Generate Confidence

EnforceConstraints

EnforceConstraints

Page 30: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Multi-level LabelsMulti-level LabelsGenerate Confidence

Generate Confidence

EnforceConstraints

EnforceConstraints

Page 31: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

FactoringFactoring

• Observed cardinality in microfilm tableObserved cardinality in microfilm table

• Expected cardinality in genealogy ontologyExpected cardinality in genealogy ontology

Generate Confidence

Generate Confidence

EnforceConstraints

EnforceConstraints

Check Cardinality ConstraintsCheck Cardinality Constraints

Page 32: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Observed CardinalityObserved CardinalityGenerate Confidence

Generate Confidence

EnforceConstraints

EnforceConstraints [First Name] per [Family] = [First Name] per [Family] = 4545 / / 99 = = 4.674.67

. . .. . .

Page 33: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Expected CardinalityExpected Cardinality

[First Name] per [Family] = 4.8 * 1 * 1 = [First Name] per [Family] = 4.8 * 1 * 1 = 4.84.8

Generate Confidence

Generate Confidence

EnforceConstraints

EnforceConstraints

Page 34: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Ontological SimilarityOntological SimilarityGenerate Confidence

Generate Confidence

EnforceConstraints

EnforceConstraints Increase Confidence of Label Increase Confidence of Label

to Object Set Mappingsto Object Set Mappings

Page 35: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Same Microfilm RollSame Microfilm RollGenerate

Confidence

Generate Confidence

EnforceConstraints

EnforceConstraints

Average Confidence Values Across TablesAverage Confidence Values Across Tables

Page 36: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Verify ResultsVerify ResultsGenerate Confidence

Generate Confidence

EnforceConstraints

EnforceConstraints

VerifyResults

VerifyResults

Page 37: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

DatabaseDatabase

Full NameFull Name …

Generate Confidence

Generate Confidence

ApplyRules

ApplyRules

VerifyResults

VerifyResults

INSERT INTO Person (Full Name) VALUES INSERT INTO Person (Full Name) VALUES

('('335,114,521,172335,114,521,172')') INSERT INTO Person (Full Name) VALUES INSERT INTO Person (Full Name) VALUES

('('335,173,521,231335,173,521,231')') …

SQL Statements Insert Value Cell CoordinatesSQL Statements Insert Value Cell Coordinates

Page 38: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

““Training” Set ResultsTraining” Set Results

RelationshipRelationship PrecisionPrecision RecallRecall AccuracyAccuracy

Label Cell Describes Label Cell Describes

Value CellValue Cell100%100% 100%100% 100%100%

Value Cells in Same Value Cells in Same Row or ColumnRow or Column

100%100% 100%100% 100%100%

Multilevel LabelsMultilevel Labels 100%100% 100%100% 100%100%

Label Cells – Object Label Cells – Object Set MatchesSet Matches

100%100% 100%100% 100%100%

FactoringFactoring 74.45%74.45% 100%100% 84.65%84.65%

SQL FieldsSQL Fields 99.42%99.42% 100%100% 99.71%99.71%

Page 39: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Ambiguous FactoringAmbiguous Factoring

Page 40: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

ExperimentsExperiments

• 75 tables from 15 different microfilm rolls75 tables from 15 different microfilm rolls

• Precision, recall, and accuracyPrecision, recall, and accuracy– Populated SQL fieldsPopulated SQL fields– Each relationshipEach relationship

Page 41: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Test Set ResultsTest Set Results

RelationshipRelationship PrecisionPrecision RecallRecall AccuracyAccuracy

Label Cell Describes Label Cell Describes

Value CellValue Cell100%100% 98.12 %98.12 % 98.12 %98.12 %

Value Cells in Same Value Cells in Same Row or ColumnRow or Column

100%100% 100%100% 100%100%

Multilevel LabelsMultilevel Labels 100%100% 99.67%99.67% 99.82%99.82%

Label Cells – Object Label Cells – Object Set MatchesSet Matches

84.98%84.98% 92.76%92.76% 88.1888.18%%

FactoringFactoring 100%100% 93.40%93.40% 93.47%93.47%

SQL FieldsSQL Fields 93.20%93.20% 92.41%92.41% 92.15%92.15%

Page 42: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Factoring over Several Factoring over Several Tables Improved ResultsTables Improved Results

Page 43: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Some Long Label NamesSome Long Label NamesCaused ConfusionCaused Confusion

State here the particular ReligionState here the particular Religionor Religious Denomination,or Religious Denomination,

to which each persons belongs.to which each persons belongs.[Members of Protestant Denomina-[Members of Protestant Denomina-tions are requested not to describetions are requested not to describe

themselves by the vague termthemselves by the vague term‘‘Protestant,’ but to enter theProtestant,’ but to enter the

name of the Particular Church,name of the Particular Church,Denomination, or Body, to whichDenomination, or Body, to whichthey belong.] they belong.]

Page 44: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Ambiguous ColumnsAmbiguous ColumnsCaused ConfusionCaused Confusion

Full NameFull Name

Page 45: Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.

Conclusions

• Identified records in microfilm tables– Geometric and ontological properties– Evidence matrices & corroboration rules

• Accuracy: ~92%

http://www.rdhd.byu.eduhttp://www.fht.byu.edu