Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley...

Post on 21-Dec-2015


Recognizing Records from the Extracted Cells of Microfilm Tables

Kenneth M. Tubbs
David W. Embley

Brigham Young University

Supported by NSF

Motivation

• Millions want microfilm information
– 1880 census on-line, end of October
– 3 million hits per hour on familysearch.org
• Acquiring information from microfilm
– Expensive and time consuming
– 2.5 million rolls, 20,000 extractors, 100 hours per year: requires 104 years
• Finding a way to automate: big win!

Difficulties

• Different layouts and styles
• Different types of data
• Sometimes ambiguous
• Type-written labels (OCR)
• Hand-written data (?)

Objective: Identify Records

Given a zoned image of a microfilm table, exploit:

• Ontological as well as geometric constraints
• Layout of handwritten values
• Layout of empty cells

Output field coordinates (labeled with respect to the ontology) and organized into records

Algorithm

Input: XML Input File (Preprocessed Microfilm Image), Genealogical Ontology
Method: Generate Confidence, Enforce Constraints, Verify Results
Output: SQL Insert Statements

“Training” Set

• 25 Tables from 5 different microfilm rolls
• Used to:
– Identify relationships between table cells
– Create genealogical ontology
– Define features to extract
– Generate rules (constraints)

Input: Microfilm Table

Input Features

1. Coordinates of each cell
2. Printed text for label cells
3. Cell empty or not

Input: Microfilm Table

<index source="0444770/0444770_2.gif" ontology="ontology.xml">
<cell rect="7,131,62,261" printed_text="Dwelling-houses number in the order of visitation." empty="0" />
<cell rect="61,132,118,260" printed_text="Families number in order of visitation." empty="0" />
<cell rect="119,132,436,261" printed_text="The Name of every Person whose usual place of abode on the first day of June, 1840, was in this family." empty="0" />
<cell rect="62,260,120,295" printed_text="2" empty="0" />
<cell rect="118,260,436,298" printed_text="3" empty="0" />
<cell rect="7,458,62,497" printed_text="" empty="1" />
. . .
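As a rough illustration of how the three input features (coordinates, printed text, empty flag) could be read from such a file, here is a minimal Python sketch. The `Cell` class and `load_cells` function are hypothetical names, the interpretation of `rect` as left, top, right, bottom is an assumption, and the closing `</index>` tag is added only so the fragment parses on its own.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Two cells copied from the sample input. The closing </index> tag is an
# addition so that this fragment is well-formed by itself.
SAMPLE = """<index source="0444770/0444770_2.gif" ontology="ontology.xml">
  <cell rect="7,131,62,261" printed_text="Dwelling-houses number in the order of visitation." empty="0" />
  <cell rect="7,458,62,497" printed_text="" empty="1" />
</index>"""


@dataclass
class Cell:
    left: int
    top: int
    right: int
    bottom: int
    printed_text: str
    empty: bool


def load_cells(xml_text):
    """Return one Cell per <cell> element, in document order."""
    cells = []
    for node in ET.fromstring(xml_text).iter("cell"):
        # Assumption: rect holds "left,top,right,bottom" in pixels.
        left, top, right, bottom = (int(v) for v in node.get("rect").split(","))
        cells.append(Cell(left, top, right, bottom,
                          node.get("printed_text", ""),
                          node.get("empty") == "1"))
    return cells


cells = load_cells(SAMPLE)
```

Under these assumptions, cells with non-empty `printed_text` would be the label cells and the remaining cells the candidate value cells.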

Genealogical Ontology

<Ontology>
<ObjectSet id="0" name="Person" syn="" lex="0"/>
<ObjectSet id="1" name="Family" syn="families" lex="0"/>
<ObjectSet id="2" name="Event" syn="" lex="0"/>
<ObjectSet id="3" name="Age" syn="age birthday" lex="1"/>
<ObjectSet id="4" name="Relationship" syn="relationship relation" lex="1"/>
<ObjectSet id="5" name="Full Name" syn="full name whom who" lex="1"/>
<ObjectSet id="6" name="First Name" syn="first given christian" lex="1"/>
<ObjectSet id="7" name="Middle Name(s)" syn="middle initial" lex="1"/>
<ObjectSet id="8" name="Last Name" syn="last surname" lex="1"/>
<ObjectSet id="9" name="Title(s)" syn="title" lex="1"/>
. . .

Generate Confidence Matrices

• Relationships between pairs of cells
• Confidence values between 0 and 1

Relationships

• Label cell describes value cells
• Value cells in same row or column
• Label cells form a multi-level label
• Label cells correspond to object sets
• Value factoring and nested values

Label Cell and Value Cell

A continuous path between a label cell and a value cell

Confidence =
1 If a path exists
0 If no path exists

Label Cell and Value Cell

Preferences for label – value orientations

Label Orientation   Confidence
Above               1
Left                .75
Right               .5
Below               .25
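The orientation preferences above amount to a lookup table. The following Python sketch shows one way such a lookup might work; the function names, the (left, top, right, bottom) rectangle layout, and the rule for deciding orientation are assumptions made for illustration, and only the four confidence values come from the slide.

```python
# Confidence values taken from the orientation table above.
ORIENTATION_CONFIDENCE = {"above": 1.0, "left": 0.75, "right": 0.5, "below": 0.25}


def label_orientation(label, value):
    """Classify where a label rectangle sits relative to a value rectangle.

    Rectangles are (left, top, right, bottom) with y growing downward
    (an assumption about the coordinate system). Vertical separation is
    tested first, so a label that is both above and to the left counts as
    "above". Returns None when the rectangles overlap on both axes.
    """
    l_left, l_top, l_right, l_bottom = label
    v_left, v_top, v_right, v_bottom = value
    if l_bottom <= v_top:
        return "above"
    if l_top >= v_bottom:
        return "below"
    if l_right <= v_left:
        return "left"
    if l_left >= v_right:
        return "right"
    return None


def orientation_confidence(label, value):
    """Look up the preference for this label/value arrangement (0.0 if none)."""
    return ORIENTATION_CONFIDENCE.get(label_orientation(label, value), 0.0)
```

For example, the label cell at `7,131,62,261` lies above the empty cell at `7,458,62,497` from the sample input, so this sketch would assign that pair the highest preference.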

Label Cell and Value Cell

Compare the height or width of each label cell with each value cell

Confidence scale: 0 (Not Similar) to 1 (Similar)
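The slides do not state how the similarity value is computed. One plausible measure, shown here purely as an assumption, is the ratio of the smaller dimension to the larger one, taking the better of the width match and the height match. The function names are hypothetical.

```python
def dimension_similarity(a, b):
    """Ratio of the smaller length to the larger one, in [0, 1]."""
    if a <= 0 or b <= 0:
        return 0.0
    return min(a, b) / max(a, b)


def label_value_similarity(label, value):
    """Best of the width match and the height match ("height or width").

    Rectangles are (left, top, right, bottom); this layout is an assumption.
    """
    label_w, label_h = label[2] - label[0], label[3] - label[1]
    value_w, value_h = value[2] - value[0], value[3] - value[1]
    return max(dimension_similarity(label_w, value_w),
               dimension_similarity(label_h, value_h))
```

A column label and the value cells beneath it usually share a width, so a measure like this would score them near 1 even when their heights differ.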

Value Cell and Value Cell (Same Row)

A continuous, horizontal path exists between a pair of value cells

Confidence =
1 If a path exists
0 If no path exists

Value Cell and Value Cell (Same Column)

A continuous, vertical path exists between a pair of value cells

Confidence =
1 If a path exists
0 If no path exists
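The slides do not define "continuous path" precisely. The sketch below assumes it means a chain of cells that abut one another edge to edge (within a small pixel tolerance) and that all overlap the vertical band of both endpoints. It covers the same-row case; the same-column case would follow by swapping the x and y coordinates. The function names and the `tol` parameter are hypothetical.

```python
def overlaps(a_lo, a_hi, b_lo, b_hi):
    """True if the intervals [a_lo, a_hi] and [b_lo, b_hi] share some length."""
    return min(a_hi, b_hi) - max(a_lo, b_lo) > 0


def same_row_confidence(a, b, cells, tol=3):
    """Return 1.0 if abutting cells link a to b left to right, else 0.0.

    Cells are (left, top, right, bottom) tuples; tol is the pixel gap
    allowed between neighbouring cell edges.
    """
    if a[0] > b[0]:
        a, b = b, a  # make a the left-hand cell
    if not overlaps(a[1], a[3], b[1], b[3]):
        return 0.0  # the two cells do not share a row band
    reach = a[2]  # right edge reached so far
    for c in sorted(cells, key=lambda cell: cell[0]):
        in_band = (overlaps(c[1], c[3], a[1], a[3])
                   and overlaps(c[1], c[3], b[1], b[3]))
        if in_band and c[0] > a[0] and abs(c[0] - reach) <= tol:
            reach = c[2]
            if c == b:
                return 1.0
    return 0.0
```

A gap wider than `tol` between neighbouring cells breaks the chain, which is how this sketch would separate two tables printed side by side on one image.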

Value Cell and Value Cell (Geometrically Similar)

Compare height and width

Confidence scale: 0 (Not Similar) to 1 (Similar)

Multi-level Labels

• Distance between the midpoints
• A line through the midpoints
• Share a common border
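Two of the three multi-level label features listed above, midpoint distance and a shared border, can be computed directly from the cell rectangles. The Python sketch below is one possible reading; the function names, the (left, top, right, bottom) layout, and the pixel tolerance are assumptions, and the "line through the midpoints" feature is left out because the slides do not say how it is used.

```python
import math


def midpoint(rect):
    """Centre of a (left, top, right, bottom) rectangle."""
    return ((rect[0] + rect[2]) / 2, (rect[1] + rect[3]) / 2)


def midpoint_distance(a, b):
    """Euclidean distance between the centres of two cells."""
    (ax, ay), (bx, by) = midpoint(a), midpoint(b)
    return math.hypot(ax - bx, ay - by)


def share_border(upper, lower, tol=3):
    """True if upper's bottom edge meets lower's top edge with horizontal overlap."""
    touching = abs(upper[3] - lower[1]) <= tol
    overlap = min(upper[2], lower[2]) - max(upper[0], lower[0]) > 0
    return touching and overlap


# Two label cells from the sample input: the "Families ..." label and the
# column number "2" printed directly beneath it.
families_label = (61, 132, 118, 260)
column_number = (62, 260, 120, 295)
stacked = share_border(families_label, column_number)
distance = midpoint_distance(families_label, column_number)
```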

Match Label Cells to Object Sets

• Location of matched words
• Order of matched words

Object Sets: Full Name, Location, Day, Family
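To make the word-matching idea concrete, the following Python sketch scores each object set by where its first synonym appears in a label. It assumes the ontology's `syn` attribute is a space-separated word list, and the linear position weighting is an invented stand-in for the "location of matched words" feature, not the weighting used in the work. `SYNONYMS` holds only a subset of the object sets.

```python
import re

# Subset of the ontology's object sets, with syn split on whitespace
# (an assumption about how the attribute is meant to be read).
SYNONYMS = {
    "Family": ["families"],
    "Age": ["age", "birthday"],
    "Full Name": ["full", "name", "whom", "who"],
    "Last Name": ["last", "surname"],
}


def match_scores(label_text):
    """Score each object set by how early its first synonym appears."""
    words = re.findall(r"[a-z]+", label_text.lower())
    scores = {}
    for object_set, synonyms in SYNONYMS.items():
        positions = [i for i, word in enumerate(words) if word in synonyms]
        if positions:
            # Assumed weighting: a match on the first word scores 1.0 and
            # the score decays linearly toward the end of the label.
            scores[object_set] = 1.0 - positions[0] / len(words)
    return scores


family_scores = match_scores("Families number in order of visitation.")
name_scores = match_scores("The Name of every Person whose usual place of abode")
```

Long labels dilute any position-based score like this one, which is consistent with the confusion the test results report for long label names.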

Enforce Constraints

• Rules for geometric and ontological constraints
• Examples:
– Same-type value cells have the same dimensions.
– A family can’t have 100 members.
• Iterate over the rules, seeking convergence
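"Iterate over the rules, seeking convergence" suggests a fixed-point loop. The Python sketch below shows the general shape under the assumption that each rule maps a confidence table to an updated table. The two rules are hypothetical placeholders, not the rules used in the work, and `max_rounds` and `eps` are invented parameters.

```python
def enforce_constraints(confidence, rules, max_rounds=50, eps=1e-6):
    """Apply every rule in turn until no entry changes by more than eps.

    Returns the final confidence table and the number of rounds used.
    """
    for round_number in range(1, max_rounds + 1):
        previous = confidence
        for rule in rules:
            confidence = rule(confidence)  # each rule returns a new dict
        change = max((abs(confidence[k] - previous.get(k, 0.0))
                      for k in confidence), default=0.0)
        if change <= eps:
            return confidence, round_number
    return confidence, max_rounds


def drop_weak_pairs(confidence):
    # Hypothetical rule: treat anything below 0.5 as ruled out.
    return {pair: (0.0 if c < 0.5 else c) for pair, c in confidence.items()}


def reinforce_strong_pairs(confidence):
    # Hypothetical rule: move surviving pairs halfway toward 1.0.
    return {pair: (c + (1.0 - c) / 2 if c >= 0.5 else c)
            for pair, c in confidence.items()}


start = {("a", "b"): 0.6, ("a", "c"): 0.3}
final, rounds = enforce_constraints(start, [drop_weak_pairs, reinforce_strong_pairs])
```

The round limit guards against rule sets that oscillate instead of settling.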

Similar Value Cells

Lower Confidence

Combine Aggregations

Multi-level Labels

Factoring

• Observed cardinality in microfilm table
• Expected cardinality in genealogy ontology

Check Cardinality Constraints

Observed Cardinality

[First Name] per [Family] = 45 / 9 = 4.67
. . .

Expected Cardinality

[First Name] per [Family] = 4.8 * 1 * 1 = 4.8
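A cardinality check of this kind compares a ratio observed in the table with one derived from the ontology. The Python sketch below multiplies the cardinalities along an ontology path, as in the expected-cardinality line above, and then scores the agreement as a ratio. The agreement measure and both function names are assumptions; the slides do not say how the two values are compared.

```python
import math


def expected_cardinality(path_cardinalities):
    """Multiply the average cardinalities along an ontology path."""
    return math.prod(path_cardinalities)


def cardinality_agreement(observed, expected):
    """1.0 when observed equals expected, falling toward 0 as they diverge."""
    if observed <= 0 or expected <= 0:
        return 0.0
    return min(observed, expected) / max(observed, expected)


# Values quoted on the slide: an observed 4.67 first names per family
# against an expected 4.8 * 1 * 1.
expected = expected_cardinality([4.8, 1, 1])
agreement = cardinality_agreement(4.67, expected)
```

A value near 1 would support the proposed factoring, while a value far below 1 (for instance, 100 members in one family) would count against it.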

Ontological Similarity

Increase Confidence of Label to Object Set Mappings

Same Microfilm Roll

Average Confidence Values Across Tables
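Averaging across tables from the same roll can be sketched as a grouped mean. In the Python below, the function name, the roll identifiers, and the mapping keys are all hypothetical, and a mapping that is missing from a table is treated as contributing 0 to that roll's mean, which is an assumption.

```python
from collections import defaultdict


def average_by_roll(table_confidences):
    """Average confidence values over all tables from the same roll.

    table_confidences: iterable of (roll_id, {mapping: confidence}).
    Returns {roll_id: {mapping: mean confidence over that roll's tables}}.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for roll_id, confidence in table_confidences:
        counts[roll_id] += 1
        for mapping, value in confidence.items():
            sums[roll_id][mapping] += value
    return {roll: {m: total / counts[roll] for m, total in per_roll.items()}
            for roll, per_roll in sums.items()}


tables = [
    ("roll-A", {("column 1", "Family"): 0.9}),
    ("roll-A", {("column 1", "Family"): 0.5}),
    ("roll-B", {("column 1", "Age"): 0.4}),
]
averaged = average_by_roll(tables)
```

Tables on one roll tend to share a printed form, so pooling them lets a clear table compensate for an ambiguous one.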

Verify Results

Database
Full Name …

INSERT INTO Person (Full Name) VALUES ('335,114,521,172')
INSERT INTO Person (Full Name) VALUES ('335,173,521,231')
…

SQL Statements Insert Value Cell Coordinates
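Statements of the form shown above can be produced with simple string formatting. This Python sketch reproduces the slide's layout exactly; the function name is hypothetical, and the table and column names are passed through unquoted, as on the slide.

```python
def insert_statement(table, column, rect):
    """Format one INSERT that stores a value cell's coordinates as text."""
    coords = ",".join(str(v) for v in rect)
    return f"INSERT INTO {table} ({column}) VALUES ('{coords}')"


statements = [
    insert_statement("Person", "Full Name", (335, 114, 521, 172)),
    insert_statement("Person", "Full Name", (335, 173, 521, 231)),
]
```

A column name containing a space would normally need identifier quoting before a real database accepts it, so this output mirrors the slide rather than any particular SQL dialect.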

“Training” Set Results

Relationship                        Precision   Recall   Accuracy
Label Cell Describes Value Cell     100%        100%     100%
Value Cells in Same Row or Column   100%        100%     100%
Multilevel Labels                   100%        100%     100%
Label Cells – Object Set Matches    100%        100%     100%
Factoring                           74.45%      100%     84.65%
SQL Fields                          99.42%      100%     99.71%

Ambiguous Factoring

Experiments

• 75 tables from 15 different microfilm rolls
• Precision, recall, and accuracy
– Populated SQL fields
– Each relationship
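For reference, precision and recall over a set of predicted relationships can be computed as in the Python sketch below, using the standard definitions. The slides do not define their accuracy measure, so it is not reproduced here, and the function name and example sets are hypothetical.

```python
def precision_recall(predicted, actual):
    """Standard precision and recall over two collections of items."""
    predicted, actual = set(predicted), set(actual)
    correct = len(predicted & actual)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical example: four fields populated, three of them correct,
# and no correct field missed.
precision, recall = precision_recall({"a", "b", "c", "d"}, {"a", "b", "c"})
```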

Test Set Results

Relationship                        Precision   Recall   Accuracy
Label Cell Describes Value Cell     100%        98.12%   98.12%
Value Cells in Same Row or Column   100%        100%     100%
Multilevel Labels                   100%        99.67%   99.82%
Label Cells – Object Set Matches    84.98%      92.76%   88.18%
Factoring                           100%        93.40%   93.47%
SQL Fields                          93.20%      92.41%   92.15%

Factoring over Several Tables Improved Results

Some Long Label Names Caused Confusion

State here the particular Religion or Religious Denomination, to which each persons belongs. [Members of Protestant Denominations are requested not to describe themselves by the vague term ‘Protestant,’ but to enter the name of the Particular Church, Denomination, or Body, to which they belong.]

Ambiguous Columns Caused Confusion

Full Name

Conclusions

• Identified records in microfilm tables
– Geometric and ontological properties
– Evidence matrices & corroboration rules

• Accuracy: ~92%

http://www.rdhd.byu.edu
http://www.fht.byu.edu