Recognizing Table Structure from the Extracted Cells of Genealogical Microfilm
Recognizing Records from the Extracted Cells of Genealogical Microfilm Tables
Kenneth Martin Tubbs Jr.
A Thesis Submitted to the Faculty of Brigham Young University

Motivation
• Millions of people want genealogical information
• Acquiring microfilm is expensive and time consuming

Extraction Problem
• Searching microfilm by hand is slow, error prone, and tedious
• Extraction by hand requires enormous amounts of time and manpower

Difficulties
• Tables have different layouts and styles
• Tables contain different records
• Tables do not use a uniform schema
• Tables lack information and are ambiguous

Related Work
• Current work exploits the geometric properties of tables
• Regular expressions, grammars, probabilistic models, and templates
• They ignore the ontological constraints of this information

Contributions
• Exploit both ontological and geometric constraints
• Identify complex records
• Work with tables with hand-written values

Algorithm
Input: XML Input File (Preprocessed Microfilm Image), Genealogical Ontology
Method: Generate Confidences, Enforce Constraints, Verify Results
Output: SQL Insert Statements

Training Set
• 25 tables from 5 different microfilm rolls
• Used to:
– Identify relationships between table cells
– Create genealogical ontology
– Define features to extract
– Generate rules (constraints)

Input: Microfilm Table
Input Features
1. Coordinates of each cell.
2. Printed text for label cells.
3. Whether or not each value cell is empty.

Input: Microfilm Table
<index source="0444770/0444770_2.gif" ontology="ontology.xml">
<cell rect="7,131,62,261" printed_text="Dwelling-houses number in the order of visitation." empty="0" />
<cell rect="61,132,118,260" printed_text="Families number in order of visitation." empty="0" />
<cell rect="119,132,436,261" printed_text="The Name of every Person whose usual place of abode on the first day of June, 1840, was in this family." empty="0" />
<cell rect="62,260,120,295" printed_text="2" empty="0" />
<cell rect="118,260,436,298" printed_text="3" empty="0" />
<cell rect="7,458,62,497" printed_text="" empty="1" />
. . .
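
As a rough illustration of how the three input features could be read from this XML, here is a minimal Python sketch. It assumes the `rect` attribute is ordered left, top, right, bottom (the slides do not say), and it closes the `<index>` element so the sample parses. The function name `parse_cells` is hypothetical and is not part of the thesis tool.

```python
import xml.etree.ElementTree as ET

# Sample cells copied from the slide; the closing </index> tag is added here
# only so that the excerpt is well-formed.
SAMPLE = """<index source="0444770/0444770_2.gif" ontology="ontology.xml">
  <cell rect="7,131,62,261" printed_text="Dwelling-houses number in the order of visitation." empty="0" />
  <cell rect="62,260,120,295" printed_text="2" empty="0" />
  <cell rect="7,458,62,497" printed_text="" empty="1" />
</index>"""


def parse_cells(xml_text):
    """Return one dict per <cell>: rectangle, printed text, and empty flag."""
    cells = []
    for node in ET.fromstring(xml_text).iter("cell"):
        # Assumption: rect is "left,top,right,bottom" in image pixels.
        left, top, right, bottom = (int(v) for v in node.get("rect").split(","))
        cells.append({
            "rect": (left, top, right, bottom),
            "printed_text": node.get("printed_text", ""),
            "empty": node.get("empty") == "1",
        })
    return cells


cells = parse_cells(SAMPLE)
```

Each dictionary carries exactly the three features listed on the Input Features slide, so later steps can work from these records alone.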

Genealogical Ontology
<Ontology>
<ObjectSet id="0" name="Person" syn="" lex="0"/>
<ObjectSet id="1" name="Family" syn="families" lex="0"/>
<ObjectSet id="2" name="Event" syn="" lex="0"/>
<ObjectSet id="3" name="Age" syn="age birthday" lex="1"/>
<ObjectSet id="4" name="Relationship" syn="relationship relation" lex="1"/>
<ObjectSet id="5" name="Full Name" syn="full name whom who" lex="1"/>
<ObjectSet id="6" name="First Name" syn="first given christian" lex="1"/>
<ObjectSet id="7" name="Middle Name(s)" syn="middle initial" lex="1"/>
<ObjectSet id="8" name="Last Name" syn="last surname" lex="1"/>
<ObjectSet id="9" name="Title(s)" syn="title" lex="1"/>
. . .

Generate Confidences
• Confidence of relationships between pairs of cells
• Generate confidence values between 0 and 1

Relationships
• A label cell describes a value cell
• Value cells in same row or column
• Label cells form a multi-level label
• A label cell maps to an object set
• Identify factoring

Label Cell and Value Cell
A continuous path between a label cell and a value cell
Confidence =
1 if a path exists
0 if no path exists

Label Cell and Value Cell
Preferences for label–value orientations

| Label Orientation | Confidence |
|---|---|
| Above | 1 |
| Left | .75 |
| Right | .5 |
| Below | .25 |
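
The orientation preference can be sketched in Python as a simple lookup. The confidence values come from the slide. How the orientation itself is decided is not stated, so the center-displacement rule below, the `(left, top, right, bottom)` rectangle order, and the function names are assumptions made for illustration.

```python
# Confidence values taken from the orientation table on the slide.
ORIENTATION_CONFIDENCE = {"above": 1.0, "left": 0.75, "right": 0.5, "below": 0.25}


def label_orientation(label, value):
    """Say where the label rectangle lies relative to the value rectangle.

    Rectangles are (left, top, right, bottom) with y growing downward;
    both conventions are assumptions about the input coordinates.
    """
    l_left, l_top, l_right, l_bottom = label
    v_left, v_top, v_right, v_bottom = value
    # Displacement from the label's center to the value's center.
    dx = (v_left + v_right) / 2 - (l_left + l_right) / 2
    dy = (v_top + v_bottom) / 2 - (l_top + l_bottom) / 2
    # The larger displacement decides between vertical and horizontal.
    if abs(dy) >= abs(dx):
        return "above" if dy > 0 else "below"
    return "left" if dx > 0 else "right"


def orientation_confidence(label, value):
    """Look up the slide's preference for this label-value orientation."""
    return ORIENTATION_CONFIDENCE[label_orientation(label, value)]
```

For example, a label at `(7, 131, 62, 261)` sits above a value cell at `(7, 458, 62, 497)`, so that pair receives the highest preference.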

Label Cell and Value Cell
Compare the height or width of each label cell with each value cell
Confidence ranges from 0 (Not Similar) to 1 (Similar)
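
One way to turn the height-or-width comparison into a value between 0 and 1 is a ratio of the smaller dimension to the larger one. The slides do not give the actual formula, so the ratio measure, the choice of taking the better of the two dimensions, and the name `size_similarity` are assumptions in this Python sketch.

```python
def size_similarity(label, value):
    """Score in [0, 1] for how closely a label and value cell agree in
    height or width. Rectangles are (left, top, right, bottom)."""

    def ratio(a, b):
        # Smaller over larger: 1.0 when equal, near 0 when very different.
        return min(a, b) / max(a, b) if max(a, b) > 0 else 0.0

    label_w, label_h = label[2] - label[0], label[3] - label[1]
    value_w, value_h = value[2] - value[0], value[3] - value[1]
    # "Height or width": keep whichever dimension agrees better.
    return max(ratio(label_w, value_w), ratio(label_h, value_h))
```

A column label and the value cells beneath it usually share a width, so they score 1 even when their heights differ.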

Value Cell and Value Cell (Same Row)
A continuous, horizontal path exists between a pair of value cells
Confidence =
1 if a path exists
0 if no path exists

Value Cell and Value Cell (Same Column)
A continuous, vertical path exists between a pair of value cells
Confidence =
1 if a path exists
0 if no path exists

Value Cell and Value Cell (Geometrically Similar)
Compare height and width
Confidence ranges from 0 (Not Similar) to 1 (Similar)

Multi-level Labels
• Distance between the midpoints
• A line through the midpoints
• Share a common border

Match Label Cells to Object Sets
• Match synonyms of object sets to words in a label
– Location of matched words
– Order that object sets match words
Object Sets: Full Name, Location, Day, Family
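
The synonym matching step can be sketched in Python as follows. The synonym lists are copied from the ontology excerpt shown earlier. The tokenization, the function name `match_object_sets`, and the decision to report only word positions (rather than a confidence, whose formula the slides do not give) are assumptions.

```python
import re

# Synonym lists copied from the ontology excerpt (a subset of object sets).
SYNONYMS = {
    "Family": "families",
    "Age": "age birthday",
    "Full Name": "full name whom who",
    "First Name": "first given christian",
    "Last Name": "last surname",
}


def match_object_sets(label_text, synonyms=SYNONYMS):
    """Return (object set, positions of matched words), ordered by where
    each object set first matches a word in the label."""
    words = re.findall(r"[a-z]+", label_text.lower())
    matches = []
    for name, syn in synonyms.items():
        wanted = set(syn.split())
        positions = [i for i, word in enumerate(words) if word in wanted]
        if positions:
            matches.append((name, positions))
    # Order by the location of the first matched word.
    matches.sort(key=lambda match: match[1][0])
    return matches
```

Run on the census label "The Name of every Person whose usual place of abode on the first day of June, 1840, was in this family.", this sketch matches Full Name on "name" but also First Name on "first", which shows how plain word matching can produce spurious label-to-object-set candidates that later constraints have to filter out.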

Enforce Constraints
• A set of rules describes geometric and ontological constraints.
• For example:
– Value cells of the same type have the same dimensions
– A family can’t have 100 members
• The algorithm iterates over the rules

1. Similar Value Cells
Lower Confidence

2. Combine Aggregations

3. Multi-level Labels

4. Factoring
• Observed cardinality:
– microfilm table
• Expected cardinality:
– genealogy ontology
Check Cardinality Constraints

Observed Cardinality
[First Name] per [Family] = 45 / 9 = 4.67
. . .

Expected Cardinality
[First Name] per [Family] = 4.8 * 1 * 1 = 4.8
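
The cardinality check on the factoring slides compares a ratio counted in the table with a ratio derived from the ontology. A minimal Python sketch follows. The expected value is the product of factors shown on the slide (4.8 * 1 * 1). The agreement score, the function names, and the sample counts in the usage note are hypothetical, since the slides do not state how close the two values must be.

```python
from math import prod


def expected_cardinality(path_factors):
    """Multiply the per-relationship cardinalities along an ontology path,
    as in the slide's 4.8 * 1 * 1."""
    return prod(path_factors)


def observed_cardinality(n_values, n_groups):
    """Ratio counted in the table, e.g. first-name cells per family."""
    return n_values / n_groups


def cardinality_agreement(observed, expected):
    """Assumed score in [0, 1]; 1 means the table matches the ontology."""
    if observed <= 0 or expected <= 0:
        return 0.0
    return min(observed, expected) / max(observed, expected)
```

With hypothetical counts of 24 first-name cells across 5 families, the observed cardinality is 4.8, which agrees exactly with the expected 4.8 and would support that factoring.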

5. Ontological Similarity
Increase Confidence of Label to Object Set Mappings

6. Same Microfilm Roll
• Tables from the same microfilm roll have the same structure and relationships
• Generate the confidence values for multiple tables from the same roll
• Take the average of the respective confidence values
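
Averaging confidences across tables from the same roll might look like the Python sketch below. It assumes each table's confidences are stored in a dictionary keyed by the relationship they describe. That representation and the name `average_confidences` are illustrative choices, not taken from the thesis.

```python
def average_confidences(per_table):
    """Average each relationship's confidence over the tables that report it.

    per_table: list of dicts, one per table from the same roll, mapping a
    relationship key (here a pair of cell names) to a confidence in [0, 1].
    """
    totals, counts = {}, {}
    for confidences in per_table:
        for key, value in confidences.items():
            totals[key] = totals.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}


# Two hypothetical tables from one roll.
roll = [
    {("label_1", "value_1"): 1.0, ("label_2", "value_1"): 0.5},
    {("label_1", "value_1"): 0.5, ("label_2", "value_1"): 0.0},
]
averaged = average_confidences(roll)
```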

Verify Results

Database
• Create SQL Insert statements to store value cell coordinates
…
INSERT INTO Person (Full Name) VALUES ('335,114,521,172')
INSERT INTO Person (Full Name) VALUES ('335,173,521,231')
…
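
Producing statements of this shape from value cell rectangles is straightforward. The Python sketch below reproduces the format shown on the slide. The function name `insert_statement` is hypothetical, and plain string formatting is used only because the output is a text file of statements; code that executes SQL directly should use parameterized queries instead.

```python
def insert_statement(table, column, rect):
    """Build one INSERT statement that stores a value cell's coordinates."""
    coords = ",".join(str(v) for v in rect)
    return f"INSERT INTO {table} ({column}) VALUES ('{coords}')"


# The two value cells shown on the slide.
statements = [
    insert_statement("Person", "Full Name", rect)
    for rect in [(335, 114, 521, 172), (335, 173, 521, 231)]
]
```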

Training Set Results

| Relationship | Precision | Recall | Accuracy |
|---|---|---|---|
| Label Cell Describes Value Cell | 100% | 100% | 100% |
| Value Cells in Same Row or Column | 100% | 100% | 100% |
| Multilevel Labels | 100% | 100% | 100% |
| Label Cells – Object Set Matches | 74.45% | 100% | 84.65% |
| Factoring | 100% | 100% | 100% |
| SQL Fields | 99.42% | 100% | 99.71% |

Ambiguous Factoring

Experiments
• 75 tables from 15 different microfilm rolls
• Precision, recall, and accuracy
– Populated SQL fields
– Each relationship
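
For reference, precision and recall over a set of predicted relationships can be computed as in the Python sketch below, using the standard definitions. The slides do not define how accuracy was calculated, so it is left out here. The function name and the sample data are hypothetical.

```python
def precision_recall(predicted, actual):
    """Standard precision and recall over two collections of items."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical example: four predicted label-to-object-set matches,
# three of which are correct, and no correct match is missed.
precision, recall = precision_recall(
    predicted={"a", "b", "c", "d"},
    actual={"a", "b", "c"},
)
```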

Test Set Results

| Relationship | Precision | Recall | Accuracy |
|---|---|---|---|
| Label Cell Describes Value Cell | 100% | 98.12% | 98.12% |
| Value Cells in Same Row or Column | 100% | 100% | 100% |
| Multilevel Labels | 100% | 99.67% | 99.82% |
| Label Cells – Object Set Matches | 84.98% | 92.76% | 88.18% |
| Factoring | 100% | 93.40% | 93.47% |
| SQL Fields | 93.20% | 92.41% | 92.15% |

3 Success Examples
1. Specialized Record
2. Ontology Constraints
3. Factoring

1. Specialized Records
INSERT INTO PERSON (Person_Identifier, Full_Name, Age, Gender, Occupation, Race, Family_Identifier, Birth_Identifier) VALUES (1, '109,455,267,478', '314,456,336,479', '291,456,314,478', '505,457,637,480', '267,456,291,478', 1, 1)
INSERT INTO PERSON (Person_Identifier, Birth_Identifier) VALUES (2, 2)
INSERT INTO PERSON (Person_Identifier, Birth_Identifier) VALUES (3, 3)
INSERT INTO MOTHER_CHILD (Mother_Identifier, Child_Identifier) VALUES (3, 1)
INSERT INTO FATHER_CHILD (Father_Identifier, Child_Identifier) VALUES (2, 1)
INSERT INTO EVENT (Event_Identifier, Location) VALUES (1, '894,460,997,483')
INSERT INTO EVENT (Event_Identifier, Location) VALUES (2, '997,460,1076,483')
INSERT INTO EVENT (Event_Identifier, Location) VALUES (3, '1076,461,1153,484')

2. Ontology Constraints
INSERT INTO PERSON (Person_Identifier, Full_Name, Age, Family_Identifier, Burial_Identifier) VALUES (1, '70,243,331,373', '620,243,687,370', 1, 1)
INSERT INTO FAMILY (Family_Identifier, Location) VALUES (1, '331,243,508,372')
INSERT INTO EVENT (Event_Identifier, Date) VALUES (1, '508,243,620,371')
INSERT INTO PERSON (Person_Identifier, Full_Name) VALUES (2, '687,241,861,372')

3. Factoring

3 Types of Errors
1. Ambiguous Factoring
2. Long Label Names
3. Ambiguous Columns

2. Long Label Names

3. Ambiguous Columns

Artifacts
• Tool in the Java programming language
• http://www.rdhd.byu.edu/
• Executable Jar File
• Source Code
• Input Files
• Documentation

Future Work
• Advanced natural language processing
• Hand-written values
• Machine learning

Recognizing Table Structure from the Extracted Cells of Genealogical Microfilm
Kenneth Martin Tubbs Jr.
A Thesis Presented to the Department of Computer Science
Brigham Young University