Automatic Extraction From and Reasoning About Genealogical Records: A Prototype

40
1 Automatic Extraction Automatic Extraction From and Reasoning About From and Reasoning About Genealogical Records: A Genealogical Records: A Prototype Prototype By Charla J. Woodbury A thesis submitted to the faculty of Brigham Young University in partial fulfillment of the requirements for the degree of Master of Science

description

Automatic Extraction From and Reasoning About Genealogical Records: A Prototype. By Charla J. Woodbury A thesis submitted to the faculty of Brigham Young University in partial fulfillment of the requirements for the degree of Master of Science. Digital Images – Human Index. - PowerPoint PPT Presentation

Transcript of Automatic Extraction From and Reasoning About Genealogical Records: A Prototype

Page 1: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

1

Automatic Extraction From Automatic Extraction From and Reasoning About and Reasoning About

Genealogical Records: A Genealogical Records: A PrototypePrototype

By Charla J. Woodbury

A thesis submitted to the faculty ofBrigham Young University

in partial fulfillment of the requirements for the degree of

Master of Science

Page 2: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

22

Digital Images – Human Digital Images – Human IndexIndex

• Large number of competing family history websites•Digital images•Human indexes

• Researchers hunting through records and indexes to put families together

Page 3: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

33

ProblemProblem Large amounts of primary genealogical Large amounts of primary genealogical

datadata

Big projects to index and extract recordsBig projects to index and extract records

Two independent indexers and Two independent indexers and adjudicationadjudication

Millions of human hours used to index or Millions of human hours used to index or match records for names and familiesmatch records for names and families

Page 4: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

44

Automated Extraction Automated Extraction SolutionSolution

Create a specialized extraction Create a specialized extraction ontology to interpret and label ontology to interpret and label genealogical datagenealogical data

Add rules and logic thatAdd rules and logic that Label family roles - husband, daughter, Label family roles - husband, daughter,

etc.etc. Link family relationshipsLink family relationships

HUSBAND – WIFEHUSBAND – WIFE PARENT – CHILDPARENT – CHILD

Page 5: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

5

OutlineOutline1.1. Data PreparationData Preparation2.2. Ontology Extraction System Ontology Extraction System

(OntoES)(OntoES)3.3. OWL File and SWRL RulesOWL File and SWRL Rules4.4. SPARQL QueriesSPARQL Queries5.5. Experimental ResultsExperimental Results6.6. ConclusionsConclusions

5

Page 6: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

66

1. Data Preparation1. Data Preparation

Collect machine-readable records from Collect machine-readable records from three different countriesthree different countries

Format in HTML format for extractionFormat in HTML format for extraction

Prepare lexicons for names, places, etc.Prepare lexicons for names, places, etc.

Page 7: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

77

New England Vital Records New England Vital Records – Beverly, Massachusetts – Beverly, Massachusetts

1668-18491668-1849

Page 8: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

88

Danish Parish – Maglebye, Praesto

1646-1813

Page 9: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

99

English Parish – South English Parish – South Petherton, Somersetshire Petherton, Somersetshire

1574-19011574-1901

Page 10: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

1010

same day 1576 Nicholas Patch and Christian Denman 26 Jan 1605 Richard Patch and Joan Lavor 25-Sep 1613 John Elliott and Joan Woodbery 7-Aug 1615 Thomas Prime and Maria Parry 29-Jan 1616 William Woodbery and Elizabeth Patch 2-May 1620 William Hillerd and Fortu: Patch 17-Sep 1622 Nicholas Patch and Elizabeth Owsley 22-Jan 1627 Richard Patch and Mary White 15-Jan 1630 Andrew Elliott and Joan Patch 12-Feb 1639 Andrew Elliott and Joan Pitts

SOUTH PETHERTON SOUTH PETHERTON MARRIAGES (from genuki)MARRIAGES (from genuki)

Page 11: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

11

Transcript of the original record

1576/1577 eodem die Nicholaus Patch Christinam Denman 26 Jan 1605 Richard Patch et Joanna Lavor 1613 Septembris 26 Johannes Elliott et Joanna Woodbery matrimonis

cominguntur 1615 Augusti 7 Thoms Prime et Maria Patch matrimonio cominguntur 1616/1617 Januarij 29 Wilhelmus Woodbery et Elizabetha Patch matrimonio

cominguntur 1620 Maij 2 : Wilhelmus Hillerd et Fortu: Patch 1622 Septembris 17 Nicholas Patch et Elizabetha Owsley matrimonio

cominguntur 1627/1628 Januarij 22 : Richardus Patch et Maria White matrimonio

cominguntur 1630/1631 Januarij 15 Andreas Elliott et Joanna Patch matrimonio

cominguntur 1639/1640 Februarij 12 Andreas Elliott et Joanna Pittes matrimonio

cominguntur

Page 12: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

1212

2. Ontology Extraction 2. Ontology Extraction SystemSystem

OntoESOntoES: automatically interpret and : automatically interpret and correctly label genealogical data correctly label genealogical data usingusing

Data framesData framesRegular expressions Regular expressions LexiconsLexiconsDate conversion methodsDate conversion methods

Page 13: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

1313

Marriage OntologyMarriage Ontology

Page 14: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

1414

Data Frame EditorData Frame Editor

Page 15: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

15

Regular expressions MARDATE

Value expressionEXAMPLE type 25-Sep 1613

(0\d|1\d|2\d|30|31|\d)-{Month}\.?\s*(\d\d\d\d)

Keyword expression (\b(md\.?|marry|marriage|married|maried|

wed|wedding)\b)

Page 16: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

1616

SampleSample MONTH MONTH LEXICONLEXICON

1Ober1Ober 7ber7ber 8ber8ber 9ber9ber aprapr aprilapril aprilisaprilis augaug augustaugust augustiaugusti augustusaugustus avravr avrilavril avrilisavrilis decdec decemberdecember

decembrdecembr decembredecembre decembridecembri febfeb febrfebr februarifebruari februaryfebruary janjan januarijjanuarij januaryjanuary juljul julijuli juliusjulius julyjuly junjun junejune

Page 17: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

1717

Object LevelObject Level

Page 18: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

1818

CANONICALIZATION CANONICALIZATION METHODSMETHODS

inside the ontologyinside the ontology Regularize date (Julian format: Regularize date (Julian format:

YYYYdddYYYYddd))

1620 2-May 1620 2-May →→ 16200931620093

Display stored Julian format as DD MMM YYYY

1620093 →→ 2 MAY 1620

Page 19: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

1919

Feast DatesFeast DatesDates expressed as a holy dayDates expressed as a holy day Fixed DatesFixed Dates

Christmas 1720 Christmas 1720 →→ 25 DEC 172025 DEC 1720

Moveable Dates around Easter Moveable Dates around Easter (36 possible Easter dates with leap year (36 possible Easter dates with leap year

variation)variation) 1723 Dnica Septuagesima1723 Dnica Septuagesima →→ 24 JAN 24 JAN

17231723

Same day as previous entrySame day as previous entry

Page 20: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

2020

Run Ontology Run Ontology InputInput

Ontology Ontology (Created with OntoES)(Created with OntoES) HTML dataHTML data (Hypertext Markup Language) (Hypertext Markup Language)

OutputOutput RDF databaseRDF database (Resource Description (Resource Description

Format)Format) OWL fileOWL file (Ontology Web Language) (Ontology Web Language)

Page 21: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

2121

Ontology WorkbenchOntology Workbench

Page 22: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

2222

Extracted MarriagesExtracted MarriagesBetDate MarDate NameM NameF NameU

same day 1576 Nicholas Patch

Christian Denman

26 JAN 1605 Richard Patch Joan Lavor

26 SEP 1613 John Elliott Joan

Woodbery7 AUG

1615 Thomas Prime Maria Parry

29 JAN 1616 William Woodbery Elizabeth

Patch2 MAY

1620 William Hillerd Fortu: Patch

17 SEP 1622 Nicholas Patch Elizabeth

Owlsey22 JAN

1627 Richard Patch Mary White

16 JAN 1630 Andrew Elliott Joan Patch

12 FEB 1639 Andrew Elliott Joan Pitts

Page 23: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

23

3. OWL File and SWRL 3. OWL File and SWRL RulesRules OWL HEADER

<owl:Class rdf:ID="MarriageRecord"/> <owl:Class rdf:ID="Person"/> <owl:Class rdf:ID="NameU"/> <owl:DatatypeProperty rdf:ID="NameUValue"> <rdfs:domain rdf:resource="#NameU"/> <rdfs:range rdf:resource="&xsd;string"/> </owl:DatatypeProperty>

PERSON - NAMEU <owl:ObjectProperty rdf:ID="Person-NameU"> <rdfs:domain rdf:resource="#Person"/> <rdfs:range rdf:resource="#NameU"/> <owl:inverseOf> <owl:ObjectProperty rdf:ID="NameU-Person"/> </owl:inverseOf> </owl:ObjectProperty>

Page 24: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

24

Sample RDF TriplesPerson_10 | sameAs | Person_10Person_10 | type | ThingPerson_10 | type | PersonNameU_0 | NameUValue | “Christian Denman”NameU_0 | sameAs | NameU_0NameU_0 | type | ThingNameU_0 |type | NameUNameM_4 | NameMValue | “Nicholas Patch”NameM_4 | sameAs | NameM_4NameM_4 | type | ThingNameM_4 |type | NameM

Page 25: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

2525

SWRL RulesSWRL Rules Define OWL ClassDefine OWL Class

Example – HusbandExample – Husband <owl:Class rdf:ID="Husband"/><owl:Class rdf:ID="Husband"/>

Define RuleDefine Rule Example – Person with male name is a Example – Person with male name is a

HusbandHusband Person-NameM(?x,?y) -> Husband(?x)Person-NameM(?x,?y) -> Husband(?x)

?x

?y

Page 26: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

26

Marriage – Person_10 to Person_4 Person_10

<Person rdf:ID="Person_10"> <Person-NameU rdf:resource="#NameU_0" /> </Person>

MarriageRecord_7 <MarriageRecord rdf:ID="MarriageRecord_7"> <MarriageRecord-Person rdf:resource="#Person_4" /> <MarriageRecord-Person rdf:resource="#Person_10" /> </rdf:MarriageRecord>

NameM_4 <NameM rdf:ID="NameM_4"> <NameMValue> Nicholas Patch</NameMValue> </NameM>

Person_4 <Person rdf:ID="Person_4"> <Person-NameM rdf:resource="#NameM_4" /> </Person>

Page 27: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

27

Rule HEAD in OWL file

<swrl:Imp rdf:ID="Def-Husband"> <swrl:head rdf:parseType=“Collection”>

<swrl:ClassAtom> <swrl:argument1 rdf:resource="#x"/> <swrl:classPredicate

rdf:resource="#Husband"/> </swrl:ClassAtom></swrl:head>

Page 28: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

28

Rule BODY in OWL file

<swrl:body><swrl:IndividualPropertyAtom>

<swrl:propertyPredicate rdf:resource="#Person-NameM"/>

<swrl:argument1 rdf:resource="#x"/> <swrl:argument2 rdf:resource="#y"/> </swrl:IndividualPropertyAtom>

</swrl:body>

</swrl:Imp>

Page 29: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

2929

Related RulesRelated Rules NameF is populated then value in NameU is NameF is populated then value in NameU is

HusbandHusbandPerson-NameF(?w,?v) Person-NameF(?w,?v) MarriageRecord-Person(?z,? MarriageRecord-Person(?z,?

w) w) MarriageRecord-Person(?z,?x) MarriageRecord-Person(?z,?x) Person-NameU(?x,?y) Person-NameU(?x,?y) -> Husband(?x)-> Husband(?x)

?x

?z

?w ?v?y

Page 30: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

3030

HusbandOf RuleHusbandOf Rule

Husband(?x) Husband(?x) Wife(?y) Wife(?y) MarriageRecord- MarriageRecord-Person(?z,?x)Person(?z,?x)

MarriageRecord-Person(?z,?y)MarriageRecord-Person(?z,?y) -> HusbandOf(?x,?y)-> HusbandOf(?x,?y)

Page 31: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

31

Auxiliary Name RulesAuxiliary Name Rules

31

NameM(?x) -> Name(?x)NameM(?x) -> Name(?x)NameF(?x) -> Name(?x)NameF(?x) -> Name(?x)NameU(?x) -> Name(?x)NameU(?x) -> Name(?x)

NameMValue(?x) -> NameValue(?x)NameMValue(?x) -> NameValue(?x)NameFValue(?x) -> NameValue(?x)NameFValue(?x) -> NameValue(?x)NameUValue(?x) -> NameValue(?x)NameUValue(?x) -> NameValue(?x)

Person-NameM(?x,?y) -> Person-Name(?x,?y)Person-NameM(?x,?y) -> Person-Name(?x,?y)Person-NameF(?x,?y) -> Person-Name(?x,?y)Person-NameF(?x,?y) -> Person-Name(?x,?y)Person-NameU(?x,?y) -> Person-Name(?x,?y)Person-NameU(?x,?y) -> Person-Name(?x,?y)

Page 32: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

3232

4.4. SPARQL Query SPARQL Query Who is Who is Husband ofHusband of Christian Christian

Denman?Denman?PREFIX : PREFIX : http://www.deg.byu.edu/ontology/Marriage#

SELECT ?HusbandSELECT ?HusbandWHEREWHERE{{ ?X :NameValue "Christian Denman" .?X :NameValue "Christian Denman" . ?Y :Person-Name ?X .?Y :Person-Name ?X . ?W :HusbandOf ?Y .?W :HusbandOf ?Y . ?W :Person-Name ?V .?W :Person-Name ?V . ?V :NameValue ?Husband?V :NameValue ?Husband}}

Page 33: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

3333

Query ResultsQuery ResultsHusbandHusband==============================================================================""Nicholas PatchNicholas Patch"^^http://www.w3.org/2001/XMLSchema#string"^^http://www.w3.org/2001/XMLSchema#string

Page 34: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

3434

Query ResultsQuery ResultsHusbandHusband==============================================================================""Nicholas PatchNicholas Patch"^^http://www.w3.org/2001/XMLSchema#string"^^http://www.w3.org/2001/XMLSchema#string

South Petherton Marriagessame day 1576 Nicholas Patch and Christian Denman 26 Jan 1605 Richard Patch and Joan Lavor 25-Sep 1613 John Elliott and Joan Woodbery 7-Aug 1615 Thomas Prime and Maria Parry 29-Jan 1616 William Woodbery and Elizabeth Patch 2-May 1620 William Hillerd and Fortu: Patch 17-Sep 1622 Nicholas Patch and Elizabeth Owsley 22-Jan 1627 Richard Patch and Mary White 15-Jan 1630 Andrew Elliott and Joan Patch 12-Feb 1639 Andrew Elliott and Joan Pitts

“Nicholas Patch” because: NameValue(“Nicholas Patch”) and Name-NameValue(n1, “Nicholas Patch”) and Name(n1) is NameM(n1) and Person-NameM(p1, n1) NameValue(“Christian Denman”) and Name-NameValue(n2, “Christian Denman”) and Name(n2) is NameU(n2) and Person-NameU(p2, n2)Husband(p1) because: Person-NameM(p1, n1)Wife(p2) because: Person-NameU(p2, n2) and Person-MarriageRecord(p2, r1) and MarriageRecord-Person(r1, p1) and Person-NameM(p1, n1)HusbandOf(p1, p2) because: Husband(p1) and Wife(p1) and MarriageRecord-Person(r1, p1) and MarriageRecord-Person(r1, p1)

Page 35: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

3535

5. Experimental Results5. Experimental Results Extraction ResultsExtraction Results

American Extraction ProblemAmerican Extraction Problem

Rule ResultsRule Results

Page 36: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

3636

Extraction ResultsExtraction ResultsMARRIAGES

ENTITIES RECALL % ERROR

SPRECISION

English 188 594 588 99.0% 8 98.7%

American 608 1824 1630 89.4

% 34 98.0%

Danish 171 543 538 99.1% 10 98.2%

BIRTHS

English 3153 9489 9394 99.0% 61 99.4%

American 675 2055 1809 88.0

% 33 98.2%

Danish 677 2061 2042 99.1% 15 99.3%

  DEATHS

English 3458 8675 8589 99.0% 83 99.0%

American 510 1305 1148 88.0

% 28 97.6%

Danish 833 2113 2093 99.1% 19 99.1%

Page 37: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

3737

American DifficultyAmerican DifficultyBIRTHBIRTHWOODBURY, Charles Henry [Charles WOODBURY, Charles Henry [Charles

William, P. R. 4.], s. Henry [housewright. William, P. R. 4.], s. Henry [housewright. dup.] and Henrietta (Galloup), Dec. 4, 1845.dup.] and Henrietta (Galloup), Dec. 4, 1845.

Extra information inside brackets & Extra information inside brackets & parenthesesparentheses Charles WilliamCharles William – twin of Charles Henry – twin of Charles Henry Henry [housewrightHenry [housewright – identified as NAME – identified as NAME Henrietta (Galloup)Henrietta (Galloup) –identified as NAME –identified as NAME

Page 38: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

3838

Rules ResultsRules Results 100% Precision and Recall100% Precision and Recall

(Once rules are well-defined, the results are (Once rules are well-defined, the results are perfect.)perfect.)

Database SizeDatabase Size(The RDF database is much larger when rule (The RDF database is much larger when rule

triples are added.)triples are added.) NEW PROPERTIES – husband, wife, parent, NEW PROPERTIES – husband, wife, parent,

childchild NEW LINKSNEW LINKS

Page 39: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

39

Size Impact of Adding Rules

MARRIAGE 21 rules EVENT 30 rules

Triples OWL(# lines)

OWL File (kilobytes)

Triples OWL(# lines)

OWL File (kilobytes)

OWL File814 498 14 2232 1405 15

W/Rules1009 785 31 2983 1873 75

Difference195 287 17 751 468 60

Increase23.96% 57.63% 121.43% 33.65% 33.31% 400.00%

Page 40: Automatic Extraction From and Reasoning About Genealogical Records:  A Prototype

4040

6. Conclusions6. Conclusions Speed up data indexingSpeed up data indexing Make production of a full index easierMake production of a full index easier Ground the index in original documentsGround the index in original documents Provide for inferred factsProvide for inferred facts Simplify as well as augment record Simplify as well as augment record

searchsearch Help link records and form family Help link records and form family

groups and ancestral linesgroups and ancestral lines