Automatic Extraction From and Reasoning About Genealogical Records: A Prototype

Post on 19-Mar-2016

48 views 0 download

Tags:

description

Automatic Extraction From and Reasoning About Genealogical Records: A Prototype. By Charla J. Woodbury A thesis submitted to the faculty of Brigham Young University in partial fulfillment of the requirements for the degree of Master of Science. Digital Images – Human Index. - PowerPoint PPT Presentation

Transcript of Automatic Extraction From and Reasoning About Genealogical Records: A Prototype

1

Automatic Extraction From Automatic Extraction From and Reasoning About and Reasoning About

Genealogical Records: A Genealogical Records: A PrototypePrototype

By Charla J. Woodbury

A thesis submitted to the faculty ofBrigham Young University

in partial fulfillment of the requirements for the degree of

Master of Science

22

Digital Images – Human Digital Images – Human IndexIndex

• Large number of competing family history websites•Digital images•Human indexes

• Researchers hunting through records and indexes to put families together

33

ProblemProblem Large amounts of primary genealogical Large amounts of primary genealogical

datadata

Big projects to index and extract recordsBig projects to index and extract records

Two independent indexers and Two independent indexers and adjudicationadjudication

Millions of human hours used to index or Millions of human hours used to index or match records for names and familiesmatch records for names and families

44

Automated Extraction Automated Extraction SolutionSolution

Create a specialized extraction Create a specialized extraction ontology to interpret and label ontology to interpret and label genealogical datagenealogical data

Add rules and logic thatAdd rules and logic that Label family roles - husband, daughter, Label family roles - husband, daughter,

etc.etc. Link family relationshipsLink family relationships

HUSBAND – WIFEHUSBAND – WIFE PARENT – CHILDPARENT – CHILD

5

OutlineOutline1.1. Data PreparationData Preparation2.2. Ontology Extraction System Ontology Extraction System

(OntoES)(OntoES)3.3. OWL File and SWRL RulesOWL File and SWRL Rules4.4. SPARQL QueriesSPARQL Queries5.5. Experimental ResultsExperimental Results6.6. ConclusionsConclusions

5

66

1. Data Preparation1. Data Preparation

Collect machine-readable records from Collect machine-readable records from three different countriesthree different countries

Format in HTML format for extractionFormat in HTML format for extraction

Prepare lexicons for names, places, etc.Prepare lexicons for names, places, etc.

77

New England Vital Records New England Vital Records – Beverly, Massachusetts – Beverly, Massachusetts

1668-18491668-1849

88

Danish Parish – Maglebye, Praesto

1646-1813

99

English Parish – South English Parish – South Petherton, Somersetshire Petherton, Somersetshire

1574-19011574-1901

1010

same day 1576 Nicholas Patch and Christian Denman 26 Jan 1605 Richard Patch and Joan Lavor 25-Sep 1613 John Elliott and Joan Woodbery 7-Aug 1615 Thomas Prime and Maria Parry 29-Jan 1616 William Woodbery and Elizabeth Patch 2-May 1620 William Hillerd and Fortu: Patch 17-Sep 1622 Nicholas Patch and Elizabeth Owsley 22-Jan 1627 Richard Patch and Mary White 15-Jan 1630 Andrew Elliott and Joan Patch 12-Feb 1639 Andrew Elliott and Joan Pitts

SOUTH PETHERTON SOUTH PETHERTON MARRIAGES (from genuki)MARRIAGES (from genuki)

11

Transcript of the original record

1576/1577 eodem die Nicholaus Patch Christinam Denman 26 Jan 1605 Richard Patch et Joanna Lavor 1613 Septembris 26 Johannes Elliott et Joanna Woodbery matrimonis

cominguntur 1615 Augusti 7 Thoms Prime et Maria Patch matrimonio cominguntur 1616/1617 Januarij 29 Wilhelmus Woodbery et Elizabetha Patch matrimonio

cominguntur 1620 Maij 2 : Wilhelmus Hillerd et Fortu: Patch 1622 Septembris 17 Nicholas Patch et Elizabetha Owsley matrimonio

cominguntur 1627/1628 Januarij 22 : Richardus Patch et Maria White matrimonio

cominguntur 1630/1631 Januarij 15 Andreas Elliott et Joanna Patch matrimonio

cominguntur 1639/1640 Februarij 12 Andreas Elliott et Joanna Pittes matrimonio

cominguntur

1212

2. Ontology Extraction 2. Ontology Extraction SystemSystem

OntoESOntoES: automatically interpret and : automatically interpret and correctly label genealogical data correctly label genealogical data usingusing

Data framesData framesRegular expressions Regular expressions LexiconsLexiconsDate conversion methodsDate conversion methods

1313

Marriage OntologyMarriage Ontology

1414

Data Frame EditorData Frame Editor

15

Regular expressions MARDATE

Value expressionEXAMPLE type 25-Sep 1613

(0\d|1\d|2\d|30|31|\d)-{Month}\.?\s*(\d\d\d\d)

Keyword expression (\b(md\.?|marry|marriage|married|maried|

wed|wedding)\b)

1616

SampleSample MONTH MONTH LEXICONLEXICON

1Ober1Ober 7ber7ber 8ber8ber 9ber9ber aprapr aprilapril aprilisaprilis augaug augustaugust augustiaugusti augustusaugustus avravr avrilavril avrilisavrilis decdec decemberdecember

decembrdecembr decembredecembre decembridecembri febfeb febrfebr februarifebruari februaryfebruary janjan januarijjanuarij januaryjanuary juljul julijuli juliusjulius julyjuly junjun junejune

1717

Object LevelObject Level

1818

CANONICALIZATION CANONICALIZATION METHODSMETHODS

inside the ontologyinside the ontology Regularize date (Julian format: Regularize date (Julian format:

YYYYdddYYYYddd))

1620 2-May 1620 2-May →→ 16200931620093

Display stored Julian format as DD MMM YYYY

1620093 →→ 2 MAY 1620

1919

Feast DatesFeast DatesDates expressed as a holy dayDates expressed as a holy day Fixed DatesFixed Dates

Christmas 1720 Christmas 1720 →→ 25 DEC 172025 DEC 1720

Moveable Dates around Easter Moveable Dates around Easter (36 possible Easter dates with leap year (36 possible Easter dates with leap year

variation)variation) 1723 Dnica Septuagesima1723 Dnica Septuagesima →→ 24 JAN 24 JAN

17231723

Same day as previous entrySame day as previous entry

2020

Run Ontology Run Ontology InputInput

Ontology Ontology (Created with OntoES)(Created with OntoES) HTML dataHTML data (Hypertext Markup Language) (Hypertext Markup Language)

OutputOutput RDF databaseRDF database (Resource Description (Resource Description

Format)Format) OWL fileOWL file (Ontology Web Language) (Ontology Web Language)

2121

Ontology WorkbenchOntology Workbench

2222

Extracted MarriagesExtracted MarriagesBetDate MarDate NameM NameF NameU

same day 1576 Nicholas Patch

Christian Denman

26 JAN 1605 Richard Patch Joan Lavor

26 SEP 1613 John Elliott Joan

Woodbery7 AUG

1615 Thomas Prime Maria Parry

29 JAN 1616 William Woodbery Elizabeth

Patch2 MAY

1620 William Hillerd Fortu: Patch

17 SEP 1622 Nicholas Patch Elizabeth

Owlsey22 JAN

1627 Richard Patch Mary White

16 JAN 1630 Andrew Elliott Joan Patch

12 FEB 1639 Andrew Elliott Joan Pitts

23

3. OWL File and SWRL 3. OWL File and SWRL RulesRules OWL HEADER

<owl:Class rdf:ID="MarriageRecord"/> <owl:Class rdf:ID="Person"/> <owl:Class rdf:ID="NameU"/> <owl:DatatypeProperty rdf:ID="NameUValue"> <rdfs:domain rdf:resource="#NameU"/> <rdfs:range rdf:resource="&xsd;string"/> </owl:DatatypeProperty>

PERSON - NAMEU <owl:ObjectProperty rdf:ID="Person-NameU"> <rdfs:domain rdf:resource="#Person"/> <rdfs:range rdf:resource="#NameU"/> <owl:inverseOf> <owl:ObjectProperty rdf:ID="NameU-Person"/> </owl:inverseOf> </owl:ObjectProperty>

24

Sample RDF TriplesPerson_10 | sameAs | Person_10Person_10 | type | ThingPerson_10 | type | PersonNameU_0 | NameUValue | “Christian Denman”NameU_0 | sameAs | NameU_0NameU_0 | type | ThingNameU_0 |type | NameUNameM_4 | NameMValue | “Nicholas Patch”NameM_4 | sameAs | NameM_4NameM_4 | type | ThingNameM_4 |type | NameM

2525

SWRL RulesSWRL Rules Define OWL ClassDefine OWL Class

Example – HusbandExample – Husband <owl:Class rdf:ID="Husband"/><owl:Class rdf:ID="Husband"/>

Define RuleDefine Rule Example – Person with male name is a Example – Person with male name is a

HusbandHusband Person-NameM(?x,?y) -> Husband(?x)Person-NameM(?x,?y) -> Husband(?x)

?x

?y

26

Marriage – Person_10 to Person_4 Person_10

<Person rdf:ID="Person_10"> <Person-NameU rdf:resource="#NameU_0" /> </Person>

MarriageRecord_7 <MarriageRecord rdf:ID="MarriageRecord_7"> <MarriageRecord-Person rdf:resource="#Person_4" /> <MarriageRecord-Person rdf:resource="#Person_10" /> </rdf:MarriageRecord>

NameM_4 <NameM rdf:ID="NameM_4"> <NameMValue> Nicholas Patch</NameMValue> </NameM>

Person_4 <Person rdf:ID="Person_4"> <Person-NameM rdf:resource="#NameM_4" /> </Person>

27

Rule HEAD in OWL file

<swrl:Imp rdf:ID="Def-Husband"> <swrl:head rdf:parseType=“Collection”>

<swrl:ClassAtom> <swrl:argument1 rdf:resource="#x"/> <swrl:classPredicate

rdf:resource="#Husband"/> </swrl:ClassAtom></swrl:head>

28

Rule BODY in OWL file

<swrl:body><swrl:IndividualPropertyAtom>

<swrl:propertyPredicate rdf:resource="#Person-NameM"/>

<swrl:argument1 rdf:resource="#x"/> <swrl:argument2 rdf:resource="#y"/> </swrl:IndividualPropertyAtom>

</swrl:body>

</swrl:Imp>

2929

Related RulesRelated Rules NameF is populated then value in NameU is NameF is populated then value in NameU is

HusbandHusbandPerson-NameF(?w,?v) Person-NameF(?w,?v) MarriageRecord-Person(?z,? MarriageRecord-Person(?z,?

w) w) MarriageRecord-Person(?z,?x) MarriageRecord-Person(?z,?x) Person-NameU(?x,?y) Person-NameU(?x,?y) -> Husband(?x)-> Husband(?x)

?x

?z

?w ?v?y

3030

HusbandOf RuleHusbandOf Rule

Husband(?x) Husband(?x) Wife(?y) Wife(?y) MarriageRecord- MarriageRecord-Person(?z,?x)Person(?z,?x)

MarriageRecord-Person(?z,?y)MarriageRecord-Person(?z,?y) -> HusbandOf(?x,?y)-> HusbandOf(?x,?y)

31

Auxiliary Name RulesAuxiliary Name Rules

31

NameM(?x) -> Name(?x)NameM(?x) -> Name(?x)NameF(?x) -> Name(?x)NameF(?x) -> Name(?x)NameU(?x) -> Name(?x)NameU(?x) -> Name(?x)

NameMValue(?x) -> NameValue(?x)NameMValue(?x) -> NameValue(?x)NameFValue(?x) -> NameValue(?x)NameFValue(?x) -> NameValue(?x)NameUValue(?x) -> NameValue(?x)NameUValue(?x) -> NameValue(?x)

Person-NameM(?x,?y) -> Person-Name(?x,?y)Person-NameM(?x,?y) -> Person-Name(?x,?y)Person-NameF(?x,?y) -> Person-Name(?x,?y)Person-NameF(?x,?y) -> Person-Name(?x,?y)Person-NameU(?x,?y) -> Person-Name(?x,?y)Person-NameU(?x,?y) -> Person-Name(?x,?y)

3232

4.4. SPARQL Query SPARQL Query Who is Who is Husband ofHusband of Christian Christian

Denman?Denman?PREFIX : PREFIX : http://www.deg.byu.edu/ontology/Marriage#

SELECT ?HusbandSELECT ?HusbandWHEREWHERE{{ ?X :NameValue "Christian Denman" .?X :NameValue "Christian Denman" . ?Y :Person-Name ?X .?Y :Person-Name ?X . ?W :HusbandOf ?Y .?W :HusbandOf ?Y . ?W :Person-Name ?V .?W :Person-Name ?V . ?V :NameValue ?Husband?V :NameValue ?Husband}}

3333

Query ResultsQuery ResultsHusbandHusband==============================================================================""Nicholas PatchNicholas Patch"^^http://www.w3.org/2001/XMLSchema#string"^^http://www.w3.org/2001/XMLSchema#string

3434

Query ResultsQuery ResultsHusbandHusband==============================================================================""Nicholas PatchNicholas Patch"^^http://www.w3.org/2001/XMLSchema#string"^^http://www.w3.org/2001/XMLSchema#string

South Petherton Marriagessame day 1576 Nicholas Patch and Christian Denman 26 Jan 1605 Richard Patch and Joan Lavor 25-Sep 1613 John Elliott and Joan Woodbery 7-Aug 1615 Thomas Prime and Maria Parry 29-Jan 1616 William Woodbery and Elizabeth Patch 2-May 1620 William Hillerd and Fortu: Patch 17-Sep 1622 Nicholas Patch and Elizabeth Owsley 22-Jan 1627 Richard Patch and Mary White 15-Jan 1630 Andrew Elliott and Joan Patch 12-Feb 1639 Andrew Elliott and Joan Pitts

“Nicholas Patch” because: NameValue(“Nicholas Patch”) and Name-NameValue(n1, “Nicholas Patch”) and Name(n1) is NameM(n1) and Person-NameM(p1, n1) NameValue(“Christian Denman”) and Name-NameValue(n2, “Christian Denman”) and Name(n2) is NameU(n2) and Person-NameU(p2, n2)Husband(p1) because: Person-NameM(p1, n1)Wife(p2) because: Person-NameU(p2, n2) and Person-MarriageRecord(p2, r1) and MarriageRecord-Person(r1, p1) and Person-NameM(p1, n1)HusbandOf(p1, p2) because: Husband(p1) and Wife(p1) and MarriageRecord-Person(r1, p1) and MarriageRecord-Person(r1, p1)

3535

5. Experimental Results5. Experimental Results Extraction ResultsExtraction Results

American Extraction ProblemAmerican Extraction Problem

Rule ResultsRule Results

3636

Extraction ResultsExtraction ResultsMARRIAGES

ENTITIES RECALL % ERROR

SPRECISION

English 188 594 588 99.0% 8 98.7%

American 608 1824 1630 89.4

% 34 98.0%

Danish 171 543 538 99.1% 10 98.2%

BIRTHS

English 3153 9489 9394 99.0% 61 99.4%

American 675 2055 1809 88.0

% 33 98.2%

Danish 677 2061 2042 99.1% 15 99.3%

  DEATHS

English 3458 8675 8589 99.0% 83 99.0%

American 510 1305 1148 88.0

% 28 97.6%

Danish 833 2113 2093 99.1% 19 99.1%

3737

American DifficultyAmerican DifficultyBIRTHBIRTHWOODBURY, Charles Henry [Charles WOODBURY, Charles Henry [Charles

William, P. R. 4.], s. Henry [housewright. William, P. R. 4.], s. Henry [housewright. dup.] and Henrietta (Galloup), Dec. 4, 1845.dup.] and Henrietta (Galloup), Dec. 4, 1845.

Extra information inside brackets & Extra information inside brackets & parenthesesparentheses Charles WilliamCharles William – twin of Charles Henry – twin of Charles Henry Henry [housewrightHenry [housewright – identified as NAME – identified as NAME Henrietta (Galloup)Henrietta (Galloup) –identified as NAME –identified as NAME

3838

Rules ResultsRules Results 100% Precision and Recall100% Precision and Recall

(Once rules are well-defined, the results are (Once rules are well-defined, the results are perfect.)perfect.)

Database SizeDatabase Size(The RDF database is much larger when rule (The RDF database is much larger when rule

triples are added.)triples are added.) NEW PROPERTIES – husband, wife, parent, NEW PROPERTIES – husband, wife, parent,

childchild NEW LINKSNEW LINKS

39

Size Impact of Adding Rules

MARRIAGE 21 rules EVENT 30 rules

Triples OWL(# lines)

OWL File (kilobytes)

Triples OWL(# lines)

OWL File (kilobytes)

OWL File814 498 14 2232 1405 15

W/Rules1009 785 31 2983 1873 75

Difference195 287 17 751 468 60

Increase23.96% 57.63% 121.43% 33.65% 33.31% 400.00%

4040

6. Conclusions6. Conclusions Speed up data indexingSpeed up data indexing Make production of a full index easierMake production of a full index easier Ground the index in original documentsGround the index in original documents Provide for inferred factsProvide for inferred facts Simplify as well as augment record Simplify as well as augment record

searchsearch Help link records and form family Help link records and form family

groups and ancestral linesgroups and ancestral lines