Automating the Extraction of Genealogical Information from Historical Documents

Aaron P. Stewart and David W. Embley
March 20, 2011


Page 2

Part I: Vision

Current projects at the BYU Data Extraction Group

Page 3

Goal: Search books for names

History of the Jones Family

scanner

George Jones

Page 4

Original Document

Page 5

Original Document

Page 6

Original Document

Page 7

Extracted Facts

Names
William Gerard Lathrop
Mary Ely
Gerard Lathrop
Charlotte Brackett Jennings
Nathan Tilestone Jennings
Maria Miller
Maria Jennings [Lathrop]
Donald McKenzie [Lathrop]
Anna Margaretta [Lathrop]
Anna Catherine [Lathrop]

Relationships
William Gerard Lathrop : son of : Mary Ely
William Gerard Lathrop : son of : Gerard Lathrop
William Gerard Lathrop : m. : Charlotte Brackett Jennings
Charlotte Brackett Jennings : dau. of : Nathan Tilestone Jennings
Charlotte Brackett Jennings : dau. of : Maria Miller

Relationships (continued)
Maria Jennings : child of : William Gerard Lathrop
Maria Jennings : child of : Charlotte Brackett
William Gerard : child of : William Gerard Lathrop
William Gerard : child of : Charlotte Brackett
Donald McKenzie : child of : William Gerard Lathrop
Donald McKenzie : child of : Charlotte Brackett
Anna Margaretta : child of : William Gerard Lathrop
Anna Margaretta : child of : Charlotte Brackett
Anna Catherine : child of : William Gerard Lathrop
Anna Catherine : child of : Charlotte Brackett
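Each relationship on this slide follows a "subject : relation : object" pattern. As a minimal sketch of how such lines could be held as structured facts, the Python below splits them into triples. The names `Fact` and `parse_fact` are hypothetical and chosen for illustration; they are not part of the BYU Data Extraction Group tools, and the slides do not say how facts are stored internally.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """One extracted relationship (hypothetical structure for illustration)."""
    subject: str
    relation: str
    obj: str


def parse_fact(line: str) -> Fact:
    """Split a line of the form 'A : relation : B' into its three parts."""
    parts = [part.strip() for part in line.split(" : ")]
    if len(parts) != 3:
        raise ValueError(f"expected 'subject : relation : object', got {line!r}")
    return Fact(*parts)


# Sample lines copied from the slide.
lines = [
    "William Gerard Lathrop : son of : Mary Ely",
    "William Gerard Lathrop : m. : Charlotte Brackett Jennings",
    "Charlotte Brackett Jennings : dau. of : Nathan Tilestone Jennings",
]
facts = [parse_fact(line) for line in lines]
```

Keeping the relation label as written ("son of", "m.", "dau. of") preserves the source wording; mapping these labels onto a shared vocabulary would be a separate step.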

Page 8

Inferred Facts


Inferred Relationships
Maria Jennings : grandchild of : Mary Ely
Maria Jennings : grandchild of : Gerard Lathrop
Maria Jennings : grandchild of : Nathan Tilestone Jennings
Maria Jennings : grandchild of : Maria Miller
William Gerard : grandchild of : Mary Ely
William Gerard : grandchild of : Gerard Lathrop
William Gerard : grandchild of : Nathan Tilestone Jennings
William Gerard : grandchild of : Maria Miller
…

Page 9

Keywords

Chief Justice

Page 10

Queries

• Is there a chief justice related to Mary Ely?
• Who are the sons of Gerard Lathrop?
• Who are the grandchildren of Mary Ely?
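The second query can be read as a lookup over the extracted relationships. The sketch below assumes the facts are available as plain (subject, relation, object) tuples; the helper `subjects_with` is a hypothetical name for illustration, and the actual system may answer such queries quite differently.

```python
# Extracted relationships from the earlier slide, as (subject, relation, object).
facts = [
    ("William Gerard Lathrop", "son of", "Mary Ely"),
    ("William Gerard Lathrop", "son of", "Gerard Lathrop"),
    ("William Gerard Lathrop", "m.", "Charlotte Brackett Jennings"),
    ("Charlotte Brackett Jennings", "dau. of", "Nathan Tilestone Jennings"),
    ("Charlotte Brackett Jennings", "dau. of", "Maria Miller"),
]


def subjects_with(facts, relation, obj):
    """Return, in sorted order, every subject linked to obj by relation."""
    return sorted({s for s, r, o in facts if r == relation and o == obj})


# "Who are the sons of Gerard Lathrop?"
sons = subjects_with(facts, "son of", "Gerard Lathrop")
print(sons)
```

The first and third queries need more than a direct lookup: one combines a keyword with a relationship, and the other depends on inferred facts.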

Page 11

Part II: Implementation

Page 12

Ontology Editor

Page 13

Data Frame Editor

Page 14

Rule Editor

Page 15

Name Query

Page 16

Name Query

Page 17

Name Query

Page 18

Name Query

Page 19

HyKSS Indexing

Page 20

HyKSS Indexing

Page 21

Keyword Search

Page 22

Keyword Search

Page 23

Keyword Search

Page 24

Relationship Search

Page 25

Relationship Search

Page 26

Inferred Relationship Search

Page 27

Inferred Relationship Search

Maria Jennings is a grandchild of Mary Ely

GrandchildOf(Maria Jennings, Mary Ely) :- Child-Parent(Maria Jennings, William Gerard Lathrop), Child-Parent(William Gerard Lathrop, Mary Ely)
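The rule on this slide joins two Child-Parent facts on the shared middle person. A minimal Python sketch of that join follows, assuming the facts are given as (child, parent) pairs; `infer_grandchildren` is a hypothetical helper written for illustration and does not reflect the rule engine the system actually uses.

```python
from collections import defaultdict

# Child-Parent facts as (child, parent) pairs, taken from the slides' example.
child_parent = {
    ("Maria Jennings", "William Gerard Lathrop"),
    ("William Gerard Lathrop", "Mary Ely"),
    ("William Gerard Lathrop", "Gerard Lathrop"),
}


def infer_grandchildren(facts):
    """Apply GrandchildOf(x, z) :- Child-Parent(x, y), Child-Parent(y, z)."""
    parents_of = defaultdict(set)
    for child, parent in facts:
        parents_of[child].add(parent)
    # Join each child's parents with those parents' own parents.
    return {
        (child, grandparent)
        for child, parents in parents_of.items()
        for parent in parents
        for grandparent in parents_of.get(parent, ())
    }


grandchild_of = infer_grandchildren(child_parent)
print(sorted(grandchild_of))
```

Because the join matches on exact name strings, it only fires when the same person is written identically in both facts.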

Page 28

Part III: Improvements

Page 29

Extraction Tools

Page 30

Need Better Extraction Results

From Packer et al., http://deg.byu.edu/papers/Ancestry_NAACL_HLT_Paper.pdf

------- Lists -------

Page 31

Example of a Better Extractor (Margin Finder)

B\ liee (OCR error)

Buekman (OCR error)

Jobsph (OCR error)

Baseline errors

Baseline errors

Uuckkman (OCR error)

Charles. (OCR error)

Page 32

Example of a Better Extractor (Margin Finder)

[Figure annotations: LEVEL 1 (repeated), LEVEL 2 (repeated)]

Page 33

Need Annotation Tools

Page 34

Credits

• Ontology Editor – Numerous past students
• Data Frame Editor – Numerous past students
• Rule Editor – Nathan Tate
• Hybrid Keyword and Semantic Search (HyKSS) – Andrew Zitzelberger

• This presentation contains both actual screenshots and mock-ups of projected results