Post on 11-Jan-2016
Pattern Markup Language
A tool for simplifying data extraction from semi-structured sources
Jonathan Baker, Hilton Campbell, Jordan Crabtree, David W. Embley
Many Sites with Genealogical Data
Structural Patterns
Programmer Defined Regular Expressions
  Regular Expression A
  Regular Expression B
  Regular Expression C
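As a rough illustration of what programmer-defined regular expressions could look like, the sketch below applies three hand-written patterns to a small made-up HTML fragment. The sample markup, the field labels, and the patterns themselves are assumptions for illustration only; the slides do not show the actual expressions A, B, and C.

```python
import re

# Hypothetical page fragment; real genealogy sites will differ.
SAMPLE_HTML = """
<tr><td>Name</td><td>Jane Ann Doe</td></tr>
<tr><td>Born</td><td>1 Jan 1850</td></tr>
<tr><td>Died</td><td>2 Feb 1920</td></tr>
"""

# One hand-written pattern per fact (labels are illustrative, not from the slides).
PATTERNS = {
    "birth_date": re.compile(r"<td>Born</td><td>([^<]+)</td>"),
    "death_date": re.compile(r"<td>Died</td><td>([^<]+)</td>"),
    "given_name": re.compile(r"<td>Name</td><td>(\w+)"),
}


def extract_facts(html):
    """Return the first captured group for each pattern that matches."""
    facts = {}
    for label, pattern in PATTERNS.items():
        match = pattern.search(html)
        if match:
            facts[label] = match.group(1).strip()
    return facts


print(extract_facts(SAMPLE_HTML))
```

Each pattern finds a value on its own, but nothing here says how the values relate to one another.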
Which Relationships Found?
  Given Name, Birth Date, Death Date, Aliases
Simple Schema Represents Relationships
  Person
    Birth
      Date
    Death
      Date
    Names
      Given
      Aliases
Combine Schema and Regular Expressions
  Person
    Birth
      Date: Regular Expression A
    Death
      Date: Regular Expression B
    Names
      Given: Regular Expression C
      Aliases: Regular Expression D
Tree Represented by XML = PatML
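The idea of a schema tree whose leaves carry regular expressions can be sketched as follows. The slides do not show real PatML syntax, so the element names, the `regex` attribute, the sample text, and the patterns are all hypothetical; this is one way such a tree might be written in XML and applied, not the authors' format.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical PatML-like document: tag names mirror the schema on the slides,
# but the "regex" attribute and the patterns are invented for this sketch.
PATML = r"""
<Person>
  <Birth><Date regex="Born:\s*([\w ]+)"/></Birth>
  <Death><Date regex="Died:\s*([\w ]+)"/></Death>
  <Names>
    <Given regex="Name:\s*(\w+)"/>
    <Aliases regex="Also known as:\s*([\w ]+)"/>
  </Names>
</Person>
"""

# Made-up page text to run the patterns against.
SAMPLE = (
    "Name: Jane Doe<br>Also known as: Janie<br>"
    "Born: 1 Jan 1850<br>Died: 2 Feb 1920<br>"
)


def extract(node, text, path=()):
    """Walk the tree; where a node has a regex, record its first capture."""
    path = path + (node.tag,)
    results = {}
    pattern = node.get("regex")
    if pattern is not None:
        match = re.search(pattern, text)
        if match:
            results["/".join(path)] = match.group(1).strip()
    for child in node:
        results.update(extract(child, text, path))
    return results


def extract_from_patml(patml, text):
    """Parse the XML tree and apply it to the page text."""
    return extract(ET.fromstring(patml.strip()), text)


for key, value in extract_from_patml(PATML, SAMPLE).items():
    print(key, "=", value)
```

Because each value is keyed by its path in the tree, the two dates stay distinguishable (`Person/Birth/Date` versus `Person/Death/Date`), which is the relationship information that bare regular expressions lack.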
PatML Generation Tools
  Schema Generator: establishes relationships
  PatML Editor: helps write the regular expressions and establish which facts they match
Using PatML Editor
  Get your schema file
  Browse for sample page
  Add nodes
  Add expressions
  See the highlights in source
  Adjust
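The "see the highlights in source" step can be approximated in a few lines. This is a minimal sketch of the general idea, not the editor's implementation; the function names and the `[[ ]]` markers are assumptions.

```python
import re


def match_spans(source, pattern):
    """Return (start, end) offsets of every match of pattern in source."""
    return [m.span() for m in re.finditer(pattern, source)]


def highlight(source, pattern, open_mark="[[", close_mark="]]"):
    """Wrap every match of pattern in visible markers."""
    return re.sub(
        pattern,
        lambda m: f"{open_mark}{m.group(0)}{close_mark}",
        source,
    )


page_source = "Born: 1850, Died: 1920"
print(match_spans(page_source, r"\d{4}"))
print(highlight(page_source, r"\d{4}"))
```

Seeing exactly which text an expression matches makes it easier to adjust the expression before moving on to the next node.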
PatML Editor Interface
  Browser with rendered sample page
  Text area with sample page source
  Tree representing PatML structure
Fast and Versatile
  Regular sites can be integrated in hours
  Adaptable to any type of information
Implementation to Date
  Genesis uses PatML files to search a variety of sites
  Searches TNG, Retrospect-GDS, Family Search, GedCom and Kansas Gunslingers
  Standardizes information for a common data model
  Simultaneously searches other sites (in different formats) for people with similar information
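Standardizing results from differently formatted sites into a common data model might look roughly like this. The record type, the site keys (`site_a`, `site_b`), and the field names are hypothetical; the slides do not describe how Genesis represents its data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonRecord:
    """Hypothetical common data model shared by all sites."""
    given_name: Optional[str] = None
    birth_date: Optional[str] = None
    death_date: Optional[str] = None
    source: str = ""


# Per-site mapping from extracted field names to the common model (all invented).
FIELD_MAPS = {
    "site_a": {
        "Person/Names/Given": "given_name",
        "Person/Birth/Date": "birth_date",
        "Person/Death/Date": "death_date",
    },
    "site_b": {"first": "given_name", "born": "birth_date", "died": "death_date"},
}


def standardize(site, raw):
    """Translate one site's raw fields into a PersonRecord, dropping unknown keys."""
    mapping = FIELD_MAPS[site]
    fields = {mapping[key]: value for key, value in raw.items() if key in mapping}
    return PersonRecord(source=site, **fields)


print(standardize("site_b", {"first": "Jane", "born": "1 Jan 1850", "extra": "x"}))
```

Once every site's output lands in the same record type, records from different sites can be compared field by field.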
Results
  Produced PatML that correctly extracts data from TNG, RGDS, GedCom sites, and Kansas Gunslingers
  User interface provides an improved debugging environment
  ~1/10 the coding time with PatML generation tools compared to similarly functioning hand-coded parsers
Limitations
  Sites must be recognizable with regular expressions
  Even regular sites have page-to-page HTML variations
  Programmer error with regular expressions
  Regular expression operations can be slow
Future Work
  Automatic regular expression generation
  Parsing links to extract data on connected pages
  Use in other applications and fields
  XPath approaches
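For comparison, an XPath-style approach could select values by their position in the markup rather than by text patterns. The sketch below uses the limited XPath subset in Python's standard `xml.etree.ElementTree`; the sample markup and class names are made up, and it assumes well-formed markup, which real HTML pages often are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; real pages may need an HTML-tolerant parser.
SAMPLE = """<table>
  <tr><td class="label">Born</td><td class="value">1 Jan 1850</td></tr>
  <tr><td class="label">Died</td><td class="value">2 Feb 1920</td></tr>
</table>"""


def extract_rows(xhtml):
    """Map each row's label cell to its value cell using path expressions."""
    root = ET.fromstring(xhtml)
    facts = {}
    for row in root.findall(".//tr"):
        label = row.find("td[@class='label']")
        value = row.find("td[@class='value']")
        if label is not None and value is not None:
            facts[label.text.strip()] = value.text.strip()
    return facts


print(extract_rows(SAMPLE))
```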