Multi-column Substring Matching for Database Schema Translation
Automating Schema Matching
BYU 2003, BYU Data Extraction Group
Automating Schema Matching
David W. Embley, Cui Tao, Li Xu
Brigham Young University
Funded by NSF
Information Exchange
Source → Target
Leverage this (Information Extraction) … to do this (Schema Matching)
Presentation Outline
• Information Extraction
• Schema Matching for Tables
• Direct Schema Matching
• Indirect Schema Matching
• Conclusions and Future Work
Information Extraction
Extracting Pertinent Information from Documents
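A Conceptual-Modeling Solution
[Figure: conceptual model for car ads. Object sets: Car, Year, Price, Make, Mileage, Model, Feature, PhoneNr, Extension. Relationship sets: Car has Year, Make, Model, Mileage, Price, and Feature; PhoneNr is for Car; PhoneNr has Extension. Participation constraints on the edges are 0..1, 0..*, or 1..*.]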
Car-Ads Ontology
Car [->object];
Car [0..1] has Year [1..*];
Car [0..1] has Make [1..*];
Car [0..1] has Model [1..*];
Car [0..1] has Mileage [1..*];
Car [0..*] has Feature [1..*];
Car [0..1] has Price [1..*];
PhoneNr [1..*] is for Car [0..*];
PhoneNr [0..1] has Extension [1..*];
Year matches [4]
  constant { extract "\d{2}"; context "([^\$\d]|^)[4-9]\d[^\d]"; substitute "^" -> "19"; },
  … …
End;
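The Year rule can be read as: find a two-digit number in a suitable context, then prefix it with 19. The following is a minimal Python sketch of that reading, not the group's implementation. It assumes the context expression delimits where the extract expression may match, and the function name `extract_years` is hypothetical.

```python
import re

# Patterns copied from the Year rule; how the real recognizer
# applies them is an assumption made for this sketch.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_years(text):
    """Return four-digit years for two-digit years found in context."""
    years = []
    for context_match in CONTEXT.finditer(text):
        value = EXTRACT.search(context_match.group(0))
        if value:
            # substitute "^" -> "19": prefix the extracted digits with 19
            years.append("19" + value.group(0))
    return years


print(extract_years("'89 Subaru SW, $1900"))
```

The context expression keeps the price from being mistaken for a year: a digit pair that follows a dollar sign or another digit is never considered.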
Recognition and Extraction
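| Car | Year | Make | Model | Mileage | Price | PhoneNr |
|---|---|---|---|---|---|---|
| 0001 | 1989 | Subaru | SW | | $1900 | (336)835-8597 |
| 0002 | 1998 | | Elantra | | | (336)526-5444 |
| 0003 | 1994 | HONDA | ACCORD EX | 100K | | (336)526-1081 |

| Car | Feature |
|---|---|
| 0001 | Auto |
| 0001 | AC |
| 0002 | Black |
| 0002 | 4 door |
| 0002 | tinted windows |
| 0002 | Auto |
| 0002 | pb |
| 0002 | ps |
| 0002 | cruise |
| 0002 | am/fm |
| 0002 | cassette stereo |
| 0002 | a/c |
| 0003 | Auto |
| 0003 | jade green |
| 0003 | gold |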
Schema Matching for HTML Tables with Unknown Structure
Cui Tao
Table-Schema Matching (Basic Idea)
• Many Tables on the Web
• Ontology-Based Extraction
  – Works well for unstructured or semistructured data
  – What about structured data (tables)?
• Method
  – Form attribute-value pairs
  – Do extraction
  – Infer mappings from extraction patterns
Problem: Different Schemas
Target Database Schema
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Different Source Table Schemas
– {Run #, Yr, Make, Model, Tran, Color, Dr}
– {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
– {Vehicle, Distance, Price, Mileage}
– {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
Problem: Attribute is Value
Problem: Attribute-Value is Value
Problem: Value is not Value
Problem: Implied Values
Problem: Missing Attributes
Problem: Compound Attributes
Problem: Factored Values
Problem: Split Values
Problem: Merged Values
Problem: Values not of Interest
Problem: Information Behind Links
Single-column table (formatted as list)
Table extending over several pages
Solution
• Form attribute-value pairs (adjust if necessary)
• Do extraction
• Infer mappings from extraction patterns
Solution: Remove Internal Factoring
Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)*
Unnest: μ(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
[Figure: source table with factored values, e.g. ACURA and Legend, repeated on each row after unnesting]
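Removing internal factoring amounts to copying each factored Make and Model onto every row nested beneath it. Below is a minimal Python sketch of that unnesting, assuming the nesting has already been discovered and is given as nested lists. The function name `unnest` and the sample years are illustrative, not taken from the slides.

```python
def unnest(nested_table):
    """Flatten Make -> Model -> rows nesting into one record per row."""
    flat = []
    for make, models in nested_table:
        for model, rows in models:
            for row in rows:
                # Repeat the factored Make and Model on every row.
                flat.append({"Make": make, "Model": model, **row})
    return flat


# Illustrative data only; the Year values are made up for the example.
nested = [
    ("ACURA", [("Legend", [{"Year": "1992"}, {"Year": "1994"}])]),
    ("Honda", [("Civic EX", [{"Year": "1995"}])]),
]
flat_rows = unnest(nested)
print(flat_rows)
```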
Solution: Replace Boolean Values
β Auto β Air Cond. β AM/FM β CD Table (each β carries the superscript "Yes,")
[Figure: the table after β is applied; each "Yes" in the Auto, Air Cond., AM/FM, and CD columns is replaced by the column's attribute name]
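Replacing Boolean values turns a "Yes" cell into a value the extraction ontology can recognize, namely the attribute name itself. A minimal Python sketch follows, assuming rows are dictionaries and that anything other than "Yes" becomes an empty string. The function name `replace_boolean` is hypothetical.

```python
def replace_boolean(rows, column, true_value="Yes"):
    """Replace true_value in a Boolean column with the column name.

    Any other value becomes the empty string (an assumption of this sketch).
    """
    replaced = []
    for row in rows:
        new_row = dict(row)
        new_row[column] = column if row.get(column) == true_value else ""
        replaced.append(new_row)
    return replaced


rows = [{"Auto": "Yes", "CD": "No"}, {"Auto": "No", "CD": "Yes"}]
for boolean_column in ("Auto", "CD"):
    rows = replace_boolean(rows, boolean_column)
print(rows)
```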
Solution: Form Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>, <CD, >
Solution: Adjust Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond>, <AM/FM>
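The two steps, forming pairs and then adjusting them, can be sketched in Python as follows. This is an illustration under two assumptions read off the example pairs: a pair with an empty value is dropped, and a pair whose value equals its attribute collapses to the attribute alone. The function names are hypothetical.

```python
def form_pairs(row):
    """Pair every attribute with its cell value."""
    return [(attribute, value) for attribute, value in row.items()]


def adjust_pairs(pairs):
    """Drop empty pairs and collapse <A, A> to <A>."""
    adjusted = []
    for attribute, value in pairs:
        if value == "":
            continue  # e.g. <CD, > is dropped
        if value == attribute:
            adjusted.append((attribute,))  # e.g. <Auto, Auto> becomes <Auto>
        else:
            adjusted.append((attribute, value))
    return adjusted


row = {
    "Make": "Honda", "Model": "Civic EX", "Year": "1995",
    "Colour": "White", "Price": "$6300", "Auto": "Auto",
    "Air Cond.": "Air Cond.", "AM/FM": "AM/FM", "CD": "",
}
adjusted = adjust_pairs(form_pairs(row))
print(adjusted)
```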
Solution: Do Extraction
Solution: Infer Mappings
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Each row is a car.
πModel μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
πMake μ(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
πYear Table
Note: Mappings produce sets for attributes. Joining to form records is trivial because we have OIDs for table rows (e.g. for each Car).
Solution: Do Extraction
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
πModel μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
Solution: Do Extraction
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
πPrice Table
Solution: Do Extraction
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
ρ Colour←Feature π Colour Table
∪ ρ Auto←Feature π Auto β Auto Table
∪ ρ Air Cond.←Feature π Air Cond. β Air Cond. Table
∪ ρ AM/FM←Feature π AM/FM β AM/FM Table
∪ ρ CD←Feature π CD β CD Table
(each β carries the superscript "Yes,")
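The Feature mapping is a union over several source columns, each renamed to Feature. A minimal Python sketch of that union is shown below, assuming the Boolean columns have already been replaced by attribute names and that the row position can stand in for the Car OID. The function name `feature_mapping` is hypothetical.

```python
def feature_mapping(rows, feature_columns):
    """Union the non-empty values of several columns into (Car, Feature) pairs."""
    features = set()
    # The row position stands in for the Car OID (an assumption of this sketch).
    for car_oid, row in enumerate(rows, start=1):
        for column in feature_columns:
            value = row.get(column, "")
            if value:
                features.add((car_oid, value))
    return features


rows = [
    {"Colour": "White", "Auto": "Auto", "Air Cond.": "Air Cond.",
     "AM/FM": "AM/FM", "CD": ""},
]
columns = ["Colour", "Auto", "Air Cond.", "AM/FM", "CD"]
car_features = feature_mapping(rows, columns)
print(sorted(car_features))
```

Returning a set mirrors the note that mappings produce sets for attributes; the OID keeps each feature tied to its car.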
Experiment
• Tables from 60 sites
• 10 “training” tables
• 50 test tables
• 357 mappings (from all 60 sites)
  – 172 direct mappings (same attribute and meaning)
  – 185 indirect mappings (29 attribute synonyms, 5 “Yes/No” columns, 68 unions over columns for Feature, 19 factored values, and 89 columns of merged values that needed to be split)
Results
• 10 “training” tables
  – 100% of the 57 mappings (no false mappings)
  – 94.6% of the values in linked pages (5.4% false declarations)
• 50 test tables
  – 94.7% of the 300 mappings (no false mappings)
  – On the basis of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision
• 16 missed mappings
  – 4 partial (not all unions included)
  – 6 non-U.S. car-ads (unrecognized makes and models)
  – 2 U.S. unrecognized makes and models
  – 3 prices (missing $ or found MSRP instead)
  – 1 mileage (mileages less than 1,000)
Direct Schema Matching
Li Xu
Attribute Matching for Populated Schemas
• Central Idea: Exploit All Data & Metadata
• Matching Possibilities (Facets)
  – Attribute Names
  – Data-Value Characteristics
  – Expected Data Values
  – Data-Dictionary Information
  – Structural Properties
Approach
• Target Schema T
• Source Schema S
• Framework
  – Individual Facet Matching
  – Combining Facets
  – Best-First Match Iteration
Example
[Figure: Source Schema S and Target Schema T, each rooted at Car with "has" relationship sets (participation 0:1 or 0:*). Object sets shown across the two schemas: Year, Make, Model, Cost, Style, Feature, Mileage, Miles, Phone. Highlighted matches: Car and Car, Year and Year, Make and Make, Model and Model, Mileage and Miles.]
Individual Facet Matching
• Attribute Names
• Data-Value Characteristics
• Expected Data Values
Attribute Names
• Target and Source Attributes
  – T : A
  – S : B
• WordNet
• C4.5 Decision Tree: feature selection, trained on schemas in DB books
  – f0: same word
  – f1: synonym
  – f2: sum of distances to a common hypernym root
  – f3: number of different common hypernym roots
  – f4: sum of the number of senses of A and B
WordNet Rule
[Figure: decision-tree rule over three WordNet features: the number of different common hypernym roots of A and B; the sum of distances of A and B to a common hypernym; the sum of the number of senses of A and B]
Confidence Measures
Data-Value Characteristics
• C4.5 Decision Tree
• Features
  – Numeric data (Mean, variation, standard deviation, …)
  – Alphanumeric data (String length, numeric ratio, space ratio)
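The features listed above are simple statistics over a column's values. A minimal Python sketch of how they might be computed follows. It is an illustration only: "variation" is read here as population variance, the ratios are taken over all characters in the column, and the function names are hypothetical.

```python
import statistics


def numeric_features(values):
    """Summary statistics for a numeric column."""
    return {
        "mean": statistics.mean(values),
        # "variation" is interpreted as population variance (an assumption).
        "variance": statistics.pvariance(values),
        "stdev": statistics.pstdev(values),
    }


def alphanumeric_features(values):
    """Length and character-class ratios for a string column."""
    total_chars = sum(len(value) for value in values)
    if total_chars == 0:
        return {"avg_length": 0.0, "numeric_ratio": 0.0, "space_ratio": 0.0}
    digits = sum(ch.isdigit() for value in values for ch in value)
    spaces = sum(ch.isspace() for value in values for ch in value)
    return {
        "avg_length": total_chars / len(values),
        "numeric_ratio": digits / total_chars,
        "space_ratio": spaces / total_chars,
    }


print(numeric_features([1989, 1998, 1994]))
print(alphanumeric_features(["ACCORD EX", "Civic EX"]))
```

Feature vectors of this kind would then be fed to the decision tree, which decides whether two columns have similar value characteristics.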
Confidence Measures
Expected Data Values
• Target Schema T and Source Schema S
  – Regular expression recognizer for attribute A in T
  – Data instances for attribute B in S
• Hit Ratio = N'/N for (A, B) match
  – N': number of B data instances recognized by the regular expressions of A
  – N: number of B data instances
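The hit ratio N'/N is straightforward to compute once a recognizer is available. A minimal Python sketch follows, assuming "recognized" means the whole value matches the regular expression. The Year pattern and the function name `hit_ratio` are illustrative, not the ontology's actual recognizer.

```python
import re


def hit_ratio(recognizer, instances):
    """Return N'/N: the fraction of instances the recognizer accepts."""
    if not instances:
        return 0.0
    # Whole-value matching is an assumption of this sketch.
    hits = sum(1 for value in instances if recognizer.fullmatch(value.strip()))
    return hits / len(instances)


# Hypothetical recognizer for target attribute A = Year.
year_recognizer = re.compile(r"(19|20)\d{2}")
# Data instances of a source attribute B.
source_values = ["1989", "1998", "100K", "1994"]
print(hit_ratio(year_recognizer, source_values))  # 0.75
```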
Confidence Measures
Combined Measures
Threshold: 0.5
[Figure: matrix of 0/1 values obtained by applying the threshold to the combined confidence measures]
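The slides do not say how the facets are combined, so the following Python sketch rests on loudly stated assumptions: the combined measure is the plain average of the facet confidences, a pair qualifies when its combined value is at least the 0.5 threshold, and best-first iteration is a greedy pass that takes the highest-confidence pair first and uses each attribute at most once. All names and scores are hypothetical.

```python
def combine_facets(facet_scores):
    """Average per-facet confidences for each (source, target) pair.

    Averaging is an assumption; the actual combination rule is not given.
    """
    pairs = set().union(*(scores.keys() for scores in facet_scores))
    return {
        pair: sum(scores.get(pair, 0.0) for scores in facet_scores) / len(facet_scores)
        for pair in pairs
    }


def best_first_matches(combined, threshold=0.5):
    """Greedily accept the most confident pairs at or above the threshold."""
    candidates = sorted(
        ((confidence, pair) for pair, confidence in combined.items()
         if confidence >= threshold),
        reverse=True,
    )
    used_sources, used_targets, matches = set(), set(), []
    for confidence, (source, target) in candidates:
        if source in used_sources or target in used_targets:
            continue  # each attribute is matched at most once
        matches.append((source, target))
        used_sources.add(source)
        used_targets.add(target)
    return matches


# Made-up facet confidences for illustration.
name_facet = {("Mileage", "Miles"): 0.9, ("Cost", "Price"): 0.2}
value_facet = {("Mileage", "Miles"): 0.8, ("Cost", "Price"): 0.9,
               ("Cost", "Miles"): 0.7}
combined = combine_facets([name_facet, value_facet])
print(best_first_matches(combined))
```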
Final Confidence Measures
Experimental Results
• This schema, plus 6 other schemas
  – 32 matched attributes
  – 376 unmatched attributes
• Measures
  – Recall: 100%
  – Precision: 94%
  – F Measure: 97%
• False Positives
  – “Feature” --- “Color”
  – “Feature” --- “Body Type”
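The reported measures follow from the counts: 32 correct matches, 2 false positives, and no false negatives. The short Python check below assumes the F measure is the usual harmonic mean of precision and recall, which reproduces the figures above. The function name is hypothetical.

```python
def precision_recall_f(correct, false_positives, false_negatives):
    """Return (precision, recall, F) from match counts."""
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / (correct + false_positives)
    recall = correct / (correct + false_negatives)
    # Harmonic mean of precision and recall (assumed definition of F).
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


precision, recall, f_measure = precision_recall_f(32, 2, 0)
print(f"{precision:.0%} {recall:.0%} {f_measure:.0%}")  # 94% 100% 97%
```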
Indirect Schema Matching
Schema Matching
[Figure: Source and Target car schemas. Labels shown: Car, Year, Cost, Style, Feature, Phone, Miles, Mileage, Model, Make, Make & Model, Color, Body Type.]
Mapping Generation
• Direct Matches as described earlier:
  – Attribute Names based on WordNet
  – Value Characteristics based on value lengths, averages, …
  – Expected Values based on regular-expression recognizers
• Indirect Matches:
  – Direct matches
  – Structure Evaluation
    • Union
    • Selection
    • Decomposition
    • Composition
Union and Selection
Decomposition and Composition
Structure
[Figure: two purchase-order schemas, labeled Target and Source. PO: POShipTo, POBillTo, POLines; City, Street; Item, Count; Line, Qty, UoM. PurchaseOrder: DeliverTo, InvoiceTo, Items; Address with City, Street; Item, ItemCount, ItemNumber, Quantity, UnitOfMeasure.]
Example Taken From [MBR, VLDB’01]
Structure (Nonlexical Matches)
[Figure: PO and PurchaseOrder schema diagrams (Target and Source) illustrating nonlexical matches]
Structure (Join over FD Relationship Sets, …)
[Figure: PO and PurchaseOrder schema diagrams (Target and Source) illustrating a join over FD relationship sets; POShipTo appears alongside DeliverTo]
Structure (Lexical Matches)
[Figure: PO and PurchaseOrder schema diagrams (Target and Source) illustrating lexical matches; paired labels include City and City, Street and Street]
Experimental Results

| Applications (Number of Schemes) | Precision (%) | Recall (%) | F (%) | Correct | False Positive | False Negative |
|---|---|---|---|---|---|---|
| Course Schedule (5) | 98 | 93 | 96 | 119 | 2 | 9 |
| Faculty Member (5) | 100 | 100 | 100 | 140 | 0 | 0 |
| Real Estate (5) | 92 | 96 | 94 | 235 | 20 | 10 |

Data borrowed from Univ. of Washington [DDH, SIGMOD01]
Indirect Matches: 94% (precision, recall, F-measure)
Rough Comparison with U of W Results (Direct Matches only)
• Course Schedule: accuracy ~71%
• Faculty Members: accuracy ~92%
• Real Estate (2 tests): accuracy ~75%
Conclusions and Future Work
Conclusions
• Table Mappings
  – Tables: 94.7% (Recall); 100% (Precision)
  – Linked Text: ~97% (Recall); ~86% (Precision)
• Direct Attribute Matching
  – Matched 32 of 32: 100% Recall
  – 2 False Positives: 94% Precision
• Direct and Indirect Attribute Matching
  – Matched 494 of 513: 96% Recall
  – 22 False Positives: 96% Precision
www.deg.byu.edu
Current & Future Work: Improve and Extend Indirect Matching
• Improve Object-Set Matching (e.g. Lex/non-Lex)
• Add Relationship-Set Matching
• Computations
Current & Future Work: Tables Behind Forms
• Crawling the Hidden Web
• Filling in Forms from Global Queries
Current & Future Work: Developing Extraction Ontologies
• Creation from Knowledge Sources and Sample Application Pages
  – μK Ontology + Data Frames, Lexicons, …
  – RDF Ontologies
• User Creation by Example
Current & Future Work: and Much More …
• Table Understanding
• Microfilm Census Records
• Generate Ontologies by Reading Tables
• …
www.deg.byu.edu