Post on 20-Jan-2016
Using linked data to interpret tables
Varish Mulwad, Tim Finin, Zareen Syed and Anupam Joshi
University of Maryland, Baltimore County
November 8, 2010
Interpreting a table
Name            Team          Position        Height
Michael Jordan  Chicago       Shooting guard  1.98
Allen Iverson   Philadelphia  Point guard     1.83
Yao Ming        Houston       Center          2.29
Tim Duncan      San Antonio   Power forward   2.11
http://dbpedia.org/class/yago/NationalBasketballAssociationTeams
http://dbpedia.org/resource/Allen_Iverson
dbprop:team
Map numbers as values of properties
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix yago: <http://dbpedia.org/class/yago/> .
"Name"@en is rdfs:label of dbpedia-owl:BasketballPlayer .
"Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams .
"Michael Jordan"@en is rdfs:label of dbpedia:Michael_Jordan .
dbpedia:Michael_Jordan a dbpedia-owl:BasketballPlayer .
"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago_Bulls .
dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams .
Use Cases
Intelligent querying over data
Create a ‘Semantic’ knowledge-base
Data Integration
Search / Query over tables
Confirm/Verify existing knowledge
Add new knowledge to the LOD cloud
Convert legacy data into Semantic Web formats
Motivation and Related Work
We are laying a strong foundation for the Semantic Web …
… but an old problem haunts us …
Chicken ? Egg ? … No Chicken ?
• ~ 14.1 billion tables, 154 million with high quality relational data (Cafarella et al. 2008)
• 305,632 datasets available as CSV or spreadsheets on Data.gov (US), plus 7 other nations establishing open data
• Where is structured data ?
Automate the process
• We need systems that can generate data from existing sources
• Not practical for humans to encode all this into RDF manually
Related Work
• Database to Ontology mapping (Barrasa, Corcho, & Gómez-Pérez 2004), (Hu & Qu 2007), (Papapanagiotou et al. 2006), and (Lawrence 2004)
• Mapping Relational databases to RDF [W3C working group – RDB2RDF]
• Mapping spreadsheets to RDF [RDF123, XLWrap]
• Practical and helpful systems but …
– Require significant manual work
– Do not generate linked data
• Interpreting web tables to answer complex search queries over the web tables (Limaye et al. 2010)
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
Predicting Class Labels for column
[Figure: each cell of the Team column (Chicago, Philadelphia, Houston, San Antonio) is treated as an instance with candidate classes (Class 1 to Class 4), from which the class for the column is chosen]
Knowledge Base
Yago
Wikitology¹ – A hybrid knowledge base where structured data meets unstructured data
¹ Wikitology was created as part of Zareen Syed’s Ph.D. dissertation
Querying the Knowledge–Base
Chicago: 1. Chicago Bulls 2. Chicago 3. Judy Chicago
Philadelphia: 1. Philadelphia 2. Philadelphia 76ers 3. Philadelphia (film)
Houston: 1. Houston Rockets 2. Houston 3. Allan Houston
Types
{dbpedia-owl:Place, dbpedia-owl:City, yago:WomenArtist, yago:LivingPeople, yago:NationalBasketballAssociationTeams}
{dbpedia-owl:Place, dbpedia-owl:PopulatedPlace, dbpedia-owl:Film, yago:NationalBasketballAssociationTeams, …}
{…}
Scoring the classes
Possible classes for the column: dbpedia-owl:Place, dbpedia-owl:City, yago:WomenArtist, yago:LivingPeople, yago:NationalBasketballAssociationTeams, dbpedia-owl:PopulatedPlace, dbpedia-owl:Film, …
(Cell, class) pairs: [Chicago, dbpedia-owl:City], [Philadelphia, dbpedia-owl:City], [Houston, dbpedia-owl:City], …, [Chicago, dbpedia-owl:Film], [Philadelphia, dbpedia-owl:Film], …
E.g. Processing class – “Chicago, yago:NationalBasketballAssociationTeams”
String Chicago:
(R = 1) Chicago Bulls {yago:NationalBasketballAssociationTeams} [PR = 6]
(R = 2) Chicago {dbpedia-owl:PopulatedPlace, dbpedia-owl:City} [PR = 5]
(R = 3) Judy Chicago {yago:WomenArtist, yago:LivingPeople} [PR = 4]
Score = w × (1 / R) + (1 – w) × (Normalized PageRank)
[Chicago, yago:NationalBasketballAssociationTeams] = (0.25 × 1 / 1) + (0.75 × 6 / 7) ≈ 0.893
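The scoring formula can be sketched in a few lines of Python. This is only an illustration: the function and parameter names are ours, and the weight w = 0.25 and the normalising constant 7 are simply the values that appear in the worked example on the slide.

```python
def class_score(rank: int, page_rank: float, max_page_rank: float, w: float = 0.25) -> float:
    """Score one (cell string, class) pair from a knowledge-base hit of that class.

    rank          -- 1-based position R of the hit in the query results
    page_rank     -- PageRank value PR of the hit
    max_page_rank -- value used to normalise PR (an assumption: the slide divides by 7)
    w             -- weight of the rank term; (1 - w) weights the PageRank term
    """
    if rank < 1:
        raise ValueError("rank is 1-based")
    normalized_pr = page_rank / max_page_rank
    return w * (1.0 / rank) + (1.0 - w) * normalized_pr


# Worked example from the slide: "Chicago" -> Chicago Bulls at R = 1 with PR = 6.
score = class_score(rank=1, page_rank=6, max_page_rank=7, w=0.25)
print(round(score, 4))  # 0.8929
```

With w = 0.25 the PageRank term dominates, so a popular entity at a lower rank can still outscore an obscure one at rank 1.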
Linking the table cells
Machine Learning based Approach
Table Cell + Column Header + Row Data + Column Type
Requery KB with predicted class labels as additional evidence
Generate a feature vector for the top N results of the query
Classifier ranks the entities within the set of possible results
Select the highest ranked entity
A second classifier decides whether to link or not
Link to “NIL” or link to the top ranked instance
Learning to Rank
• We trained an SVMrank classifier that learned to rank entities within a given set
Feature Vector
Similarity Measures
• Levenshtein distance
• Dice Score
Popularity Measures
• Wikitology Score
• PageRank
• Page Length
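The two similarity measures can be sketched as follows. The slides name the measures but not the exact variants or which strings are compared, so the character-bigram form of the Dice score and the cell-string versus entity-name comparison below are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def bigrams(s: str) -> set:
    """Set of lower-cased character bigrams of s."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient over character-bigram sets (assumed variant)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # neither string has a bigram to compare
    return 2 * len(x & y) / (len(x) + len(y))


# Cell string "Chicago" against the candidate entity name "Chicago Bulls".
print(levenshtein("Chicago", "Chicago Bulls"))    # 6
print(round(dice("Chicago", "Chicago Bulls"), 3))  # 0.667
```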
“To Link or not to Link …”
• A second SVM classifier
• The feature vector included the feature vector of the top ranked entity and two additional features:
– The SVMrank score of the top ranked entity
– The difference in scores between the top two ranked entities
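A minimal sketch of how that second feature vector could be assembled. The function name and the handling of a single-candidate list are assumptions; the slides only state which two features are appended.

```python
from typing import List, Sequence


def link_features(top_features: Sequence[float], ranked_scores: Sequence[float]) -> List[float]:
    """Build the input for the link / do-not-link classifier.

    top_features  -- feature vector of the top ranked entity
    ranked_scores -- ranker scores of all candidates, highest first
    """
    if not ranked_scores:
        raise ValueError("need at least one candidate")
    top = ranked_scores[0]
    # Assumption: with no runner-up, the gap is taken to be the top score itself.
    gap = top - ranked_scores[1] if len(ranked_scores) > 1 else top
    return list(top_features) + [top, gap]


# Hypothetical values: two similarity features, then ranker scores for three candidates.
print(link_features([6.0, 0.667], [2.0, 0.5, 0.1]))  # [6.0, 0.667, 2.0, 1.5]
```

A large gap suggests an unambiguous match, while a small one gives the classifier a reason to prefer “NIL”.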
Identify and Discover relations
Identify Relations
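[Figure: each row pairs a Name cell (Michael Jordan, Allen Iverson, Yao Ming, Tim Duncan) with a Team cell (Chicago, Philadelphia, Houston, San Antonio). The row pairs are annotated with the relation sets {Rel ‘A’}, {Rel ‘A’, ‘C’}, {Rel ‘A’, ‘B’, ‘C’} and {Rel ‘A’, ‘B’}; the columns are annotated with Rel ‘A’]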
Relation between columns
Michael Jordan - Chicago
Allen Iverson - Philadelphia
Yao Ming - Houston
Relations found for the pairs: {dbprop:team}, {dbprop:team, dbprop:draftTeam}, {dbprop:team}
Candidate relations: dbprop:team, dbprop:draftTeam
Scoring the relations
Michael Jordan - Chicago
Allen Iverson - Philadelphia
Yao Ming - Houston
Candidates: dbprop:team, dbprop:draftTeam
dbprop:team Score: 3
dbprop:draftTeam Score: 1
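The scores above are consistent with a simple vote: each row pair adds one point to every candidate relation that holds between its two entities. A sketch under that assumption (the function name is ours, and which pair carries dbprop:draftTeam is illustrative):

```python
from collections import Counter
from typing import Iterable, List, Tuple


def score_relations(relations_per_pair: Iterable[Iterable[str]]) -> List[Tuple[str, int]]:
    """Count, for each relation, the number of row pairs in which it was found."""
    votes = Counter()
    for relations in relations_per_pair:
        votes.update(set(relations))  # at most one vote per pair and relation
    return votes.most_common()  # highest score first


# One entry per (Name, Team) row pair.
found = [
    ["dbprop:team"],
    ["dbprop:team", "dbprop:draftTeam"],
    ["dbprop:team"],
]
print(score_relations(found))  # [('dbprop:team', 3), ('dbprop:draftTeam', 1)]
```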
Annotating web tables for the Semantic Web
Table as linked RDF
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix yago: <http://dbpedia.org/class/yago/> .
"Name"@en is rdfs:label of dbpedia-owl:BasketballPlayer .
"Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams .
"Michael Jordan"@en is rdfs:label of dbpedia:Michael_Jordan .
dbpedia:Michael_Jordan a dbpedia-owl:BasketballPlayer .
"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago_Bulls .
dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams .
"Team"@en is rdfs:label of dbpedia-owl:Team .
“Team” is the common / human name for the class dbpedia-owl:Team
dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams .
dbpedia:Chicago_Bulls is of type (is an instance of) yago:NationalBasketballAssociationTeams
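To make the output format concrete, here is a minimal sketch that prints statements in the same N3 style from already-computed annotations. The function name and input shapes are hypothetical, and labels are not escaped, so this is not a general serializer.

```python
from typing import Dict, List, Tuple

PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dbpedia": "http://dbpedia.org/resource/",
    "dbpedia-owl": "http://dbpedia.org/ontology/",
    "yago": "http://dbpedia.org/class/yago/",
}


def to_n3(header_classes: Dict[str, str], cell_links: List[Tuple[str, str, str]]) -> str:
    """Serialize annotations as N3 statements.

    header_classes -- column header -> predicted class (prefixed name)
    cell_links     -- (cell label, linked entity, entity class) triples
    """
    lines = [f"@prefix {prefix}: <{iri}> ." for prefix, iri in PREFIXES.items()]
    for header, cls in header_classes.items():
        lines.append(f'"{header}"@en is rdfs:label of {cls} .')
    for label, entity, cls in cell_links:
        lines.append(f'"{label}"@en is rdfs:label of {entity} .')
        lines.append(f"{entity} a {cls} .")
    return "\n".join(lines)


print(to_n3(
    {"Team": "yago:NationalBasketballAssociationTeams"},
    [("Chicago Bulls", "dbpedia:Chicago_Bulls", "yago:NationalBasketballAssociationTeams")],
))
```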
Results
Dataset summary
Number of Tables          15
Total Number of rows      199
Total Number of columns   56 (52)
Total Number of entities  639 (611)
* The number in brackets excludes columns that contained numbers
Evaluation for class label predictions
Evaluation # 1 (MAP)
• Compared the system’s ranked list of labels against a human ranked list of labels
• Metric - Mean Average Precision (MAP)
• Commonly used in the Information Retrieval domain to compare two ranked sets
80.76 %
System Ranked: 1. Person 2. Politician 3. President
Evaluator Ranked: 1. President 2. Politician 3. OfficeHolder
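As an illustration of the metric, average precision for the example lists above can be computed as below; MAP is then the mean of this value over all columns. Treating the evaluator's labels as an unordered relevant set and dividing by its size is one standard formulation, assumed here rather than taken from the slides.

```python
from typing import Sequence, Set


def average_precision(system_ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@k over the positions k where a relevant label appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, label in enumerate(system_ranked, start=1):
        if label in relevant:
            hits += 1
            precision_sum += hits / position  # precision at this cut-off
    return precision_sum / len(relevant)


system = ["Person", "Politician", "President"]
evaluator = {"President", "Politician", "OfficeHolder"}
ap = average_precision(system, evaluator)
print(round(ap, 3))  # 0.389
```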
Evaluation # 2 (Recall)
Recall > 0.6 (75 %)
System Ranked: 1. Person 2. Politician 3. President
Evaluator Ranked: 1. President 2. Politician 3. OfficeHolder
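For the same pair of lists, recall can be sketched as the fraction of evaluator labels that the system also returned. The function name is ours, and the set-based definition is an assumption about how recall was computed.

```python
from typing import Iterable


def recall(system_labels: Iterable[str], evaluator_labels: Iterable[str]) -> float:
    """Fraction of the evaluator's labels that appear in the system's list."""
    relevant = set(evaluator_labels)
    if not relevant:
        return 0.0
    return len(relevant & set(system_labels)) / len(relevant)


r = recall(
    ["Person", "Politician", "President"],
    ["President", "Politician", "OfficeHolder"],
)
print(round(r, 3))  # 0.667
```

Two of the three evaluator labels are recovered, so this example column clears the 0.6 threshold.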
Evaluation # 3 (Correctness)
• Evaluated whether our predicted class labels were “fair and correct”
• Class label may not be the most accurate one, but may be correct.
– E.g. dbpedia-owl:PopulatedPlace is not the most accurate, but still a correct label for a column of cities
• Three human judges evaluated our predicted class labels
• A category-wise breakdown for class label correctness
Overall Accuracy: 76.92 %
Column – Nationality, Prediction – MilitaryConflict
Column – Birth Place, Prediction – PopulatedPlace
Evaluation for linking table cells to entities
Category-wise accuracy for linking table cells
Overall Accuracy: 66.12 %
Relation between columns
• Idea – Ask human evaluators to identify relations between columns in a given table
• Pilot Experiment – Asked three evaluators to annotate five random tables from our dataset
• Evaluators identified 20 relations
• Our accuracy – 5 out of 20 (25 %) were correct
Conclusion and Future Work
Conclusion
• We have demonstrated that it is possible to develop an automated framework for converting tables & spreadsheets to linked data
• Extending and adapting this framework for Open government data
• Discovery of new relations between entities
References
• Cafarella, M. J., Halevy, A., Wang, D. Z., Wu, E., Zhang, Y., 2008. WebTables: exploring the power of tables on the web. Proc. VLDB Endow. 1 (1), 538-549.
• Barrasa, J., Corcho, O., Gómez-Pérez, A., 2004. R2O, an extensible and semantically based database-to-ontology mapping language. In: Proceedings of the 2nd Workshop on Semantic Web and Databases (SWDB2004). Vol. 3372. pp. 1069-1070.
• Hu, W., Qu, Y., 2007. Discovering simple mappings between relational database schemas and ontologies. In: Aberer, K., Choi, K.-S., Noy, N. F., Allemang, D., Lee, K.-I., Nixon, L. J. B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudre-Mauroux, P. (eds.), ISWC/ASWC, volume 4825 of Lecture Notes in Computer Science, 225-238. Springer.
• Papapanagiotou, P., Katsiouli, P., Tsetsos, V., Anagnostopoulos, C., Hadjiefthymiades, S., 2006. Ronto: Relational to ontology schema matching. In: AIS SIGSEMIS Bulletin.
• Lawrence, E. D. R., 2004. Composing mappings between schemas using a reference ontology. In: Proceedings of International Conference on Ontologies, Databases and Application of Semantics (ODBASE), 783-800. Springer.
• Han, L., Finin, T., Parr, C., Sachs, J., Joshi, A., 2008. RDF123: from Spreadsheets to RDF. In: Seventh International Semantic Web Conference. Springer.
• Han, L., Finin, T., Yesha, Y., 2009. Finding semantic web ontology terms from words. In: Proceedings of the Eighth International Semantic Web Conference. Springer.
• Limaye, G., Sarawagi, S., Chakrabarti, S., 2010. Annotating and searching web tables using entities, types and relationships. In: Proc. of the 36th Int'l Conference on Very Large Databases (VLDB).
This work was supported by: