Extending DBpedia with Wikipedia List Pages
10/22/13 Heiko Paulheim, Simone Paolo Ponzetto 1
Disclaimer
• This presentation shows an idea
– after all, it says “position paper”
– We don't know if it works!
– (but we are quite confident)
Lists in Wikipedia
• Wikipedia loves lists
• As of June 2013, there are almost 600,000 list pages
• Lists organize Wikipedia pages
– that correspond to DBpedia instances
• Example:
– List of African-American writers
Lists in Wikipedia
• Different types of lists
– simple bullet point lists
– broken bullet point lists (i.e., split across different sections)
• sometimes, the sections are semantically meaningful
– tables
– ...
[Chart: share of list pages by type (simple bullet list, broken bullet list, table, other)]
Lists in Wikipedia
• What information is in a list?
– the linked things have the same “type”
• The type can be a complex construct
– e.g., Writer ∩ ∀nationality.{United States} ∩ ∀ethnicity.{African American}
• Sometimes, a list carries additional information
– e.g., birth dates for persons
Extracting Information from Lists
• Goal:
– find the common characteristics of all things in the list
• Example: African-American writers
– all instances are writers
– all instances have nationality=United_States
– all instances have ethnicity=African_American
• Information in DBpedia is far from complete
– makes extraction difficult
– but: big potential to add information to DBpedia
[Figures shown on the slide: 25%, 12%, 3%]
Extracting Information from Lists
• Possible approach: finding characteristics with high TF-IDF
– TF: percentage of instances in the list that carry characteristic
– IDF: 1 / (percentage of all DBpedia instances that carry characteristic)
• Rationale: only going by frequency would rate owl:Thing the highest
• Example: African-American writers
– type=Writer: 0.608 (maximal across all possible classes)
– nationality=United_States: 0.277
– ethnicity=African_American: 0.127
• But:
– deathPlace=New_York_City: 0.157 :-(
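The scoring idea above can be sketched in a few lines of Python. This is only an illustration of the TF and IDF definitions as literally stated on the slide, not the authors' implementation. The function name `tf_idf_scores`, the data layout (a dict from instance name to a set of `(property, value)` pairs), and the toy data are assumptions. The slide's example scores are all below 1, so the authors presumably apply some normalization (for example a logarithm) that the slide does not show; this sketch does not reproduce it.

```python
from collections import Counter


def tf_idf_scores(list_members, all_instances):
    """Score characteristics of a list page by TF * IDF.

    list_members:  instance names linked from the list page (hypothetical input)
    all_instances: dict mapping every known instance name to its set of
                   (property, value) characteristics (hypothetical layout)
    """
    members = [m for m in list_members if m in all_instances]
    if not members:
        return {}
    # How many list members / how many instances overall carry each characteristic.
    in_list = Counter(c for m in members for c in all_instances[m])
    overall = Counter(c for chars in all_instances.values() for c in chars)
    scores = {}
    for characteristic, count in in_list.items():
        tf = count / len(members)                                    # share within the list
        idf = 1.0 / (overall[characteristic] / len(all_instances))   # inverse global share
        scores[characteristic] = tf * idf
    return scores


# Toy data (made up): every instance is an owl:Thing, only A and B are writers.
toy_instances = {
    "A": {("type", "owl:Thing"), ("type", "Writer")},
    "B": {("type", "owl:Thing"), ("type", "Writer")},
    "C": {("type", "owl:Thing")},
    "D": {("type", "owl:Thing")},
}
toy_scores = tf_idf_scores(["A", "B"], toy_instances)
```

On the toy data, `type=Writer` scores higher than `type=owl:Thing` although both apply to every list member, which is the effect the rationale bullet describes.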
Extracting Information from Lists
• Example: African-American writers
– ethnicity=African_American: 0.127
– deathPlace=New_York_City: 0.157
• Exploit further information from list page
– e.g., wiki:African_American is linked from the list page, New_York_City is not
– e.g., analyze the list page title, for instance with DBpedia Spotlight
• African_American is recognized as an entity
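One way to combine the statistical score with evidence from the list page is sketched below. The slides do not say how the two signals are combined, so the multiplicative `boost`, its value of 2.0, and the function name `rerank_with_page_links` are purely hypothetical. The two input scores are the ones given on the slide.

```python
def rerank_with_page_links(scores, linked_entities, boost=2.0):
    """Boost characteristics whose value is an entity linked from the list page.

    scores:          dict mapping (property, value) to a statistical score
    linked_entities: set of entity names linked from the list page
    boost:           hypothetical multiplier, not taken from the slides
    """
    reranked = {}
    for (prop, value), score in scores.items():
        factor = boost if value in linked_entities else 1.0
        reranked[(prop, value)] = score * factor
    return reranked


# Scores from the slide; only African_American is linked from the list page.
slide_scores = {
    ("ethnicity", "African_American"): 0.127,
    ("deathPlace", "New_York_City"): 0.157,
}
reranked = rerank_with_page_links(slide_scores, {"African_American"})
```

With this assumed boost, `ethnicity=African_American` ends up ranked above `deathPlace=New_York_City`, which is the ordering the slide is aiming for.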
Lists of Lists in Wikipedia
• Wikipedia also has ~600 lists of lists
– they organize lists
– they form a hierarchy
• E.g.:
– Lists of Writers
– Lists of American writers
– List of African American writers
From Lists of Lists to an Extended Ontology
• Idea:
– find the corresponding “Lists of ...” pages for DBpedia classes
– extend the class hierarchy
• DBpedia Ontology: owl:Thing > Agent > Person > Artist > Writer (with further classes “...” at each level)
• Extended Ontology: Writer > American Writer > African-American Writer > ...
• Corresponding Wikipedia pages:
– Writer: Lists of Writers
– American Writer: Lists of American Writers
– African-American Writer: List of African-American Writers
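The step from a lists-of-lists hierarchy to new subclass relations could look roughly like the following sketch. The nested-dict input, the helper names `title_to_class` and `subclass_axioms`, and the naive naming rule (strip the "List(s) of" prefix and a trailing "s") are assumptions for illustration only. Matching real list titles to DBpedia classes would need something more robust.

```python
def title_to_class(title):
    """Derive a class label from a list page title (naive, hypothetical rule)."""
    for prefix in ("Lists of ", "List of "):
        if title.startswith(prefix):
            title = title[len(prefix):]
            break
    # Crude singularization: drop one trailing "s".
    return title[:-1] if title.endswith("s") else title


def subclass_axioms(tree, parent_class):
    """Walk a nested dict of list pages and collect (subclass, superclass) pairs."""
    axioms = []
    for title, subtree in tree.items():
        cls = title_to_class(title)
        if cls != parent_class:  # the top-level page maps to the existing class itself
            axioms.append((cls, parent_class))
        axioms.extend(subclass_axioms(subtree, cls))
    return axioms


# The example hierarchy from the slides, rooted at the DBpedia class Writer.
lists_of_lists = {
    "Lists of Writers": {
        "Lists of American writers": {
            "List of African American writers": {},
        },
    },
}
axioms = subclass_axioms(lists_of_lists, "Writer")
```

For the example hierarchy this yields two pairs: "American writer" under "Writer", and "African American writer" under "American writer".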
Potential of the Idea
• Assuming we extract everything correctly from List of African American writers, we get
– 814 new type statements (only DBpedia ontology)
– 1409 new property assertions
– two entirely new instances
• ...and there are ~600,000 list pages
– extrapolation: we can roughly double the information in DBpedia
• many list pages contain extra information
– e.g., birth places and birth dates of persons
Challenges
• Robust extraction of instances
– from different kinds of list pages
– e.g., picking the right column in a table
– tables and bullet point lists already account for 75% of list pages
• Picking good scoring functions
– TF-IDF looks promising at first glance
• Combining statistical and textual evidence
• Scalable implementation
– Advantage: perfectly parallelizable
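Because each list page can be processed independently, the work maps naturally onto a pool of workers. The sketch below uses Python's standard `concurrent.futures` module. The per-page function `process_list_page` is a placeholder that only counts members, and the `(title, members)` input format is an assumption. A real extraction step would be CPU-bound and might be better served by a process pool or a cluster framework.

```python
from concurrent.futures import ThreadPoolExecutor


def process_list_page(page):
    """Placeholder for per-page extraction; here it just counts the members."""
    title, members = page
    return title, len(members)


def process_all(pages, max_workers=4):
    """Process list pages independently of each other in a worker pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order; each page is handled in isolation.
        return dict(pool.map(process_list_page, pages))


# Hypothetical input: (title, linked instances) pairs.
toy_pages = [
    ("List of X", ["a", "b"]),
    ("List of Y", []),
]
page_results = process_all(toy_pages)
```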
Extending DBpedia with Wikipedia List Pages
Heiko Paulheim, Christian Bizer