Extending DBpedia with Wikipedia List Pages

10/22/13 Heiko Paulheim, Simone Paolo Ponzetto 1 Extending DBpedia with Wikipedia List Pages Heiko Paulheim, Simone Paolo Ponzetto

Thanks to its wide coverage and general-purpose ontology, DBpedia is a prominent dataset in the Linked Open Data cloud. DBpedia's content is harvested from Wikipedia's infoboxes, based on manually created mappings. In this paper, we explore the use of a promising source of knowledge for extending DBpedia, i.e., Wikipedia's list pages. We discuss how a combination of frequent pattern mining and natural language processing (NLP) methods can be leveraged in order to extend both the DBpedia ontology, as well as the instance information in DBpedia. We provide an illustrative example to show the potential impact of our approach and discuss its main challenges.

Transcript of Extending DBpedia with Wikipedia List Pages

Page 1: Extending DBpedia with Wikipedia List Pages

Extending DBpedia with Wikipedia List Pages

Heiko Paulheim, Simone Paolo Ponzetto

Page 2: Extending DBpedia with Wikipedia List Pages

Disclaimer

• This presentation shows an idea

– after all, it says “position paper”

– We don't know if it works!

– (but we are quite confident)

Page 3: Extending DBpedia with Wikipedia List Pages

Lists in Wikipedia

• Wikipedia loves lists

• As of June 2013, there are almost 600,000 list pages

• Lists organize Wikipedia pages

– that correspond to DBpedia instances

• Example:

– List of African-American writers

Page 4: Extending DBpedia with Wikipedia List Pages

Lists in Wikipedia

Page 5: Extending DBpedia with Wikipedia List Pages

Lists in Wikipedia

• Different types of lists

– simple bullet point lists

– broken bullet point lists (i.e., different sections)

• sometimes, the sections are semantically meaningful

– tables

– ...

[Chart of list types: Simple Bullet List, Broken Bullet List, Table, Other]
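
As a rough illustration of telling these list types apart, the sketch below inspects the raw wikitext of a page. It assumes standard MediaWiki markup (`{|` opens a table, `*` starts a bullet item, `== ... ==` is a section heading). The function name `classify_list_page` and the decision order are illustrative assumptions, not part of the presented approach.

```python
import re

def classify_list_page(wikitext: str) -> str:
    """Guess the list type of a Wikipedia list page from its wikitext (heuristic)."""
    lines = wikitext.splitlines()
    # A wikitext table starts with "{|" at the beginning of a line.
    has_table = any(line.lstrip().startswith("{|") for line in lines)
    # Bullet items start with "*".
    bullets = [line for line in lines if line.startswith("*")]
    # Section headings look like "== Title ==" (two or more "=" on each side).
    headings = [line for line in lines if re.match(r"^==+[^=].*==+\s*$", line)]

    if has_table:
        return "table"
    if bullets and len(headings) > 1:
        # Assumption: bullets spread over several sections form a "broken" list.
        return "broken bullet list"
    if bullets:
        return "simple bullet list"
    return "other"

simple = "* [[A]]\n* [[B]]"
broken = "== A ==\n* [[Alice]]\n== B ==\n* [[Bob]]"
table = '{| class="wikitable"\n|-\n| [[A]]\n|}'
print(classify_list_page(simple), classify_list_page(broken), classify_list_page(table))
```

Real list pages mix these forms (for example, a table plus a "See also" bullet list), so a production classifier would need more than this ordering of checks.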

Page 6: Extending DBpedia with Wikipedia List Pages

Lists in Wikipedia

• What information is in a list?

– the linked things have the same “type”

• The type can be a complex construct

– e.g., Writer ⊓ ∀nationality.{United States} ⊓ ∀ethnicity.{African American}

• Sometimes, there are additional pieces of information

– e.g., birth dates for persons

Page 7: Extending DBpedia with Wikipedia List Pages

Extracting Information from Lists

• Goal:

– find the common characteristics of all things in the list

• Example: African-American writers

– all instances are writers

– all instances have nationality=United_States

– all instances have ethnicity=African_American

• Information in DBpedia is far from complete

– makes extraction difficult

– but: big potential to add information to DBpedia

[Slide annotations: 25%, 12%, 3%]

Page 8: Extending DBpedia with Wikipedia List Pages

Extracting Information from Lists

• Possible approach: finding characteristics with high TF-IDF

– TF: percentage of instances in the list that carry the characteristic

– IDF: 1 / (percentage of all DBpedia instances that carry the characteristic)

• Rationale: ranking by frequency alone would rate owl:Thing the highest

• Example: African-American writers

– type=Writer: 0.608 (maximal across all possible classes)

– nationality=United_States: 0.277

– ethnicity=African_American: 0.127

• But:

– deathPlace=New_York_City: 0.157 :-(
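
The scoring idea on this slide can be sketched as follows. The code takes the slide's definitions literally (TF times IDF, with IDF = 1 / overall share). The data is an invented toy example, and the sketch does not reproduce the scores quoted above, which may involve normalization not described on the slide. The function name `tf_idf` and the data layout are assumptions.

```python
def tf_idf(list_instances, all_instances, characteristic):
    """Score a (property, value) characteristic for one list page.

    Both mappings go from an instance name to the set of characteristics it carries.
    """
    # TF: share of instances on the list that carry the characteristic.
    tf = sum(characteristic in c for c in list_instances.values()) / len(list_instances)
    # Share of all instances that carry the characteristic; IDF is its inverse.
    df = sum(characteristic in c for c in all_instances.values()) / len(all_instances)
    if df == 0:
        return 0.0
    return tf / df  # equals TF * IDF with IDF = 1 / df

THING = ("type", "owl:Thing")
WRITER = ("type", "Writer")

# Toy knowledge base (hypothetical): 8 instances, 3 of them typed as Writer.
all_instances = {
    "a": {THING, WRITER}, "b": {THING, WRITER}, "c": {THING, WRITER},
    "d": {THING}, "e": {THING}, "f": {THING}, "g": {THING}, "h": {THING},
}
# Toy list page linking to 4 instances; "d" is a writer whose type is missing.
list_instances = {k: all_instances[k] for k in ["a", "b", "c", "d"]}

print(tf_idf(list_instances, all_instances, WRITER))
print(tf_idf(list_instances, all_instances, THING))
```

In the toy data, owl:Thing has the highest raw frequency on the list, but because every instance carries it, its score stays below that of Writer, which is the rationale given above.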

Page 9: Extending DBpedia with Wikipedia List Pages

Extracting Information from Lists

• Example: African-American writers

– ethnicity=African_American: 0.127

– deathPlace=New_York_City: 0.157

• Exploit further information from list page

– e.g., wiki:African_American is linked from page, New_York_City is not

– e.g., analyze the list page title, for instance with DBpedia Spotlight

• African_American is recognized as an entity
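
One way to combine the statistical score with this link evidence is sketched below. The slide does not say how the two signals are combined; the multiplicative `boost` and its value of 2.0 are arbitrary assumptions, and `rerank` is a hypothetical name. The two scores are the ones quoted on the slide.

```python
def rerank(scores, linked_entities, boost=2.0):
    """Boost candidates whose value is linked from the list page, then sort by score."""
    adjusted = {}
    for (prop, value), score in scores.items():
        # Assumption: textual evidence simply multiplies the statistical score.
        factor = boost if value in linked_entities else 1.0
        adjusted[(prop, value)] = score * factor
    return sorted(adjusted.items(), key=lambda item: item[1], reverse=True)

scores = {
    ("ethnicity", "African_American"): 0.127,
    ("deathPlace", "New_York_City"): 0.157,
}
# African_American is linked from the list page; New_York_City is not.
linked_entities = {"African_American"}

ranked = rerank(scores, linked_entities)
print(ranked[0][0])
```

With the boost applied, ethnicity=African_American moves ahead of deathPlace=New_York_City. Entities recognized in the page title could be added to `linked_entities` in the same way.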

Page 10: Extending DBpedia with Wikipedia List Pages

Lists of Lists in Wikipedia

• Wikipedia also has ~600 lists of lists

– they organize lists

– they form a hierarchy

• E.g.:

– Lists of Writers

– Lists of American writers

– List of African American writers

Page 11: Extending DBpedia with Wikipedia List Pages

From Lists of Lists to an Extended Ontology

• Idea:

– find the corresponding “Lists of ...” pages for DBpedia classes

– extend the class hierarchy accordingly

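DBpedia Ontology (superclass > subclass):

owl:Thing > Agent > Person > Artist > Writer

Extended Ontology (new subclasses below Writer):

Writer > American Writer > African-American Writer

Corresponding Wikipedia pages:

– Writer: Lists of Writers

– American Writer: Lists of American Writers

– African-American Writer: List of African-American Writers

(further sibling classes at each level are shown as "..." on the slide)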
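
The mapping from the list hierarchy to new subclass relations could look roughly like the sketch below. The title-to-class-name heuristic (strip the "List(s) of" prefix, drop a trailing "s", capitalize each word) is a naive assumption for illustration, and `class_name` and `subclass_axioms` are hypothetical helpers.

```python
import re

def class_name(list_title: str) -> str:
    """Turn 'Lists of American writers' into 'American Writer' (naive heuristic)."""
    name = re.sub(r"^Lists? of ", "", list_title)
    if name.endswith("s"):
        name = name[:-1]  # crude singularization; fails for irregular plurals
    return " ".join(word[0].upper() + word[1:] for word in name.split())

def subclass_axioms(list_hierarchy):
    """Map (parent list, child list) pairs to (subclass, superclass) pairs."""
    return [(class_name(child), class_name(parent)) for parent, child in list_hierarchy]

# Parent/child pairs taken from the lists-of-lists example above.
list_hierarchy = [
    ("Lists of Writers", "Lists of American writers"),
    ("Lists of American writers", "List of African American writers"),
]

for sub, sup in subclass_axioms(list_hierarchy):
    print(f"{sub} rdfs:subClassOf {sup}")
```

The top of the derived chain (Writer) is the point where it attaches to the existing DBpedia ontology; matching that name to an existing class is assumed here to be an exact string match.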

Page 12: Extending DBpedia with Wikipedia List Pages

Potential of the Idea

• Assuming that we extract everything correctly from the List of African American writers page, we get

– 814 new type statements (only DBpedia ontology)

– 1409 new property assertions

– two entirely new instances

• ...and there are ~600,000 list pages

– extrapolation: we can roughly double the information in DBpedia

• many list pages contain extra information

– e.g., birth places and birth dates of persons
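
To make the counting behind these numbers concrete, the sketch below derives the statements that list membership would add for instances that lack them. The instance names and the existing data are invented placeholders, and `new_statements` is a hypothetical helper; it does not reproduce the figures above.

```python
def new_statements(list_members, existing, characteristics):
    """Return (subject, property, value) triples implied by list membership but missing."""
    triples = []
    for member in list_members:
        known = existing.get(member, set())
        for prop, value in characteristics:
            if (prop, value) not in known:
                triples.append((member, prop, value))
    return triples

# Characteristics shared by everything on the list (from the running example).
characteristics = [
    ("type", "Writer"),
    ("nationality", "United_States"),
    ("ethnicity", "African_American"),
]
# Hypothetical current state of the knowledge base.
existing = {
    "Writer_A": {("type", "Writer"), ("nationality", "United_States")},
    "Writer_B": {("type", "Writer")},
}
list_members = ["Writer_A", "Writer_B", "Writer_C"]

triples = new_statements(list_members, existing, characteristics)
# Members that are linked from the list but unknown so far are new instances.
new_instances = [m for m in list_members if m not in existing]
print(len(triples), new_instances)
```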

Page 13: Extending DBpedia with Wikipedia List Pages

Challenges

• Robust extraction of instances

– from different kinds of list pages

– e.g., picking the right column in a table

– tables and bullet point lists already account for 75% of list pages

• Picking good scoring functions

– TF-IDF looks reasonable at first glance

• Combining statistical and textual evidence

• Scalable implementation

– Advantage: perfectly parallelizable

Page 14: Extending DBpedia with Wikipedia List Pages

Extending DBpedia with Wikipedia List Pages

Heiko Paulheim, Simone Paolo Ponzetto