DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in DBpedia

Heiko Paulheim, 10/22/13


Determining the semantic relatedness (i.e., the strength of a relation) of two resources in DBpedia (or other Linked Data sources) is a problem that quite a few approaches have addressed in recent years. However, there are no large-scale benchmark datasets for comparing such approaches, so it remains open which of them work better than others. Furthermore, large-scale datasets for training machine-learning-based approaches are not available. DBpediaNYD is a large-scale synthetic silver standard benchmark dataset that contains symmetric and asymmetric similarity values, obtained using a web search engine.




• There are quite a few approaches to entity ranking/statement weighting on Linked Data

– and DBpedia in particular

• Examples:

– Franz et al. (2009) – Tensor Decomposition

– Meij et al. (2009) – Machine Learning

– Mirizzi et al. (2010) – Web Search Engines

– Mulay and Kumar (2011) – Machine Learning

– Hees et al. (2012) – Crowd Sourcing

– Nunes et al. (2012) – Social Network Analysis



• However,

– none of those have been competitively evaluated

– none of those have been evaluated at large scale

• Evaluation with

– small private data sets

– user studies

• Approaches using Machine Learning

– require training data

– which is expensive to obtain


The Dataset

• Large-scale dataset (several thousand instances)

– statements with strengths

• Strength value: Normalized Google Distance (NGD)

• f(x): number of search results containing x

• f(x,y): number of search results containing both x and y

• M: number of pages in search engine index

• NGD has been shown to correlate with human strength associations
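For reference, the standard definition of Normalized Google Distance in terms of the quantities listed above is NGD(x, y) = (max(log f(x), log f(y)) − log f(x, y)) / (log M − min(log f(x), log f(y))). A minimal sketch of that computation follows; the function name and the input validation are illustrative choices, not part of the dataset's tooling.

```python
import math


def ngd(fx: int, fy: int, fxy: int, m: int) -> float:
    """Symmetric Normalized Google Distance from raw hit counts.

    fx, fy: number of search results containing x (resp. y)
    fxy:    number of search results containing both x and y
    m:      number of pages in the search engine index
    """
    if min(fx, fy, fxy) <= 0:
        # log(0) is undefined, so pairs without hits cannot be scored
        raise ValueError("all hit counts must be positive")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(m) - min(log_fx, log_fy))
```

Terms that always co-occur (f(x) = f(y) = f(x,y)) get distance 0; the value grows as the overlap shrinks relative to the more frequent term.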


The Dataset

• NGD is a symmetric value

– NYD dataset also contains asymmetric values

• Asymmetric Normalized Google Distance

• f(x): number of search results containing x

• f(x,y): number of search results containing both x and y

• M: number of pages in search engine index
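A minimal sketch of a directed distance built from the same three quantities. It assumes (this is not confirmed by the slides) that the asymmetric variant replaces the maximum in the NGD numerator with log f(x) for the direction x → y and keeps the NGD denominator. The function name is hypothetical.

```python
import math


def directed_ngd(fx: int, fy: int, fxy: int, m: int) -> float:
    """Distance from x to y under an ASSUMED asymmetric variant of NGD.

    Assumption: numerator uses log f(x) only; denominator is unchanged
    from the symmetric NGD. The dataset's actual definition may differ.
    """
    if min(fx, fy, fxy) <= 0:
        raise ValueError("all hit counts must be positive")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (log_fx - log_fxy) / (math.log(m) - min(log_fx, log_fy))


# Hypothetical counts: x is mentioned often, y mostly together with x.
x_to_y = directed_ngd(fx=10_000, fy=1_000, fxy=800, m=10**6)
y_to_x = directed_ngd(fx=1_000, fy=10_000, fxy=800, m=10**6)
```

Under this assumption the symmetric NGD equals the larger of the two directed values, and the direction starting from the rarely mentioned term yields the smaller distance.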


Constructing the Dataset

• We sampled 10,000 statements

– with DBpedia resources as subject and object (e.g., no type statements, no literals)

– with dbpedia or dbpprop predicate

• ...and computed symmetric/asymmetric NGD

– using the labels as search strings

– using Yahoo BOSS
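The selection criteria above can be sketched as a filter over (subject, predicate, object) triples followed by random sampling. Several details are assumptions: triples are modelled as plain string tuples, the predicate namespaces are taken to be the DBpedia ontology and property namespaces, and the search string is derived from the URI as a stand-in for the resource's actual label. All function names are hypothetical.

```python
import random

RESOURCE_NS = "http://dbpedia.org/resource/"
# Assumed reading of "dbpedia or dbpprop predicate"
PREDICATE_NS = ("http://dbpedia.org/ontology/", "http://dbpedia.org/property/")


def is_candidate(subject: str, predicate: str, obj: str) -> bool:
    """Keep only resource-to-resource statements with a DBpedia predicate."""
    return (
        subject.startswith(RESOURCE_NS)
        and obj.startswith(RESOURCE_NS)  # rules out literals and classes
        and predicate.startswith(PREDICATE_NS)  # rules out rdf:type etc.
    )


def sample_statements(triples, k, seed=0):
    """Draw up to k candidate statements uniformly at random."""
    candidates = [t for t in triples if is_candidate(*t)]
    return random.Random(seed).sample(candidates, min(k, len(candidates)))


def search_string(uri: str) -> str:
    """Stand-in for the label: local name with underscores as spaces."""
    return uri[len(RESOURCE_NS):].replace("_", " ")
```

For each sampled statement, the search strings for subject and object would then be used to obtain f(x), f(y), and f(x,y) from the search engine.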


The Dataset

• Random sample of 10,000 statements

– i.e., 30,000 search engine calls: three per statement, for f(x), f(y), and f(x,y) (at 80 cents per 1,000 calls → 24 USD)

• 3,058 pairs of resources had to be discarded

– f(x)<f(x,y) or f(y)<f(x,y)

– search engines sometimes don't count properly :-(

• Result:

– 6,942 weighted statements (symmetric)

– 13,884 weighted statements (asymmetric)
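The numbers above follow from simple arithmetic, and the discard rule is a one-line consistency check. The sketch below reproduces both; the three calls per statement are inferred from 30,000 calls for 10,000 statements, and the names are illustrative.

```python
def is_consistent(fx: int, fy: int, fxy: int) -> bool:
    """A joint hit count can never exceed either individual hit count."""
    return fxy <= fx and fxy <= fy


SAMPLED = 10_000
CALLS_PER_STATEMENT = 3  # inferred: f(x), f(y), f(x,y)
PRICE_CENTS_PER_1000_CALLS = 80
DISCARDED = 3_058

calls = SAMPLED * CALLS_PER_STATEMENT  # total search engine calls
cost_cents = calls // 1000 * PRICE_CENTS_PER_1000_CALLS  # total cost in cents
symmetric = SAMPLED - DISCARDED  # one weight per remaining pair
asymmetric = 2 * symmetric  # one weight per direction
```

Pairs for which `is_consistent` returns `False` are the ones that had to be discarded.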


The Dataset

• Example:

– dbpedia:John_Lennon and dbpedia:Yoko_Ono

• Distances:

– symmetric: 0.18

– John Lennon → Yoko Ono: 0.18

– Yoko Ono → John Lennon: 0.03

• Explanation:

– Yoko Ono is famous for being John Lennon's wife

• and most often mentioned in that context

– John Lennon is more famous for being a member of the Beatles


Example: the DBpedia FindRelated Service

• We trained two regression SVMs (LibSVM) based on DBpediaNYD

– one for symmetric, one for asymmetric

– service allows for finding the most related among the linked resources

• Example results:

• http://wiki.dbpedia.org/FindRelated


Conclusion and Outlook

• DBpediaNYD allows for large-scale evaluation

– rather a silver standard

– does not replace manually created gold standards

• Future work

– validate DBpediaNYD with users

– compare search engines


Something Completely Different

• Challenges enumerated in the workshop intro this morning

– “Logical inference on noisy data”

• Talk on “Type Inference on Noisy RDF Data”

– Was actually applied for DBpedia 3.9

– Friday, 3:15, Bayside 204A
