T. Flati, D. Vannella, T. Pasini, R. Navigli
2 Is Bigger (and Better) Than 1: the Wikipedia Bitaxonomy Project
ERC Starting Grant MultiJEDI No. 259234
The Wikipedia structure
Article pages: ~4M
Category pages: ~700K
Two noisy graphs with no explicit hypernym relation.
The Wikipedia structure: an example
[Figure: example graph of pages and categories. Nodes shown: Mickey Mouse, Funny Animal, Superman, Cartoon, Donald Duck, Disney comics characters, Disney comics, Disney character, Fictional characters by medium, Comics by genre, Fictional characters, The Walt Disney Company]
Our goal
To automatically create a Wikipedia Bitaxonomy for Wikipedia pages and categories in a
simultaneous fashion.
Key idea: the page and category levels are mutually beneficial for inducing a wide-coverage and fine-grained integrated taxonomy.
Key idea
[Figure: the same example graph of pages and categories (Mickey Mouse, Funny Animal, Superman, Cartoon, Donald Duck, Disney comics characters, Disney comics, Disney character, The Walt Disney Company, Fictional characters by medium, Comics by genre, Fictional characters), with nodes connected by "is a" edges]
A 3-phase method
Starting from two noisy graphs (pages and categories):
1. Build the page taxonomy
2. Bitaxonomy Algorithm
3. Refine the category taxonomy (+50% categories)
Contributions
1. Self-contained approach
2. Page taxonomy and category taxonomy built simultaneously
3. State-of-the-art results when compared to all other available taxonomies
Phase 1: the WiBi Page taxonomy
Assumptions
• The first sentence of a page is a good definition (also called a gloss)
The WiBi Page taxonomy
1. [Syntactic step] Extract the hypernym lemma from a page definition using a syntactic parser;
2. [Semantic step] Apply a set of linking heuristics to disambiguate the extracted lemma.
Scrooge McDuck is a character […]
[Figure: dependency parse of the sentence, with nn, nsubj and cop arcs]
Syntactic step: hypernym lemma "character"
Semantic step: the extracted lemma is then disambiguated
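The syntactic step can be illustrated with a minimal sketch. It assumes the definition has already been parsed into Stanford-style dependency triples and that tokens are lemmatized; the function name `hypernym_lemma` and the triple layout are illustrative assumptions, not the parser interface used by the authors.

```python
def hypernym_lemma(dependencies):
    """Return the head of the copular construction in a definition.

    `dependencies` is a list of (relation, head, dependent) triples,
    a hypothetical simplified view of a dependency parse.
    """
    # Tokens that govern a nominal subject.
    subject_heads = {head for rel, head, _ in dependencies if rel == "nsubj"}
    for rel, head, _ in dependencies:
        # In "X is a Y", Y governs both the copula and the subject.
        if rel == "cop" and head in subject_heads:
            return head
    return None


# Hand-written parse of "Scrooge McDuck is a character [...]".
parse = [
    ("nn", "McDuck", "Scrooge"),
    ("nsubj", "character", "McDuck"),
    ("cop", "character", "is"),
    ("det", "character", "a"),
]
print(hypernym_lemma(parse))
```

The sketch only covers the plain "X is a Y" pattern; real definitions would need further rules (for instance for "is a kind of" constructions).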
The semantic step
5 cascading linking heuristics:
1. Crowdsourced
2. Category
3. Multiword
4. Monosemous
5. Distributional
[Figure: ambiguous hypernym ('player') + target page (Cristiano Ronaldo) → linking heuristic → disambiguated hypernym (Football player)]
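A cascade of heuristics can be sketched as a list of functions tried in order, where the first one that returns a sense wins. The function names, signatures and the toy lookup table below are hypothetical placeholders, not the actual WiBi heuristics.

```python
from typing import Callable, Optional, Sequence

# A heuristic maps (page title, ambiguous lemma) to a page title or None.
Heuristic = Callable[[str, str], Optional[str]]


def link_hypernym(page: str, lemma: str,
                  heuristics: Sequence[Heuristic]) -> Optional[str]:
    """Apply the heuristics in cascade and return the first answer found."""
    for heuristic in heuristics:
        sense = heuristic(page, lemma)
        if sense is not None:
            return sense
    return None


# Toy stand-ins for two of the five heuristics (assumptions for the demo).
def toy_crowdsourced(page: str, lemma: str) -> Optional[str]:
    return None  # pretend the lemma is not hyperlinked in the definition


def toy_category(page: str, lemma: str) -> Optional[str]:
    table = {("Cristiano Ronaldo", "player"): "Football player"}
    return table.get((page, lemma))


print(link_hypernym("Cristiano Ronaldo", "player",
                    [toy_crowdsourced, toy_category]))
```

Ordering the heuristics this way means that a later one is only consulted when all earlier ones fail to produce a link.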
1. Crowdsourced heuristic
Mickey Mouse is a funny animal cartoon character and the official mascot of The Walt Disney Company.
Use the links from the crowd!
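As a rough illustration of using the links from the crowd, the sketch below looks for a wiki link whose anchor text equals the hypernym lemma and returns its target page. The wikitext of the example sentence and the function name `crowdsourced_link` are assumptions made for the demo.

```python
import re

# Matches [[Target]] and [[Target|anchor text]].
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def crowdsourced_link(definition_wikitext, lemma):
    """Return the target of the link anchored on `lemma`, if any."""
    for match in WIKILINK.finditer(definition_wikitext):
        target = match.group(1).strip()
        anchor = (match.group(2) or match.group(1)).strip()
        if anchor.lower() == lemma.lower():
            return target
    return None


# Hypothetical wikitext for the definition shown above.
definition = ("Mickey Mouse is a [[funny animal]] cartoon "
              "[[Character (arts)|character]] and the official mascot of "
              "[[The Walt Disney Company]].")
print(crowdsourced_link(definition, "character"))
```

Exact matching on the anchor is a simplification; inflected anchors (for example plurals) would need lemmatization first.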
2. Category heuristic
Given a page and its ambiguous hypernym, exploit its categories to build a distribution of the hypernym's senses.
Example page: Donald Duck. Ambiguous hypernym: Character
Categories of the page: Characters in Disney package films, Disney comics characters
Other pages in these categories: Pluto, Hook, Mickey Mouse, José Carioca, Goofy
Goofy is a funny animal cartoon character […]
José Carioca is a Disney cartoon character […]
Captain James Hook is a fictional character […]
Mickey Mouse is a funny animal cartoon character […]
Pluto, also called Pluto the Pup, is a cartoon character […]
Sense counts for the two categories:
Character (arts) 5, Funny animal 1
Character (arts) 3, Funny animal 1, Cartoon 1
Aggregated distribution: Character (arts) 8, Funny animal 2, Cartoon 1
Disambiguated hypernym for Donald Duck: Character (arts)
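The aggregation in the Donald Duck example can be sketched as follows. The per-category counts are the ones shown on the slide; the list of candidate senses for "character" and the function names are illustrative assumptions.

```python
from collections import Counter


def aggregate(per_category_counts):
    """Sum the hypernym counts collected from each category of the page."""
    total = Counter()
    for counts in per_category_counts:
        total.update(counts)
    return total


def pick_sense(total, candidate_senses):
    """Pick the most frequent candidate sense, or None if none was seen."""
    seen = [(total[sense], sense) for sense in candidate_senses if total[sense] > 0]
    return max(seen)[1] if seen else None


# Counts from the slide, one Counter per category of Donald Duck.
per_category = [
    Counter({"Character (arts)": 5, "Funny animal": 1}),
    Counter({"Character (arts)": 3, "Funny animal": 1, "Cartoon": 1}),
]
# Hypothetical candidate senses of the lemma "character".
candidates = ["Character (arts)", "Character (computing)"]

total = aggregate(per_category)
print(total)
print(pick_sense(total, candidates))
```

Restricting the choice to candidate senses of the lemma keeps unrelated hypernyms such as Funny animal or Cartoon from being selected, even though they appear in the distribution.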
Page taxonomy linking heuristics
1. Crowdsourced (1.338M)
2. Category (1.603M)
3. Multiword (65K)
4. Monosemous (161K)
5. Distributional (561K)
Page taxonomy evaluation
The story so far
Phase 1: noisy page graph → page taxonomy
Phase 2: the Bitaxonomy algorithm
The Bitaxonomy algorithm
● The information available in the two taxonomies is mutually beneficial;
● At each step exploit one taxonomy to update the other and vice versa;
● Repeat until convergence.
The Bitaxonomy algorithm: an example
[Figure: pages Real Madrid F.C., Atlético Madrid and Football team; categories Football clubs in Madrid, Football clubs and Football teams; pages and categories are connected by cross links, and "is a" edges are added step by step]
1. Starting from the page taxonomy
2. Exploit the cross links to infer hypernym relations in the category taxonomy
3. Take advantage of cross links to infer back is-a relations in the page taxonomy
4. Use the relations found in the previous step to infer new hypernym edges
5. Mutual enrichment of both taxonomies until convergence
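The alternation between the two sides can be sketched as a small fixed-point loop. This is a simplified illustration under stated assumptions, not the authors' exact procedure: each uncovered node looks at its cross-linked nodes on the other side, follows their hypernyms, and votes for the nodes cross-linked back from those hypernyms. All names and the toy cross links are hypothetical.

```python
from collections import Counter


def infer(nodes, hypernyms, links_out, other_hypernyms, links_back):
    """Give each uncovered node its most-voted hypernym; report any change."""
    changed = False
    for node in nodes:
        if node in hypernyms:
            continue
        votes = Counter()
        for linked in links_out.get(node, ()):
            hyper = other_hypernyms.get(linked)
            if hyper is None:
                continue
            for candidate in links_back.get(hyper, ()):
                if candidate != node:
                    votes[candidate] += 1
        if votes:
            hypernyms[node] = votes.most_common(1)[0][0]
            changed = True
    return changed


def bitaxonomy(pages, cats, page_hyper, cat_hyper, page_to_cats, cat_to_pages):
    """Alternate between the two taxonomies until nothing new is inferred."""
    page_hyper, cat_hyper = dict(page_hyper), dict(cat_hyper)
    while True:
        cats_changed = infer(cats, cat_hyper, cat_to_pages, page_hyper, page_to_cats)
        pages_changed = infer(pages, page_hyper, page_to_cats, cat_hyper, cat_to_pages)
        if not (cats_changed or pages_changed):
            return page_hyper, cat_hyper


# Toy data loosely following the football example (cross links are assumed).
pages = ["Real Madrid F.C.", "Atlético Madrid", "Football team"]
cats = ["Football clubs in Madrid", "Football teams"]
page_to_cats = {
    "Real Madrid F.C.": ["Football clubs in Madrid"],
    "Atlético Madrid": ["Football clubs in Madrid"],
    "Football team": ["Football teams"],
}
cat_to_pages = {
    "Football clubs in Madrid": ["Real Madrid F.C.", "Atlético Madrid"],
    "Football teams": ["Football team"],
}
page_hyper, cat_hyper = bitaxonomy(
    pages, cats, {"Real Madrid F.C.": "Football team"}, {},
    page_to_cats, cat_to_pages)
print(page_hyper)
print(cat_hyper)
```

In this toy run the known page edge first yields a category edge, and that category edge in turn yields a page edge for Atlético Madrid, which mirrors the mutual enrichment described above. The loop terminates because every productive pass covers at least one new node.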
Page taxonomy evaluation (cont'd)
A noticeable 3% increase in recall and coverage, with unchanged precision.
Category taxonomy evaluation
The story so far
Phase 3: the WiBi category taxonomy refinement
Category taxonomy refinement
Some categories are affected by structural problems.
[Figure: categories Comics characters by protagonist, Comics characters and Garfield characters, with a callout "No pages associated!"]
Category taxonomy refinement
● 3 refinement procedures to obtain broader coverage for categories
o Single super category
o Sub-categories
o Super-categories
Single super category
This category has only 1 outgoing edge, so we promote its only super category to hypernym.
[Figure: categories Comics characters by protagonist, Comics characters, Garfield characters, Animated television characters by series, Animated characters, Fictional characters by medium, Animation]
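The single super category rule is simple enough to sketch directly. The function name and the toy graph are assumptions; in particular, which category on the slide has the single outgoing edge is not recoverable, so the example edge is hypothetical.

```python
def promote_single_super(category_supers, cat_hyper):
    """Promote the only super category of an uncovered category to hypernym.

    `category_supers` maps a category to its list of super categories;
    `cat_hyper` maps already covered categories to their hypernym.
    """
    updated = dict(cat_hyper)
    for category, supers in category_supers.items():
        if category not in updated and len(supers) == 1:
            updated[category] = supers[0]
    return updated


# Hypothetical toy graph: one category with a single outgoing edge.
category_supers = {
    "Comics characters by protagonist": ["Comics characters"],
    "Animated characters": ["Fictional characters by medium", "Animation"],
}
print(promote_single_super(category_supers, {}))
```

Categories with several super categories are left untouched here; they are the ones the other two procedures are meant to handle.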
Sub-categories
Focus on subcategories which have already been covered!
[Figure: category graph around Comics characters by company, with nodes Disney comics, Comics by company, Comics characters, DC Comics characters, Marvel Comics characters and Comics titles by company; only 1 path ends in u, 2 paths end in v]
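One way to read the path counts on the slide is as a vote: each already covered subcategory is followed up its hypernym chain, and a super category gains a vote for every chain that ends in it. The sketch below implements that reading as an assumption; the function names, the hypernym edges and the identification of u and v with specific categories are all hypothetical.

```python
from collections import Counter


def hypernym_chain(category, cat_hyper):
    """Follow hypernym edges upwards, stopping on a dead end or a cycle."""
    chain = []
    while category in cat_hyper and cat_hyper[category] not in chain:
        category = cat_hyper[category]
        chain.append(category)
    return chain


def sub_categories_rule(category, supers, subs, cat_hyper):
    """Choose the super category reached by most covered subcategories."""
    votes = Counter()
    for sub in subs.get(category, ()):
        reached = set(hypernym_chain(sub, cat_hyper))
        for sup in supers.get(category, ()):
            if sup in reached:
                votes[sup] += 1
    return votes.most_common(1)[0][0] if votes else None


# Toy data: edges below are assumed for illustration only.
target = "Comics characters by company"
supers = {target: ["Comics by company", "Comics characters"]}
subs = {target: ["Disney comics", "DC Comics characters", "Marvel Comics characters"]}
cat_hyper = {
    "Disney comics": "Comics by company",
    "DC Comics characters": "Comics characters",
    "Marvel Comics characters": "Comics characters",
}
print(sub_categories_rule(target, supers, subs, cat_hyper))
```

With these toy edges one chain ends in Comics by company and two end in Comics characters, so the latter is chosen.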
Category taxonomy evaluation: coverage
+50% categories covered!
[Chart: coverage by stage; x-axis: iteration 1, SUP, SUB, SUPER]
Category taxonomy evaluation: P & R
+35% recall
[Chart: precision and recall by stage; x-axis: iterations 1, SUP, SUB, SUPER; a value of 86% is highlighted]
Experimental setup
● We created 2 datasets:
o 1000 randomly sampled pages;
o 1000 randomly sampled categories.
● Each item was annotated with the most suitable generalization (lemma+page or category).
Competitors
● WikiNet
● MENTA
● WikiTaxonomy
[Figure: competitors shown against the pages and categories levels]
Measures
● We calculated typical measures to assess the quality of all the possible taxonomies:
o Precision
o Recall
o Coverage
o Specificity
o Granularity
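The first three measures can be sketched against a gold-standard sample as below. The definitions used here are assumptions for illustration (precision over answered items, recall and coverage over all gold items), and the data is made up; they are not guaranteed to match the exact definitions used in the evaluation.

```python
def evaluate(system, gold):
    """Compute precision, recall and coverage of a taxonomy on a gold sample.

    `system` maps an item to the set of hypernyms it proposes;
    `gold` maps every sampled item to its set of acceptable hypernyms.
    """
    answered = [item for item in gold if system.get(item)]
    correct = [item for item in answered if system[item] & gold[item]]
    precision = len(correct) / len(answered) if answered else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    coverage = len(answered) / len(gold) if gold else 0.0
    return precision, recall, coverage


# Made-up toy sample: 4 gold items, 3 answered, 2 answered correctly.
gold = {
    "Mickey Mouse": {"Character (arts)"},
    "Donald Duck": {"Character (arts)"},
    "Real Madrid F.C.": {"Football team"},
    "Frank Sinatra": {"Singer", "Swing singer"},
}
system = {
    "Mickey Mouse": {"Character (arts)"},
    "Donald Duck": {"Duck"},
    "Frank Sinatra": {"Singer"},
}
precision, recall, coverage = evaluate(system, gold)
print(precision, recall, coverage)
```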
Page taxonomy comparison
Category taxonomy comparison
Specificity measure
Measuring specificity
A system is more specific than another when the hypernym(s) provided by the former are more specific/informative than those provided by the latter.
Example: "Frank Sinatra is a"
System 1: "Singer" < System 2: "Swing singer" (System 1 is less specific than System 2)
Page taxonomy specificity
[Chart: ratio of the times in which WiBi provided a more specific answer than the other system, and ratio of the times in which WiBi provided a less specific answer than the other system]
Category taxonomy specificity
Measuring granularity
[Chart: granularity of the taxonomies for pages and categories]
Conclusions
● Unified, 3-phase approach to the construction of a bitaxonomy for the English Wikipedia;
● Self-contained, no additional resources or supervision required;
● Nearly full coverage of Wikipedia pages and categories;
● State-of-the-art performance both on pages and categories.
wibitaxonomy.org
Tiziano Flati, Daniele Vannella, Tommaso Pasini, Roberto Navigli
Linguistic Computing Laboratory
lcl.uniroma1.it