T. Flati, D. Vannella, T. Pasini, R. Navigli 2 Is Bigger (and Better) Than 1: the Wikipedia Bitaxonomy Project ERC Starting Grant MultiJEDI No. 259234

Page 1:

T. Flati, D. Vannella, T. Pasini, R. Navigli

2 Is Bigger (and Better) Than 1: the Wikipedia Bitaxonomy Project

ERC Starting Grant MultiJEDI No. 259234

Page 2:

The Wikipedia structure

Article pages: ~4M

Category pages: ~700K

Two noisy graphs with no explicit hypernym relation.

Page 3:

The Wikipedia structure: an example

[Diagram of pages and categories. Node labels: Mickey Mouse, Funny Animal, Superman, Cartoon, Donald Duck, Disney comics characters, Disney comics, Disney character, Fictional characters by medium, Comics by genre, Fictional characters, The Walt Disney Company]

Page 4:

Our goal

To automatically create a Wikipedia Bitaxonomy for Wikipedia pages and categories in a simultaneous fashion.

[Diagram labels: pages, categories]

Page 5:

Our goal

To automatically create a Wikipedia Bitaxonomy for Wikipedia pages and categories in a simultaneous fashion.

KEY IDEA: The page and category levels are mutually beneficial for inducing a wide-coverage and fine-grained integrated taxonomy.

Page 6:

Key idea

[Diagram of pages and categories connected by "is a" edges. Node labels: Disney comics characters, Disney comics, Disney character, The Walt Disney Company, Fictional characters by medium, Comics by genre, Fictional characters, Mickey Mouse, Funny Animal, Superman, Cartoon, Donald Duck]

Page 7:

A 3-phase method

Starting from two noisy graphs (pages and categories)

Page 8:

A 3-phase method

1. Build the page taxonomy

Page 9:

A 3-phase method

1. Build the page taxonomy
2. Bitaxonomy Algorithm

Page 10:

A 3-phase method

1. Build the page taxonomy
2. Bitaxonomy Algorithm

Page 11:

A 3-phase method

1. Build the page taxonomy
2. Bitaxonomy Algorithm
3. Refine the category taxonomy (+50% categories)

Page 12:

Contributions

1. Self-contained approach

2. Page taxonomy and category taxonomy built simultaneously

3. State-of-the-art results when compared to all other available taxonomies

Page 13:

1. The WiBi Page taxonomy

Page 14:

Assumptions

• The first sentence of a page is a good definition (also called gloss)

Page 15:

The WiBi Page taxonomy

1. [Syntactic step] Extract the hypernym lemma from a page definition using a syntactic parser;

2. [Semantic step] Apply a set of linking heuristics to disambiguate the extracted lemma.

Example: "Scrooge McDuck is a character […]"

Syntactic step: dependency parse of the definition (arc labels shown: nn, nsubj, cop). Hypernym lemma: character

Semantic step: disambiguate the lemma "character" in "Scrooge McDuck is a character […]"
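A minimal sketch of the syntactic step, under loud assumptions: the dependency triples below are written by hand in a Stanford-style notation (the slide only shows the arc labels nn, nsubj and cop), and the rule "take the governor of the copula arc" is a simplification chosen for illustration, not necessarily the authors' exact extraction rule.

```python
# Hand-written dependencies for "Scrooge McDuck is a character":
# (relation, governor, dependent). This parse is an illustrative
# assumption, not the output of a real parser.
PARSE = [
    ("nn", "McDuck", "Scrooge"),
    ("nsubj", "character", "McDuck"),
    ("cop", "character", "is"),
    ("det", "character", "a"),
]


def hypernym_lemma(dependencies):
    """Return the governor of the first copula (cop) arc, or None."""
    for relation, governor, _dependent in dependencies:
        if relation == "cop":
            return governor
    return None


lemma = hypernym_lemma(PARSE)
print(lemma)
```

In a copular definition the predicate noun governs the copula, which is why the governor of the cop arc is a reasonable first guess for the hypernym lemma.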

Page 16:

The semantic step

5 cascading linking heuristics:

1. Crowdsourced
2. Category
3. Multiword
4. Monosemous
5. Distributional

Ambiguous hypernym (‘player’) + target page (Cristiano Ronaldo) → linking heuristic → disambiguated hypernym (Football player)
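One way to read "cascading" is that the heuristics are tried in order and the first one that produces a sense wins. The sketch below assumes exactly that; the function names and the two toy heuristics are hypothetical stand-ins, and which heuristic actually resolves the Cristiano Ronaldo example is not stated on the slide.

```python
from typing import Callable, List, Optional, Tuple

# A heuristic maps (page title, hypernym lemma) to a sense page or None.
Heuristic = Callable[[str, str], Optional[str]]


def link_hypernym(page: str, lemma: str,
                  cascade: List[Tuple[str, Heuristic]]):
    """Try each heuristic in order; return (sense, heuristic name)."""
    for name, heuristic in cascade:
        sense = heuristic(page, lemma)
        if sense is not None:
            return sense, name
    return None, None


def crowdsourced(page: str, lemma: str) -> Optional[str]:
    # Toy stand-in: pretend the gloss carries no usable link.
    return None


def category(page: str, lemma: str) -> Optional[str]:
    # Toy stand-in: a hard-coded answer for the slide's example.
    toy = {("Cristiano Ronaldo", "player"): "Football player"}
    return toy.get((page, lemma))


CASCADE = [("crowdsourced", crowdsourced), ("category", category)]
result = link_hypernym("Cristiano Ronaldo", "player", CASCADE)
print(result)
```

Ordering the list from the most to the least reliable heuristic means a weaker heuristic only fires when every stronger one has abstained.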

Page 17:

1. Crowdsourced heuristic

Mickey Mouse is a funny animal cartoon character and the official mascot of The Walt Disney Company.

Use the links from the crowd!
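A rough sketch of how links written by Wikipedia editors could be reused, assuming the definition is available as wiki markup and that a link counts when its anchor text ends with the hypernym lemma. The markup string below is hypothetical (only the plain sentence appears on the slide), and the matching rule is an assumption.

```python
import re

# Matches [[Target]] and [[Target|anchor text]].
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def crowdsourced_link(gloss_markup, lemma):
    """Return the target of the first link whose anchor ends with lemma."""
    for match in WIKILINK.finditer(gloss_markup):
        target = match.group(1).strip()
        anchor = (match.group(2) or target).strip()
        words = anchor.lower().split()
        if words and words[-1] == lemma.lower():
            return target
    return None


# Hypothetical markup for the slide's sentence.
GLOSS = ("Mickey Mouse is a [[funny animal]] cartoon "
         "[[Character (arts)|character]] and the official mascot of "
         "[[The Walt Disney Company]].")

print(crowdsourced_link(GLOSS, "character"))
```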

Page 18:

2. Category heuristic

Given a page and its ambiguous hypernym, exploit its categories to build a distribution of the hypernym’s senses.

[Diagram: Donald Duck (ambiguous hypernym: Character) with its categories "Characters in Disney package films" and "Disney comics characters"; other pages shown: Pluto, Hook, Mickey Mouse, José Carioca, Goofy]

Page 19:

2. Category heuristic

Given a page and its ambiguous hypernym, exploit its categories to build a distribution of the hypernym’s senses.

Goofy is a funny animal cartoon character […]

José Carioca is a Disney cartoon character […]

Captain James Hook is a fictional character […]

Mickey Mouse is a funny animal cartoon character […]

Pluto, also called Pluto the Pup, is a cartoon character […]

Mickey Mouse is a funny animal cartoon character […]

[Diagram: Donald Duck (ambiguous hypernym: Character) with its categories "Characters in Disney package films" and "Disney comics characters"; other pages shown: Pluto, Hook, Mickey Mouse, José Carioca, Goofy]

Page 20:

2. Category heuristic

Given a page and its ambiguous hypernym, exploit its categories to build a distribution of the hypernym’s senses.

Donald Duck (ambiguous hypernym: Character)

Goofy is a funny animal cartoon character […]

José Carioca is a Disney cartoon character […]

Captain James Hook is a fictional character […]

Mickey Mouse is a funny animal cartoon character […]

Pluto, also called Pluto the Pup, is a cartoon character […]

Mickey Mouse is a funny animal cartoon character […]

Sense counts for the two categories (Characters in Disney package films, Disney comics characters):

Character (arts) 5, Funny animal 1

Character (arts) 3, Funny animal 1, Cartoon 1

Total: Character (arts) 8, Funny animal 2, Cartoon 1

Page 21:

2. Category heuristic

Given a page and its ambiguous hypernym, exploit its categories to build a distribution of the hypernym’s senses.

Sense distribution: Character (arts) 8, Funny animal 2, Cartoon 1

Donald Duck, ambiguous hypernym: Character → Character (arts)
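The counting in the Donald Duck example can be reproduced in a few lines. This is only a sketch: it assumes the distribution is the plain sum of the per-category counts and that the most frequent sense is selected; the slides do not say which of the two count lists belongs to which category, so the lists are left unnamed.

```python
from collections import Counter

# Per-category sense counts as shown on the slides.
per_category = [
    Counter({"Character (arts)": 5, "Funny animal": 1}),
    Counter({"Character (arts)": 3, "Funny animal": 1, "Cartoon": 1}),
]


def category_sense(counts_per_category):
    """Sum the per-category counts and return (top sense, distribution)."""
    total = Counter()
    for counts in counts_per_category:
        total.update(counts)  # Counter.update adds counts together
    if not total:
        return None, total
    return total.most_common(1)[0][0], total


sense, distribution = category_sense(per_category)
print(sense, dict(distribution))
```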

Page 22:

Page taxonomy linking heuristics

1. Crowdsourced (1.338M)
2. Category (1.603M)
3. Multiword (65K)
4. Monosemous (161K)
5. Distributional (561K)

Page 23:

Page taxonomy evaluation

Page 24:

The story so far

1. Noisy page graph → Page taxonomy

Page 25:

2. The Bitaxonomy algorithm

Page 26:

The Bitaxonomy algorithm

● The information available in the two taxonomies is mutually beneficial;
● At each step exploit one taxonomy to update the other and vice versa;
● Repeat until convergence.

Page 27:

The Bitaxonomy algorithm

Starting from the page taxonomy

[Diagram: pages Real Madrid F.C., Atlético Madrid, Football team; categories Football clubs in Madrid, Football clubs, Football teams; one "is a" edge]

Page 28:

The Bitaxonomy algorithm

Exploit the cross links to infer hypernym relations in the category taxonomy

[Diagram: the same football pages and categories, now with two "is a" edges]

Page 29:

The Bitaxonomy algorithm

Take advantage of cross links to infer back is-a relations in the page taxonomy

[Diagram: the same football pages and categories, now with three "is a" edges]

Page 30:

The Bitaxonomy algorithm

Use the relations found in the previous step to infer new hypernym edges

[Diagram: the same football pages and categories, now with four "is a" edges]

Page 31:

The Bitaxonomy algorithm

Mutual enrichment of both taxonomies until convergence

[Diagram: pages Real Madrid F.C., Atlético Madrid, Football team; categories Football clubs in Madrid, Football clubs, Football teams; four "is a" edges]
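The alternating update can be sketched as a small fixed-point loop. Everything beyond the node names is an assumption made for illustration: the page-to-category memberships, the simple majority vote, and the helper name `update` are hypothetical and much simpler than the actual Bitaxonomy algorithm.

```python
from collections import Counter


def update(own_hyper, own_items, to_other, other_hyper, to_own):
    """Give each uncovered item a hypernym by voting through the other side.

    item -> linked items on the other side -> their hypernyms
         -> items on our side linked to those hypernyms (the candidates).
    """
    changed = False
    for item in own_items:
        if item in own_hyper:
            continue
        votes = Counter()
        for linked in to_other.get(item, ()):
            hypernym = other_hyper.get(linked)
            if hypernym is None:
                continue
            for candidate in to_own.get(hypernym, ()):
                if candidate != item:
                    votes[candidate] += 1
        if votes:
            own_hyper[item] = votes.most_common(1)[0][0]
            changed = True
    return changed


# Cross links (page -> categories). These memberships are assumed.
cats_of = {
    "Real Madrid F.C.": ["Football clubs in Madrid"],
    "Atlético Madrid": ["Football clubs in Madrid", "Football clubs"],
    "Football team": ["Football teams"],
}
pages_in = {}
for page, cats in cats_of.items():
    for cat in cats:
        pages_in.setdefault(cat, []).append(page)

page_hyper = {"Real Madrid F.C.": "Football team"}  # initial page taxonomy
cat_hyper = {}                                      # empty category taxonomy

while True:
    cats_changed = update(cat_hyper, list(pages_in), pages_in, page_hyper, cats_of)
    pages_changed = update(page_hyper, list(cats_of), cats_of, cat_hyper, pages_in)
    if not (cats_changed or pages_changed):
        break

print(page_hyper)
print(cat_hyper)
```

On this toy input the first pass links "Football clubs in Madrid" to "Football teams" and then "Atlético Madrid" to "Football team"; the second pass uses that new page edge to cover "Football clubs", after which nothing changes and the loop stops.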

Page 32:

Page taxonomy evaluation (cont’d)

A noticeable 3% increase in recall and coverage, with unchanged precision

Page 33:

Category taxonomy evaluation

Page 34:

The story so far

2

Page 35:

3. The WiBi category taxonomy refinement

Page 36:

Category taxonomy refinement

Some categories are affected by structural problems.

[Diagram: categories Comics characters by protagonist, Comics characters, Garfield characters; one category is marked "No pages associated!"]

Page 37:

Category taxonomy refinement

● 3 refinement procedures to obtain broader coverage for categories
  o Single super category
  o Sub-categories
  o Super-categories

Page 38:

Single super category

This category has only 1 outgoing edge, so we promote its only super category to hypernym.

[Diagram: categories Comics characters by protagonist, Comics characters, Garfield characters, Animated television characters by series, Animated characters, Fictional characters by medium, Animation]
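The single super category rule is simple enough to sketch directly. The super-category edges below are assumed for illustration (the slide names the categories but the transcript does not preserve which edges connect them), and `promote_single_super` is a hypothetical helper name.

```python
def promote_single_super(super_categories, hypernym_of):
    """Promote the only super category of each uncovered category.

    super_categories: category -> list of super categories (raw graph)
    hypernym_of:      category -> hypernym already in the taxonomy
    """
    promoted = {}
    for category, supers in super_categories.items():
        if category not in hypernym_of and len(supers) == 1:
            promoted[category] = supers[0]
    return promoted


# Assumed raw edges, for illustration only.
raw_supers = {
    "Comics characters by protagonist": ["Comics characters"],
    "Animated television characters by series": [
        "Animated characters",
        "Fictional characters by medium",
    ],
    "Garfield characters": ["Comics characters by protagonist"],
}
already_covered = {"Garfield characters": "Comics characters by protagonist"}

new_edges = promote_single_super(raw_supers, already_covered)
print(new_edges)
```

Categories with several super categories are left untouched here, since the rule gives no way to choose among them.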

Page 39:

Sub-categories

Focus on subcategories which have already been covered!

[Diagram: categories Comics characters by company, Disney comics, Comics by company, Comics characters, DC Comics characters, Marvel Comics characters, Comics titles by company]

Page 40:

Sub-categories

Focus on subcategories which have already been covered!

[Diagram: categories Comics characters by company, Disney comics, Comics by company, Comics characters, DC Comics characters, Comics titles by company, Marvel Comics characters; annotations: "Only 1 path ending in u", "2 paths ending in v"]

Page 41:

Category taxonomy evaluation: coverage

+50% categories covered!

[Chart axis labels: 1SUP, SUB, SUPER]

Page 42:

Category taxonomy evaluation: P & R

[Chart axis labels: Iterations, 1SUP, SUB, SUPER; annotations: +35% recall, 86%]

Page 43:

Experimental setup

● We created 2 datasets:
  o 1000 randomly sampled pages;
  o 1000 randomly sampled categories.

● Each item was annotated with the most suitable generalization (lemma+page or category).

Page 44:

Competitors

WikiNet

MENTA

WikiTaxonomy

[Diagram labels: pages, categories]

Page 45:

Measures

● We calculated typical measures to assess the quality of all the possible taxonomies:
  o Precision
  o Recall
  o Coverage
  o Specificity
  o Granularity
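For the first three measures, a sketch under assumed definitions (the slides name the measures but do not define them): precision is the average fraction of correct hypernyms among those returned for an item, recall is the fraction of items with at least one correct hypernym, and coverage is the fraction of items with any answer at all. The gold and system answers below are invented toy data, not the authors' datasets.

```python
def evaluate(system, gold):
    """Return (precision, recall, coverage) under the assumed definitions.

    system: item -> set of hypernyms returned (missing or empty = no answer)
    gold:   item -> set of correct hypernyms (assumed non-empty mapping)
    """
    answered = [item for item in gold if system.get(item)]
    correct = [item for item in answered if system[item] & gold[item]]
    if answered:
        precision = sum(
            len(system[item] & gold[item]) / len(system[item])
            for item in answered
        ) / len(answered)
    else:
        precision = 0.0
    recall = len(correct) / len(gold)
    coverage = len(answered) / len(gold)
    return precision, recall, coverage


# Invented toy annotations and system output.
gold = {
    "Mickey Mouse": {"Character (arts)"},
    "Frank Sinatra": {"Singer"},
    "Real Madrid F.C.": {"Football team"},
    "Scrooge McDuck": {"Character (arts)"},
}
system = {
    "Mickey Mouse": {"Character (arts)"},
    "Frank Sinatra": {"Actor"},
    "Real Madrid F.C.": {"Football team"},
}

precision, recall, coverage = evaluate(system, gold)
print(precision, recall, coverage)
```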

Page 46:

Page taxonomy comparison

Page 47:

Page taxonomy comparison

Page 48:

Category taxonomy comparison

Page 49:

Category taxonomy comparison

Page 50:

Category taxonomy comparison

Specificity measure

Page 51:

Measuring specificity

A system is more specific than another when the hypernym(s) provided by the former are more specific/informative than those provided by the latter.

Example: “Frank Sinatra is a”

System 1: “Singer”

System 2: “Swing singer”

System 1 is less specific than System 2.

Page 52:

Page taxonomy specificity

Ratio of the times in which WiBi provided a more specific answer than the other system

Page 53:

Page taxonomy specificity

Ratio of the times in which WiBi provided a less specific answer than the other system

Page 54:

Category taxonomy specificity

Page 55:

Measuring granularity

pages categories

Page 56:

Conclusions

● Unified, 3-phase approach to the construction of a bitaxonomy for the English Wikipedia;
● Self-contained, no additional resources or supervision required;
● Nearly full coverage of Wikipedia pages and categories;
● State-of-the-art performance both on pages and categories.

wibitaxonomy.org

Page 57:

Tiziano Flati, Daniele Vannella, Tommaso Pasini, Roberto Navigli

Linguistic Computing Laboratory
lcl.uniroma1.it