Mining Interesting Trivia for Entities from Wikipedia PART-II


Transcript of Mining Interesting Trivia for Entities from Wikipedia PART-II

Page 1: Mining Interesting Trivia for Entities from Wikipedia PART-II

Mining Interesting Trivia for Entities from Wikipedia

Presented By: Abhay Prakash, En. No. 10211002, IIT Roorkee

Supervised By: Dr. Dhaval Patel, Assistant Professor, IIT Roorkee

Dr. Manoj K. Chinnakotla, Applied Researcher, Microsoft India

Page 2: Mining Interesting Trivia for Entities from Wikipedia PART-II

Publication Accepted

[1] Abhay Prakash, Manoj K. Chinnakotla, Dhaval Patel, Puneet Garg: "Did You Know? Mining Interesting Trivia for Entities from Wikipedia". In 24th International Joint Conference on Artificial Intelligence (IJCAI), 2015.

Conference Rating: A*

Page 3: Mining Interesting Trivia for Entities from Wikipedia PART-II

Introduction: Problem Statement

Definition: Trivia are any facts about an entity which are interesting due to any of the following characteristics: unusualness, uniqueness, unexpectedness or weirdness. They generally appear in "Did you know?" articles.

E.g. "To prepare for the Joker's role, Heath Ledger secluded himself in a hotel room for a month." [The Dark Knight]

It is unusual for an actor to seclude himself for a month.

Problem Statement: For a given entity, mine the top-k interesting trivia from its Wikipedia page, where a trivia is considered interesting if, when it is shown to N persons, more than N/2 persons find it interesting. For evaluation on the unseen set, we chose N = 5 (statistical significance discussed in the mid-term evaluation). A small sketch of this labelling rule follows below.
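A minimal sketch of the majority-vote criterion, assuming each trivia simply carries a list of binary annotator votes (the function and data are illustrative, not the actual evaluation code):

```python
# A trivia shown to N judges is labelled "interesting" when more than N/2
# of them vote it interesting.
def is_interesting(votes):
    """votes: list of 0/1 judgments from N annotators."""
    return sum(votes) > len(votes) / 2

# Example with N = 5 judges, as used for the unseen test set.
print(is_interesting([1, 1, 1, 0, 0]))  # True  (3 of 5 found it interesting)
print(is_interesting([1, 0, 1, 0, 0]))  # False (only 2 of 5)
```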

Page 4: Mining Interesting Trivia for Entities from Wikipedia PART-II

Wikipedia Trivia Miner (WTM)

Based on an ML approach to mine trivia from unstructured text.

Trains a ranker using sample trivia of the target domain; we experiment with Movie entities and Celebrity entities.

Harnesses the trained ranker to mine trivia from an entity's Wikipedia page, retrieving the top-k standalone interesting sentences from the page.

Why Wikipedia? Reliable for factual correctness.

Ample number of interesting trivia (56/100 in our experiment).

Page 5: Mining Interesting Trivia for Entities from Wikipedia PART-II

System Architecture

Candidate Selection: identifies candidate sentences from Wikipedia.

Filtering & Grading: filters out noisy samples and assigns a grade to each sample, as required by the ranker.

Interestingness Ranker: extracts features from the samples/candidates and trains the ranker (SVMrank) / ranks the candidates.

[Architecture diagram, Wikipedia Trivia Miner (WTM): Human Voted Trivia Source → Filtering & Grading → Feature Extraction → SVMrank (Train Dataset); Candidates' Source → Candidate Selection → Feature Extraction → SVMrank → Top-K Interesting Trivia from Candidates; a Knowledge Base feeds Feature Extraction.]

Page 6: Mining Interesting Trivia for Entities from Wikipedia PART-II

[Same architecture diagram, annotated with the two execution phases: the Train Phase (Human Voted Trivia Source → Filtering & Grading → Feature Extraction → SVMrank → Model) and the Retrieval Phase (Candidates' Source → Candidate Selection → Feature Extraction → SVMrank → Top-K Interesting Trivia from Candidates), both drawing on the Knowledge Base.]

Execution Phases

Train Phase: crawls and prepares the training data.

Featurizes the training data.

Trains SVMrank to build a model (the SVMrank input format is sketched below).

Retrieval Phase: crawls the entity's Wikipedia text.

Identifies candidates for trivia.

Featurizes the candidates.

Ranks the candidates using the already-built model.
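SVMrank reads graded samples in the SVM-light text format. Below is a minimal, illustrative sketch of writing such a training file; the grades, query ids and feature ids are placeholders, not the actual dataset:

```python
# Sketch: write graded training trivia in the SVM-light format expected by
# SVMrank (one line per sample: "<grade> qid:<query_id> <feat_id>:<value> ...").
def write_svmrank_file(samples, path):
    """samples: list of (grade, qid, {feature_id: value}) tuples."""
    with open(path, "w") as f:
        for grade, qid, feats in samples:
            pairs = " ".join(f"{fid}:{val:g}" for fid, val in sorted(feats.items()))
            f.write(f"{grade} qid:{qid} {pairs}\n")

samples = [
    (3, 1, {1: 0.25, 7: 1.0}),   # highly-voted trivia of entity 1
    (1, 1, {2: 0.10}),           # low-voted trivia of entity 1
    (2, 2, {1: 0.40, 5: 1.0}),   # trivia of entity 2
]
write_svmrank_file(samples, "train.dat")
# Training and ranking are then done with the SVMrank binaries, e.g.:
#   svm_rank_learn -c 3 train.dat model.dat
#   svm_rank_classify candidates.dat model.dat predictions
```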

Page 7: Mining Interesting Trivia for Entities from Wikipedia PART-II

Feature Engineering

Unigram (U) Features:

Each word's TF-IDF. Significance: identifies important words which make the trivia interesting. Sample features: "stunt", "award", "improvise". Example trivia: "Tom Cruise did all of his own stunt driving."

Linguistic (L) Features:

Superlative Words. Significance: show extremeness (uniqueness). Sample features: "best", "longest", "first". Example trivia: "The longest animated Disney film since Fantasia (1940)."

Contradictory Words. Significance: opposing ideas could spark intrigue and interest. Sample features: "but", "although", "unlike". Example trivia: "The studios wanted Matthew McConaughey for the lead role, but James Cameron insisted on Leonardo DiCaprio."

Root Word (Main Verb). Significance: captures the core activity being discussed in the sentence. Sample feature: root_gross. Example trivia: "Gravity grossed $274 Mn in North America."

Subject Word (First Noun). Significance: captures the core thing being discussed in the sentence. Sample feature: subj_actor. Example trivia: "The actors snorted crushed B vitamins for scenes involving cocaine."

Readability. Significance: complex and lengthy trivia are hardly interesting. Sample feature: FOG Index binned into 3 bins.
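A rough sketch of this Linguistic (L) bucket. spaCy is used here as a stand-in tagger/parser (the dissertation does not prescribe a specific tool); the feature names (supPOS, root_*, subj_*, FOG) follow the slides, and the nominal subject is used as a stand-in for the slide's "first noun":

```python
import spacy

nlp = spacy.load("en_core_web_sm")
CONTRADICTORY = {"but", "although", "unlike", "however", "despite"}

def syllables(word):
    # Rough vowel-group heuristic, only used for the FOG estimate below.
    count, prev = 0, False
    for ch in word.lower():
        vowel = ch in "aeiouy"
        count += vowel and not prev
        prev = vowel
    return max(count, 1)

def linguistic_features(sentence):
    doc = nlp(sentence)
    words = [t for t in doc if t.is_alpha]
    feats = {}
    # Superlative words ("best", "longest", ...): count of JJS/RBS tags
    feats["supPOS"] = sum(t.tag_ in ("JJS", "RBS") for t in doc)
    # Contradictory discourse markers
    feats["contradictory"] = int(any(t.lower_ in CONTRADICTORY for t in doc))
    # Root word (main verb) of the dependency parse
    root = next((t for t in doc if t.dep_ == "ROOT"), None)
    if root is not None:
        feats["root_" + root.lemma_.lower()] = 1
    # Subject word (nominal subject as a stand-in for the first noun)
    subj = next((t for t in doc if t.dep_ in ("nsubj", "nsubjpass")), None)
    if subj is not None:
        feats["subj_" + subj.lemma_.lower()] = 1
    # Readability: Gunning FOG index, later binned into 3 bins
    complex_words = sum(syllables(t.text) >= 3 for t in words)
    n_sents = max(len(list(doc.sents)), 1)
    feats["FOG"] = 0.4 * (len(words) / n_sents
                          + 100.0 * complex_words / max(len(words), 1))
    return feats

print(linguistic_features("The longest animated Disney film since Fantasia (1940)."))
```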

Page 8: Mining Interesting Trivia for Entities from Wikipedia PART-II

Feature Engineering (Contd…)

Entity (E) Features:

Generic NEs. Significance: capture general about-ness. Sample features: MONEY, ORGANIZATION, PERSON, DATE, TIME and LOCATION. Example trivia: "The guns in the film were supplied by Aldo Uberti Inc., a company in Italy." (fires ORGANIZATION and LOCATION)

Related Entities. Significance: capture specific about-ness (entities resolved using DBpedia). Sample features: entity_producer, entity_director. Example trivia: "According to Victoria Alonso, Rocket Raccoon and Groot were created through a mix of motion-capture and rotomation VFX." (fires entity_producer, entity_character)

Entity Linking before Parsing. Significance: captures the generalized story of the sentence. Sample feature: subj_entity_producer. Example: the same trivia above is parsed as "According to entity_producer, …", so subj_Victoria becomes subj_entity_producer.

Focus Entities. Significance: capture the core entities being talked about. Sample feature: underroot_entity_producer. Example: the same trivia above fires underroot_entity_producer and underroot_entity_character.
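A rough sketch of this Entity (E) bucket: generic NE labels plus related entities resolved from DBpedia, with entity linking done before parsing so that the subj_entity_* / underroot_entity_* features generalize across entities. spaCy, the attribute dict and the "directly under the root" reading of focus entities are illustrative assumptions:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_features(sentence, dbpedia_attrs):
    """dbpedia_attrs: e.g. {'producer': 'Victoria Alonso', 'character': 'Rocket Raccoon'}"""
    feats = {}
    # 1) Generic NEs capture general about-ness (MONEY, ORG, PERSON, DATE, ...)
    for ent in nlp(sentence).ents:
        feats[ent.label_] = 1
    # 2) Related entities: replace each matched 'value' by entity_'attribute'
    linked = sentence
    for attr, value in dbpedia_attrs.items():
        if value and value in linked:
            linked = linked.replace(value, "entity_" + attr)
            feats["entity_" + attr] = 1
    # 3) Re-parse the linked sentence for generalized subject / focus features
    doc = nlp(linked)
    for t in doc:
        if t.text.startswith("entity_"):
            if t.dep_ in ("nsubj", "nsubjpass"):
                feats["subj_" + t.text] = 1
            if t.head.dep_ == "ROOT":     # token attached directly under the root
                feats["underroot_" + t.text] = 1
    return feats

print(entity_features(
    "According to Victoria Alonso, Rocket Raccoon and Groot were created "
    "through a mix of motion-capture and rotomation VFX.",
    {"producer": "Victoria Alonso", "character": "Rocket Raccoon"}))
```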

Page 9: Mining Interesting Trivia for Entities from Wikipedia PART-II

Feature Engineering: Example

Ex. "According to Victoria Alonso, Rocket Raccoon and Groot were created through a mix of motion-capture and rotomation VFX."

Features extracted: 18025 (U) + 5 (L) + 4686 (E) columns in total for all train data.

Feature values for this sentence: create: 0.25, mix: 0.75, motion: 0.96, capture: 0.4, rotomation: 0.85, VFX: 0.75, root_create: 1, supPOS: 0, subj_entity_producer: 1, FOG: 3, contradictory: 0, entity_producer: 1, entity_character: 1, underroot_entity_producer: 1, underroot_entity_character: 1.

The rest of the features have value 0: entity_actor = 0, award = 0, subj_actor = 0, root_win = 0, ….
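One way to assemble such sparse rows, using scikit-learn's DictVectorizer purely as an illustration of the representation (the dissertation does not specify the tooling):

```python
# Sketch: merge the per-candidate feature dicts into one sparse matrix whose
# columns span all U + L + E features seen in the training data.
from sklearn.feature_extraction import DictVectorizer

rows = [
    # the example sentence above (only its listed values are given;
    # every feature absent from the dict is implicitly 0)
    {"create": 0.25, "mix": 0.75, "motion": 0.96, "capture": 0.4,
     "rotomation": 0.85, "VFX": 0.75, "root_create": 1, "supPOS": 0,
     "subj_entity_producer": 1, "FOG": 3, "contradictory": 0,
     "entity_producer": 1, "entity_character": 1,
     "underroot_entity_producer": 1, "underroot_entity_character": 1},
    # ... one dict per training sample / candidate
]
vec = DictVectorizer(sparse=True)
X = vec.fit_transform(rows)
print(X.shape)   # (num_samples, num_distinct_features)
```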

Page 10: Mining Interesting Trivia for Entities from Wikipedia PART-II

Comparative Approaches

I. Random [Baseline I]:

- 10 sentences picked randomly from Wikipedia

II. CS + Random:

- Candidates Selected (standalone, context-independent sentences), i.e., sentences like "it really reminds me of my childhood" are removed

- 10 sentences picked randomly from the candidates

III. CS + supPOS (Best) [Baseline II]:

- Candidates Selected

- Ranked by the number of superlative words (sketched below)

- For the same number of superlative words, the interesting sentence is deliberately ranked first (best case for this baseline)

supPOS (Best Case) ranking example:

Rank 1: 2 superlative words, Interesting
Rank 2: 2 superlative words, Boring
Rank 3: 1 superlative word, Interesting
Rank 4: 1 superlative word, Interesting
Rank 5: 1 superlative word, Interesting
Rank 6: 1 superlative word, Boring
Rank 7: 1 superlative word, Boring
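A minimal sketch of the supPOS baseline, again using spaCy as a stand-in tagger; the "Best Case" variant additionally assumes ties are always broken in favour of the interesting sentence:

```python
# Rank candidate sentences by their count of superlative adjectives/adverbs
# (JJS / RBS POS tags) and keep the top 10.
import spacy

nlp = spacy.load("en_core_web_sm")

def count_superlatives(sentence):
    return sum(t.tag_ in ("JJS", "RBS") for t in nlp(sentence))

def suppos_rank(candidates):
    return sorted(candidates, key=count_superlatives, reverse=True)

candidates = [
    "The film was shot in Iceland.",
    "The longest animated Disney film since Fantasia (1940).",
]
print(suppos_rank(candidates)[:10])   # top-10 sentences by superlative count
```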

Page 11: Mining Interesting Trivia for Entities from Wikipedia PART-II

Variants of WTM

I. WTM (U):

- Candidates Selected

- ML ranking of candidates using only Unigram features

II. WTM (U+L+E):

- Candidates Selected

- ML ranking of candidates using all features: Unigram (U) + Linguistic (L) + Entity (E)

Page 12: Mining Interesting Trivia for Entities from Wikipedia PART-II

Results: P@10

The metric is Precision at 10 (P@10): out of the top 10 ranked candidates, how many are actually interesting (a small sketch follows below the chart).

[Bar chart, P@10 by approach: Random 0.25, CS+Random 0.30, supPOS (Best Case) 0.34, WTM (U) 0.34, WTM (U+L+E) 0.45]
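A minimal sketch of the P@10 computation; the label list is illustrative:

```python
# P@10: the share of the 10 top-ranked candidates that judges labelled interesting.
def precision_at_k(ranked_labels, k=10):
    top = ranked_labels[:k]
    return sum(top) / len(top)

print(precision_at_k([1, 0, 1, 1, 0, 0, 1, 0, 0, 1]))   # -> 0.5
```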

Page 13: Mining Interesting Trivia for Entities from Wikipedia PART-II

Results: P@10 (contd.)

CS+Random > Random: shows the significance of Candidate Selection.

[Same P@10 bar chart as on the previous slide]


Page 14: Mining Interesting Trivia for Entities from Wikipedia PART-II

Results: P@10 (contd.)

CS+Random > Random: shows the significance of Candidate Selection.

WTM (U+L+E) >> WTM (U): shows the significance of the engineered Linguistic (L) and Entity (E) features.

[Same P@10 bar chart as on the previous slide]


Page 15: Mining Interesting Trivia for Entities from Wikipedia PART-II

Results: Recall@K

supPOS is limited to one kind of trivia, whereas WTM captures varied types: 62% recall by rank 25.

Performance comparison: supPOS is better up to rank 3; soon after rank 3, WTM beats supPOS.

[Line chart, % Recall vs. Rank (0-25) for supPOS (Best Case), WTM and Random]
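A minimal sketch of the Recall@K computation used above; the label list is illustrative:

```python
# Recall@K: of all candidates judged interesting, the fraction that appears
# within the top K ranked positions.
def recall_at_k(ranked_labels, k):
    total = sum(ranked_labels)
    return sum(ranked_labels[:k]) / total if total else 0.0

labels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1,
          0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
          1, 0, 0, 0, 1]                      # 25 ranked candidates
print([round(recall_at_k(labels, k), 2) for k in (3, 10, 25)])
```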

Page 16: Mining Interesting Trivia for Entities from Wikipedia PART-II

Sensitivity to Training Size

Current results are reported with 6,163 training trivia.

WTM precision increases with training size: a desirable property, as precision can be improved by taking more training data.

Page 17: Mining Interesting Trivia for Entities from Wikipedia PART-II

WTM's Domain Independence

Experiment on the Celebrity domain to justify the claim of domain independence.

Dataset: crawled trivia for the top 1000 movie celebrities from IMDb and performed a 5-fold test.

Train dataset: 4459 trivia (106 entities)

Test dataset: 500 trivia (10 entities)

The only features in doubt of being domain dependent are the Entity features:

Unigram (U) Features: all words

Linguistic (L) Features: subj_actor, root_reveal, subj_scene, but, best, FOG_index = 7.2

Entity (E) Features: entity_producer, entity_director, …

Page 18: Mining Interesting Trivia for Entities from Wikipedia PART-II

WTM's Domain Independence (Contd…)

Entity features are domain independent too.

Entity features are automatically generated using attribute:value pairs from DBpedia: when a 'value' matches in a sentence, the match is replaced by entity_'attribute'.

Unigram (U) and Linguistic (L) features are clearly domain independent.

[Figure: DBpedia (attribute:value) pairs for Batman Begins and sample trivia for Batman Begins]


Page 20: Mining Interesting Trivia for Entities from Wikipedia PART-II

WTM's Domain Independence (Contd…)

Entity Feature Generation from DBpedia:

Movie Domain (ex. Batman Begins (2005)): DBpedia attribute:value "Director: Christopher Nolan" generates the feature entity_director; "Producer: Larry J. Franco" generates entity_producer.

Celebrity Domain (ex. Angelina Jolie): DBpedia attribute:value "Partner: Brad Pitt" generates entity_partner; "birthplace: California" generates entity_birthPlace.

Example of Entity Features in the Celebrity Domain:

entity_partner (Johnny Depp): "Engaged to Amber Heard [January 17, 2014]." **

entity_citizenship (Nicole Kidman): "First Australian actress to win the Best Actress Academy Award."

** After entity linking, the sentence is parsed as "Engaged to entity_partner".
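A rough sketch of pulling such attribute:value pairs from the public DBpedia SPARQL endpoint with SPARQLWrapper and turning a matched value into an entity_<attribute> feature. The query shape and the ontology-namespace filter are assumptions, not the exact queries used in the dissertation, and object values may come back as URIs that still need proper label resolution:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_attribute_values(resource):
    """resource: DBpedia resource name, e.g. 'Batman_Begins' or 'Angelina_Jolie'."""
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(
        "SELECT ?p ?o WHERE { "
        f"<http://dbpedia.org/resource/{resource}> ?p ?o . "
        "FILTER(STRSTARTS(STR(?p), 'http://dbpedia.org/ontology/')) }")
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    # Object values may be resource URIs (e.g. .../Christopher_Nolan); a crude
    # surface form is recovered here from the last path segment.
    return [(r["p"]["value"].rsplit("/", 1)[-1],
             r["o"]["value"].rsplit("/", 1)[-1].replace("_", " "))
            for r in rows]

def entity_features_for(sentence, resource):
    feats = {}
    for attr, value in dbpedia_attribute_values(resource):
        if value and value in sentence:
            feats["entity_" + attr] = 1
    return feats

# May yield e.g. {'entity_partner': 1} if DBpedia lists Amber Heard as partner.
print(entity_features_for("Engaged to Amber Heard.", "Johnny_Depp"))
```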

Page 21: Mining Interesting Trivia for Entities from Wikipedia PART-II

Feature Contribution (Movie v/s Celeb.)

Movie Domain, top-ranked features (Rank / Feature / Group):

1 / subj_scene / Linguistic
2 / subj_entity_cast / Linguistic + Entity
3 / entity_produced_by / Entity
4 / underroot_unlinked_organization / Linguistic + Entity
6 / root_improvise / Linguistic
7 / entity_character / Entity
8 / MONEY / Entity (NER)
14 / stunt / Unigram
16 / superPOS / Linguistic
17 / subj_actor / Linguistic

Celebrity Domain, top-ranked features (Rank / Feature / Group):

1 / win / Unigram
3 / magazine / Unigram
4 / superPOS / Linguistic
5 / MONEY / Entity (NER)
6 / entity_alternativenames / Entity
7 / root_engage / Linguistic
14 / subj_earnings / Linguistic
15 / subj_entity_children / Linguistic + Entity
18 / entity_birthplace / Entity
19 / subj_unlinked_location / Linguistic + Entity

Top features: our engineered features are useful and intuitive for humans too.

Entity Linking leads to better generalization (instead of entity_wolverine, the model gets entity_cast).

Page 22: Mining Interesting Trivia for Entities from Wikipedia PART-II

Results: P@10 (Celebrity Domain)

[Bar chart, P@10 by approach: Random 0.39, supPOS (Best Case) 0.54, WTM (U) 0.58, WTM (U+L+E) 0.71]

Again WTM (U+L+E) >> WTM (U): significance of the advanced (L) and (E) features.

Hence, the features and approach are domain independent.

For entities of any domain, just replace the train data (sample trivia).

Page 23: Mining Interesting Trivia for Entities from Wikipedia PART-II

Dissertation Contribution

Identified, defined and provided a novel research problem, rather than only providing a solution to an existing problem.

Proposed a domain-independent system, the "Wikipedia Trivia Miner (WTM)", to mine the top-k trivia for any given entity based on their interestingness.

Engineered features that capture the 'about-ness' of a sentence and generalize which ones are interesting.

Proposed a mechanism to prepare ground truth for the test set that is cost-effective but statistically significant.

Page 24: Mining Interesting Trivia for Entities from Wikipedia PART-II

Future Works

New features to increase ranking quality. Unusualness: probability of occurrence of the sentence in the considered domain.

Fact popularity: lesser-known trivia could be more interesting to the majority of people.

Trying deep learning: could be helpful, as in the case of sarcasm detection.

Generating questions from mined trivia, to present trivia in question form.

Obtaining personalized interesting trivia: in this dissertation work, we defined interestingness based on majority voting; ranking could instead be based on user demographics.

Page 25: Mining Interesting Trivia for Entities from Wikipedia PART-II