Page 1:

NLP Research at Yahoo! Barcelona

Jordi Atserias

Seminari UB 2010

http://labs.yahoo.com/Yahoo_Labs_Barcelona

Page 2:

Summary

1. Presentation of the Yahoo! Barcelona NLR group

2. Aims of the group

3. NLR Architecture

4. Domain Adaptation

5. Applications

   5.1. Correlator

   5.2. Quest

6. Living Knowledge (Ad)

Page 3:

Hugo Zaragoza

Ricardo Baeza

http://research.yahoo.com/Yahoo_Research_Barcelona

Page 4:

Natural Language Retrieval Group

- Giuseppe Attardi (U. Pisa): Indexing (IXE), Dependency Parsing
- Hugo Zaragoza: ML, IR
- Michael Matthews: NLP
- Jordi Atserias: NLP, Architecture
- Massimiliano Ciaramita
- Mihai Surdeanu
- Roi Blanco: IR
- Sebastiano Vigna (U. Milan), Paolo Boldi: Indexing (MG4J)
- Peter Mika: Semantic Web

“Every time I fire a linguist my performance goes up…” (Fred Jelinek)

Great strategy until you’ve fired them all… but what then?


Page 6:

NLR: NLP for IR

• Shallow NLP processing as additional information for Information Retrieval tasks

• Explore new ways of browsing

• Support your answers

• Hypotheses/Requirements:
  – Linear extraction/parsing time (50K w x m x s)
  – Error-prone output (e.g. 60-90%)
  – Highly redundant information

Page 7:

Support your answers!

• Humans need to “verify” unknown facts
  – Multiple sources of evidence
  – Common sense vs. contradictions
  – Are you sure? Is this spam? Interesting!

• Their tolerance to errors greatly increases if they can verify things fast
  – Importance of snippets, image search

• Often the context is as important as the fact
  – E.g. “S discovered the transistor in X in 19X.”

• There are different kinds of errors
  – Ridiculous result (decreases overall confidence in system)
  – Reasonably wrong result (makes us feel good)
  – Partial result, etc.

Page 8:

New ways of Browsing…

Page 9:

Open canvas for HTML

Page 10:

MicroSearch, example: ivan herman

query: ivan herman

• Related pages based on metadata
• Conferences he plans to attend and other events from homepage, plus bio events from LinkedIn
• Geolocation

©Peter Mika

Page 11:

MicroSearch, example: san francisco conference

Conferences in San Francisco date

©Peter Mika

Page 12:

What can we do with automatic annotations?

Taxonomizing the world is… hopeless?

• Closed domains (deep specialist tags): Enzyme_name, battery_life, diagnosis, etc.

• Open domains (shallow generalist tags): Person/Organization Name, Location, Time…


Page 14:

NLR Architecture in Action

[Figure: architecture diagram. Labelled components: Corpus, NLP Pipeline, Forward Index, Tag Graph, Index, Search Engine, Graph Engine (Java, WebGraph), NLR Search Engine, RMI & REST APIs, C++ / Java / REST APIs, “Your Killer Application!”]

Page 15:

NLP Pipeline

[Figure: pipeline diagram. Labelled components: Segmentation, Tokenisation, POS, Word Semantics, Named Entities, Dependency Parser, Anaphora Resolution, Corpus Adaptation, Hadoopize, Code Docum. & Support, Multilingual Support, Sentiment Analysis, Gazetteers, SRL, Pipeline Service, Open Source, SuperSense Tagger]

Page 16:

Example of NLP tech.

Page 17:

Semantic Tagging

Page 18:

Semantic Tagging

Page 19:

Dependency parsing

Page 20:

Data Structures

• Multiple overlapping bracketed strings (BIO-Format)

TERM    The    Tories          won                  this   election
POS     DT     NNPS            VBD                  DT     NN
LEMMA   the    tory            win                  this   election
WNSS    0      B-noun.person   B-verb.competition   0      B-noun.act
HEAD    2      3               0                    5      3
DEP     NMOD   SUB             VMOD                 NMOD   OBJ

More semantic tags: WSJ, CONLL, …

Tags can be used in retrieving:

tokens matches deprel [ DEP/SUB:* LEMMA/win:*]
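The token-aligned layers on this slide can be sketched in Python as parallel lists, one per annotation layer. The block below is only an illustration: the names `LAYERS`, `bio_spans` and `match_sequence` are hypothetical, and reading the query `[ DEP/SUB:* LEMMA/win:*]` as "a SUB token immediately followed by a token whose lemma is win" is an assumption, not something the slide states.

```python
from typing import Dict, List, Tuple

# Token-aligned annotation layers for the example sentence on the slide.
LAYERS: Dict[str, List[str]] = {
    "TERM":  ["The", "Tories", "won", "this", "election"],
    "POS":   ["DT", "NNPS", "VBD", "DT", "NN"],
    "LEMMA": ["the", "tory", "win", "this", "election"],
    "WNSS":  ["0", "B-noun.person", "B-verb.competition", "0", "B-noun.act"],
    "HEAD":  ["2", "3", "0", "5", "3"],
    "DEP":   ["NMOD", "SUB", "VMOD", "NMOD", "OBJ"],
}


def bio_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Decode a BIO sequence into (label, start, end) spans, end exclusive.

    Any tag that is not B-x or a continuing I-x (here "0") closes the open span.
    """
    spans = []
    label, start = None, 0
    for i, tag in enumerate(list(tags) + ["0"]):  # sentinel closes the last span
        if tag.startswith("I-") and label == tag[2:]:
            continue  # still inside the current span
        if label is not None:
            spans.append((label, start, i))
            label = None
        if tag.startswith("B-"):
            label, start = tag[2:], i
    return spans


def match_sequence(layers: Dict[str, List[str]],
                   pattern: List[Tuple[str, str]]) -> List[int]:
    """Return 0-based offsets where consecutive tokens satisfy `pattern`.

    `pattern` holds one (layer, value) constraint per token (assumed semantics).
    """
    n = len(layers["TERM"])
    hits = []
    for start in range(n - len(pattern) + 1):
        if all(layers[layer][start + k] == value
               for k, (layer, value) in enumerate(pattern)):
            hits.append(start)
    return hits


spans = bio_spans(LAYERS["WNSS"])
hits = match_sequence(LAYERS, [("DEP", "SUB"), ("LEMMA", "win")])
```

Keeping every layer as a list of equal length means a match found in one layer can be read off in any other layer at the same offset.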

Page 21:

Colored (Typed) Indices

Doc #3: The last time Peter exercised was in the XXth century.

Doc #5: Hope claims that in 1994 she run to Peter Town.

Term postings:
  Peter        D3:1, D5:9
  Town         D5:10
  Hope         D5:1
  1994         D5:5
  …

Tag postings:
  WSJ:PERSON   D3:1, D5:1
  WSJ:CITY     D5:9
  WNS:V_DATE   D5:5

Possible queries:
  “Peter ^ run”
  “Peter ^ WNS:N_DATE”
  “(WSJ:CITY:*) ^ run”
  “(WSJ:PERSON:Hope) ^ run”

(Bracketing can also be dealt with)
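A minimal sketch of a typed index, assuming that terms and tags share one postings dictionary and that `^` means document-level conjunction. The function names and the annotation tuples are hypothetical. Positions are computed by 1-based whitespace tokenisation of the two example sentences, so they need not coincide with the postings printed on the slide.

```python
from collections import defaultdict


def build_index(docs, annotations):
    """Build postings for plain terms, tags, and tag:value keys.

    docs: {doc_id: text}
    annotations: {doc_id: [(tag, start, end)]}, 1-based token positions,
                 end inclusive (an assumed input format).
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        tokens = text.rstrip(".").split()
        for pos, tok in enumerate(tokens, start=1):
            index[tok].append((doc_id, pos))
        for tag, start, end in annotations.get(doc_id, []):
            index[tag].append((doc_id, start))            # e.g. WSJ:CITY
            value = " ".join(tokens[start - 1:end])
            index[f"{tag}:{value}"].append((doc_id, start))  # e.g. WSJ:CITY:Peter Town
    return index


def docs_matching(index, *keys):
    """Return the ids of documents containing every key (conjunction)."""
    doc_sets = [{doc_id for doc_id, _ in index.get(key, [])} for key in keys]
    return set.intersection(*doc_sets) if doc_sets else set()


DOCS = {
    3: "The last time Peter exercised was in the XXth century.",
    5: "Hope claims that in 1994 she run to Peter Town.",
}
ANNOTATIONS = {
    3: [("WSJ:PERSON", 4, 4), ("WNS:DATE", 9, 10)],
    5: [("WSJ:PERSON", 1, 1), ("WNS:DATE", 5, 5), ("WSJ:CITY", 9, 10)],
}
index = build_index(DOCS, ANNOTATIONS)
```

With this layout a query such as `docs_matching(index, "WSJ:PERSON:Hope", "run")` mixes a typed key and a plain term without any special casing.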

Page 22:

Entity containment graphs

Doc #3: The last time Peter exercised was in the XXth century.
  – WSJ:PERSON: “Peter”
  – WNS:DATE: “XXth century”

Doc #5: Hope claims that in 1994 she run to Peter Town.
  – WSJ:PERSON: “Hope”
  – WSJ:CITY: “Peter Town”
  – WNS:DATE: “1994”

[Zaragoza et al. CIKM’08]

Page 23:

Entity Containment Graph

[Figure: diagram with the example query “Pablo Picasso and the Second World War” and the labels Search Engine, Sentences, Sentence to Entity Map]

Page 24:

Entity Ranking on the containment graph

• Indegree

• Weighted indegree
  – Sentence weight: similarity
  – Entity weight: idf
  – Link: strength

• Finite random walks

• (Personalised) PageRank
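The two simplest rankers listed above can be sketched as follows. This is an illustration under stated assumptions, not the method of the cited paper: the weighted variant here multiplies each retrieved sentence's similarity score by the entity's idf and sums over sentences, and all data and names (`rank_entities`, the toy sentences and counts) are invented for the example.

```python
import math
from collections import defaultdict


def rank_entities(sentence_entities, sentence_scores, doc_freq, n_docs):
    """Rank entities on a sentence-to-entity containment graph.

    sentence_entities: {sentence_id: [entity, ...]} edges of the graph
    sentence_scores:   {sentence_id: similarity to the query}
    doc_freq:          {entity: number of documents containing it}
    Returns (ranking by indegree, ranking by weighted indegree).
    """
    indegree = defaultdict(int)
    weighted = defaultdict(float)
    for sid, entities in sentence_entities.items():
        for ent in set(entities):  # count each sentence once per entity
            indegree[ent] += 1
            idf = math.log(n_docs / doc_freq[ent])
            weighted[ent] += sentence_scores[sid] * idf
    by_indegree = sorted(indegree, key=lambda e: (-indegree[e], e))
    by_weighted = sorted(weighted, key=lambda e: (-weighted[e], e))
    return by_indegree, by_weighted


# Toy data (hypothetical): three retrieved sentences and their entities.
SENTENCE_ENTITIES = {"s1": ["Guernica", "Spain"],
                     "s2": ["Guernica", "Paris"],
                     "s3": ["Spain"]}
SENTENCE_SCORES = {"s1": 0.9, "s2": 0.8, "s3": 0.1}
DOC_FREQ = {"Guernica": 10, "Spain": 1000, "Paris": 500}

by_indegree, by_weighted = rank_entities(
    SENTENCE_ENTITIES, SENTENCE_SCORES, DOC_FREQ, n_docs=10000)
```

In the toy data the idf term demotes the frequent entity "Spain" below "Paris" even though "Spain" appears in more retrieved sentences.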

Page 25:

Algorithm Evaluation

• Users judge quality of entities on queries they know well

• 50 queries (5 x 10 judges)

• SW0: 6.2K English Wikipedia entries.

• 28M unique entities, 5.5M occurrences.

Page 26:

Algorithm Evaluation, examples:

Page 27:

Domain Adaptation

• Are those NE models good for me?
  – Type of text
  – Type of Named Entities

• Can I improve them for my particular task?
  – Adding Gazetteers
  – Training a new model (needs training corpora)

Page 28:

Building Gazetteers

• The same Named Entity could have different types depending on the context
  – E.g. Barcelona as PERSON or ORGANIZATION
  – E.g. Aberdeen as Lake or City

• Thus, instead of imposing a unique type, we add the Gazetteer as features
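A minimal sketch of "gazetteers as features", assuming a feature-dictionary representation of each token as commonly fed to a sequence tagger. The function name, the feature naming scheme and the gazetteer contents are hypothetical; the point is only that an ambiguous name fires several gazetteer features at once and the learner, not the gazetteer, decides the type in context.

```python
def gazetteer_features(token, gazetteers):
    """Return a feature dict with one boolean feature per matching gazetteer.

    gazetteers: {gazetteer_name: set of entries} (assumed format).
    """
    features = {"word": token.lower()}
    for name, entries in sorted(gazetteers.items()):
        if token in entries:
            features[f"gaz={name}"] = True  # membership, not a hard label
    return features


# Hypothetical gazetteers built around the examples on the slide.
GAZETTEERS = {
    "CITY": {"Aberdeen", "Barcelona"},
    "LAKE": {"Aberdeen"},
    "ORGANIZATION": {"Barcelona"},
}

aberdeen = gazetteer_features("Aberdeen", GAZETTEERS)
```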


Page 31:

Learning to Tag, Tagging to Learn

Page 32:

Extending annotations

Pablo Picasso was born in Málaga, Spain.

Text span       CoNLL   Wikipedia property     Extended tag
Pablo Picasso   PER     artist:name            E:PERSON
Málaga          LOC     artist:placeofbirth    GPE:CITY
Spain           LOC     artist:placeofbirth    GPE:COUNTRY

[Figure: graph with nodes Pablo_Picasso, Málaga, Spain, artist, artist_placeofbirth, conll:PERSON and conll:LOCATION, and edges labelled wikiPageUsesTemplate, artist_placeofbirth, describes, type and range]

[Mika, Zaragoza, Ciaramita, Atserias. IEEE AI 2008]

Page 33:

Adaptation and improved tagging

• Generate training data based on what is learned
  – e.g. if Paris is a place of birth, it is also a place

• Dealing with sparse data
  – Paris is the capital city of France.

• Improvement
  – On news: +6.1%
  – On Wikipedia: +5.5%

Page 34:

(property typing examples)

Wikipedia property (CONLL type): examples of property values

– infoboxNbaPlayer_name (PER): Alex Groza, Elgin Baylor, Jerry West, David Thompson, Glen Rice, Christian Laettner, Richard Hamilton, Juan Dixon, Sean May, Joakim Noah, ...

– infoboxSerialKiller_alias (ORG): Gray Man, the Werewolf of Wysteria, Brooklyn Vampire, Sister, Brian Stewart, Bloody Benders, The Prostishooter, The Rest Stop Killer, The Truck Stop Killer, The Sunset Strip Killer, Cincinnati Strangler, Son of Sam, Plainfield Ghoul, Ed "Psycho" Gein, The Co-ed Killer, ...

– highlanderCharacter_born (LOC): Unknown, unknown, 1659, 1945, 802, 1887, 1950, , Glenfinnan, Scotland, (original birth date unknown) (''Highlander II''), 896 BC, Ancient Egypt (original birth date unknown) (''Highlander II''), California, ...

– infoboxSuperbowl_stadium (LOC): Sun Devil Stadium, Georgia Dome, Miami Orange Bowl, Hubert H. Humphrey Metrodome, Dolphin Stadium, Raymond James Stadium, Louisiana Superdome, Joe Robbie Stadium, Ford Field, Los Angeles Memorial Coliseum, ...

– infoboxWeapon_usedBy (LOC): USA, None, One, none, Italy, United States, Mexico, UK, Russia, Under development, ...

– minorLeagueTeam_league (MISC): Eastern League (1923-37, 1940-63, 1967-68, 1992- ), Pacific Coast League, Arizona League, Texas League, South Atlantic League, California League, Midwest League, Northwest League, International League, Carolina League, ...

– infoboxTea_teaOrigin (LOC): Nuwara Eliya, Sri Lanka near Adam's Peak between 2200 - 2500 metres, Japan, India, Vietnam, Taiwan, Turkey, China, Anhui, Guangdong, Jiangxi, ...

– infoboxPrimeMinister_name (PER): Abdallah El-Yafi, Umar al-Muntasir, Dr. Abdellatif Filali, Abderrahmane Youssoufi, Abdessalam Jalloud, Abdul Ati al-Obeidi, Abdul Hamid al-Bakkoush, Abdul Majid Kubar, Abdul Majid al-Qaud, Abd al-Qadir al-Badri, ...

Page 35:

Publicity Break ;)

Zaragoza et al. Semantically Annotated Snapshot of the English Wikipedia v.1 (SW1) [LREC'2008]

– 1,490,688 entries

– Available through the WebScope Program

– Open Source Technology

– Distribution (Licensed under GNU Free Documentation License)

Atserias et al. Viquipèdia Catalana Semànticament Etiquetada 1.0. Jornada del Processament Computacional del Català. 2009

– 159,716 entries (dump of 23/01/2009)

– Available at http://www.glicom.upf.edu/publicacions/recursos

– Open Source Technology

– Distribution (Licensed under GNU Free Documentation License)

Future releases (we already have a new dump of Wikipedia): improved tagging performance


Page 37:

Two and a half Applications showing NLP for IR

Correlator (NEs for new ways of browsing)

Quest (browsing Q&As)

Financial News, Correlator predicting the future

Page 38:

Correlator

• Live at http://correlator.sandbox.yahoo.net

• Launched in Feb. 2009

• Receives 50K queries per month.

Page 39:

Correlator

• English Wikipedia, Sep. 2008

• 53,408,510 sentences and 26,110,586 typed entities automatically tagged.

• Finds related people, organizations, locations, events, or general semantic concepts such as nationalities, materials, food, works of art, etc.

• Searches for concepts that are distributed across several entries, for instance: Dinosaurs in Argentina, Picasso and Peace.

Page 40:

Synthetic Page


Page 45:

Yahoo! Quest

• Public demo on Yahoo! Sandbox (Nov. 2009)

• Helps browsing Q&As

• Uses 4.5M question-answer pairs from the Yahoo! Answers site (distributed through Webscope)

• Try it at http://quest.sandbox.yahoo.net

Page 46:

Yahoo! Quest

Page 47:

Yahoo! Quest

• Preprocessing

– NLP pipeline

– Extract phrases

– Build a fast Index

• Online

– Retrieve documents

– Retrieve phrases

– Rank phrases

Page 48:

Extracting Key-phrases

• Non Phraseness

– be benefits

– What would

– benefits of

– the

• Phraseness

– benefits

– free college education

Page 49:

Extracting Key-phrases

• Non Informative

– be

– free

– (?) the benefits

– (?) the benefits of free college education

• Informative

– College education

– Free education

Page 50:

Phrase Extraction

Term extraction proceeds in two steps:

• extract the entire phrase under a head

• then extract all bigrams target-head

• clean up (stopword list)

Can we group the phrases?

• We distinguished different “types” of phrases e.g. nominal phrases or verbal ones

• We group phrases using the head's PoS and Dependency Label
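The extraction steps on this slide can be sketched over a dependency parse given as a list of head positions. Everything specific here is an assumption made for illustration: the hand-written parse of "the benefits of free college education", the tiny stopword list, the reading of "bigrams target-head" as dependent-head pairs, and the choice to trim stopwords only at phrase edges.

```python
STOPWORDS = {"the", "of", "be", "a", "what", "would"}  # toy list (assumed)


def subtree(heads, root):
    """Return the sorted 1-based positions of the subtree rooted at `root`."""
    nodes = {root}
    changed = True
    while changed:
        changed = False
        for pos, head in enumerate(heads, start=1):
            if head in nodes and pos not in nodes:
                nodes.add(pos)
                changed = True
    return sorted(nodes)


def extract_phrases(tokens, heads):
    """Extract candidate phrases; heads are 1-based, 0 marks the root."""
    candidates = []
    # Step 1: the entire phrase under every token that has dependents.
    for head in sorted(set(heads) - {0}):
        words = [tokens[pos - 1] for pos in subtree(heads, head)]
        # Clean-up: trim stopwords from both edges of the phrase.
        while words and words[0].lower() in STOPWORDS:
            words.pop(0)
        while words and words[-1].lower() in STOPWORDS:
            words.pop()
        if words:
            candidates.append(" ".join(words))
    # Step 2: dependent-head bigrams, skipping pairs with a stopword.
    for pos, head in enumerate(heads, start=1):
        if head == 0:
            continue
        dep_word, head_word = tokens[pos - 1], tokens[head - 1]
        if dep_word.lower() in STOPWORDS or head_word.lower() in STOPWORDS:
            continue
        pair = (dep_word, head_word) if pos < head else (head_word, dep_word)
        candidates.append(" ".join(pair))
    return list(dict.fromkeys(candidates))  # de-duplicate, keep order


TOKENS = ["the", "benefits", "of", "free", "college", "education"]
HEADS = [2, 0, 2, 6, 6, 3]  # hypothetical parse, not parser output
phrases = extract_phrases(TOKENS, HEADS)
```

On this toy parse the bigram step yields "free education" and "college education", the two phrases the earlier slide lists as informative.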

Page 51:

Grouping Phrases

Dep Label   PoS   Type
sbj         NN    N
root        VBP   V
amod        CD    Q
adv         RB    M
nmod        JJ    K
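The table above can be read as a lookup from the head's dependency label and PoS to a phrase type. The sketch below encodes only the five rows shown on the slide; the fallback type "?" and the sample phrases with their labels are assumptions added for illustration.

```python
from collections import defaultdict

# (dependency label, PoS of the head) -> phrase type, rows from the slide.
TYPE_BY_LABEL_AND_POS = {
    ("sbj", "NN"): "N",
    ("root", "VBP"): "V",
    ("amod", "CD"): "Q",
    ("adv", "RB"): "M",
    ("nmod", "JJ"): "K",
}


def group_phrases(phrases):
    """Group (text, dep_label, pos) triples by phrase type.

    Combinations missing from the table fall into the "?" group (assumed).
    """
    groups = defaultdict(list)
    for text, dep_label, pos in phrases:
        phrase_type = TYPE_BY_LABEL_AND_POS.get((dep_label, pos), "?")
        groups[phrase_type].append(text)
    return dict(groups)


# Hypothetical input: phrases paired with their head's label and PoS.
groups = group_phrases([
    ("market_crash", "sbj", "NN"),
    ("plan", "root", "VBP"),
    ("ever", "adv", "RB"),
    ("my_next_baby", "sbj", "NN"),
    ("many", "nmod", "JJ"),
])
```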

Page 52:

Phrase Types

Classify phrases by “type”:

– Noun (N): e.g. market_crash, my_next_baby

– Verb (V): e.g. plan, get_controller_instance

– Manner (M): e.g. ever, on_the_hill

– Question (Q): e.g. any_of_the_store, a, my

– Many (K): e.g. many, cute, very_good

New group: phrases containing parts of the query seem more relevant

Page 53:

Some Figures

Type    Phrases      Unique phrases
V       11,850,760     691,406
N       12,364,631   3,583,684
M        9,446,031   1,908,610
Q        5,355,686     170,205
K        2,618,146     319,577
Total   41,635,254   6,673,482

Page 54:

Many Open Issues

• Parsing adaptation for questions

• Key phrase extraction

• Key phrase ranking

• Evaluation Framework

Page 55:

Parsing Questions

• Verbose, complex and long questions

• Text is not always grammatical

• Text is noisy

• Few questions in the training data (WSJ Penn Treebank contains 3,553 questions, about 0.75%)

LOTS OF ERRORS! (79% LAS)

Page 56:

Parsing Questions

• No resources to re-label/re-train

• Not interested in the complete parse tree

• Transform questions into statements

• Using a set of about 70 simple patterns

– what kind of => this kind of

– Does anyone have => you have

• This resulted in significantly better parses and much better phrases.
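The pattern-based rewriting can be sketched with regular expressions. Only the two patterns shown on the slide are included, out of the roughly 70 it mentions; anchoring them at the start of the question, matching case-insensitively and stopping after the first match are assumptions of this sketch.

```python
import re

# (compiled pattern, replacement): the two rewrite rules shown on the slide.
PATTERNS = [
    (re.compile(r"^what kind of\b", re.IGNORECASE), "this kind of"),
    (re.compile(r"^does anyone have\b", re.IGNORECASE), "you have"),
]


def question_to_statement(question):
    """Rewrite a question prefix so it reads like a statement.

    The first matching pattern wins; unmatched questions are returned as-is.
    """
    text = question.strip()
    for pattern, replacement in PATTERNS:
        text, count = pattern.subn(replacement, text, count=1)
        if count:
            break
    return text


rewritten = question_to_statement("What kind of dog is good with kids?")
```

The rewritten text would then be passed to a parser trained mostly on statements, which is the motivation given on the slide.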

Page 57:

Parsing adaptation

We are currently pursuing research in parser adaptation for questions, and have publicly released a corrected PoS and dependency data set of questions from Yahoo! Answers to support this research (Atserias et al., 2010).

With only about 1,000 questions we reach similar performance on questions (87% LAS) without hurting the general results (86% LAS).
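LAS (labeled attachment score) is the fraction of tokens whose predicted head and dependency label both match the gold annotation. A minimal sketch follows; the function name and the deliberately wrong prediction are hypothetical, while the gold parse reuses the "The Tories won this election" example from the Data Structures slide.

```python
def labeled_attachment_score(gold, predicted):
    """Compute LAS over one or more sentences flattened into token lists.

    gold, predicted: lists of (head, label) pairs, one pair per token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


GOLD = [(2, "NMOD"), (3, "SUB"), (0, "VMOD"), (5, "NMOD"), (3, "OBJ")]
# Hypothetical parser output with one wrong label on the last token.
PREDICTED = [(2, "NMOD"), (3, "SUB"), (0, "VMOD"), (5, "NMOD"), (3, "NMOD")]

las = labeled_attachment_score(GOLD, PREDICTED)
```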


Page 59:

Living Knowledge

• European Project (7th Framework)

• http://livingknowledge-project.eu/

• WP7: Predicting the future

Page 60:

Living Knowledge

http://livingknowledge-project.eu/


Page 64:

Any questions?

Thank you for your attention