NLP Research at Yahoo! Barcelona
Jordi Atserias
Seminari UB 2010



http://labs.yahoo.com/Yahoo_Labs_Barcelona


Summary

1. Presentation of the Yahoo! Barcelona NLR group

2. Aims of the group

3. NLR Architecture

4. Domain Adaptation

5. Applications

1. Correlator

2. Quest

6. Living Knowledge (Ad)


Hugo Zaragoza

Ricardo Baeza-Yates

http://research.yahoo.com/Yahoo_Research_Barcelona


Natural Language Retrieval Group

Giuseppe Attardi (U. Pisa)

- Indexing (IXE) - Dependency Parsing

Hugo Zaragoza - ML, IR

“Every time I fire a linguist my performance goes up…” (Fred Jelinek)

Great strategy until you’ve fired them all… but what then?

Michael Matthews - NLP

Jordi Atserias

- NLP, Architecture

Massimiliano Ciaramita, Mihai Surdeanu

Roi Blanco

- IR

Sebastiano Vigna (U. Milan), Paolo Boldi

- Indexing (MG4J)

Peter Mika

- Semantic Web


NLR: NLP for IR

• Shallow NLP processing as additional information for the Information Retrieval task

• Explore new ways of browsing

• Support your answers

• Hypotheses/Requirements:
  - linear extraction/parsing time (50K w x m x s)
  - error-prone output (e.g. 60-90%)
  - highly redundant information


Support your answers!

• Humans need to "verify" unknown facts: multiple sources of evidence; common sense vs. contradictions (are you sure? is this spam? interesting!)

• Their tolerance to errors greatly increases if they can verify things fast: hence the importance of snippets and image search

• Often the context is as important as the fact, e.g. "S discovered the transistor in X in 19X."

• There are different kinds of errors: a ridiculous result (decreases overall confidence in the system), a reasonably wrong result (makes us feel good), a partial result, etc.


New ways of Browsing…


Open canvas for HTML


MicroSearch, example query: ivan herman (©Peter Mika)

[Screenshot callouts: related pages based on metadata; conferences he plans to attend and other events from his homepage, plus bio events from LinkedIn; geolocation.]


MicroSearch, example query: san francisco conference (©Peter Mika)

[Screenshot callout: conferences in San Francisco, with dates.]


What can we do with automatic annotations?

Taxonomizing the world is… hopeless?

• Closed domains (deep specialist tags): Enzyme_name, battery_life, diagnosis, etc.

• Open domains (shallow generalist tags): Person/Organization Name, Location, Time…


NLR Architecture in Action

[Architecture diagram] The corpus is fed through the NLP Pipeline, which produces a Forward Index and a Tag Graph. The Index backs the Search Engine, and the Tag Graph backs the Graph Engine (Java, WebGraph); together they make up the NLR Search Engine, exposed through RMI & REST APIs (C++ / Java / REST) to your killer application!


NLP Pipeline

[Diagram of pipeline components:]
• Segmentation, tokenisation
• PoS tagging
• Word semantics / named entities (SuperSense Tagger)
• Dependency parser
• Anaphora resolution
• Corpus adaptation
• Gazetteers, SRL
• Sentiment analysis
• Multilingual support
• Pipeline service ("hadoopize")
• Code, documentation & support
• Open source

Examples of NLP tech

[Screenshot slides: Semantic Tagging; Dependency Parsing]

Data Structures

• Multiple overlapping bracketed strings (BIO format): tokens, matches, deprels, e.g. [ DEP/SUB:* LEMMA/win:* ]

TERM  | The  | Tories        | won                | this | election
POS   | DT   | NNPS          | VBD                | DT   | NN
LEMMA | the  | tory          | win                | this | election
WNSS  | 0    | B-noun.person | B-verb.competition | 0    | B-noun.act
HEAD  | 2    | 3             | 0                  | 5    | 3
DEP   | NMOD | SUB           | VMOD               | NMOD | OBJ

More semantic tags: WSJ, CONLL, …

Tags can be used in retrieval:


Colored (Typed) Indices

Doc #3: The last time Peter exercised was in the XXth century.
Doc #5: Hope claims that in 1994 she ran to Peter Town.

Term postings: Peter → D3:1, D5:9; Town → D5:10; Hope → D5:1; 1994 → D5:5; …
Tag postings: WSJ:PERSON → D3:1, D5:1; WSJ:CITY → D5:9; WNS:V_DATE → D5:5

Possible queries: "Peter ^ run", "Peter ^ WNS:N_DATE", "(WSJ:CITY:*) ^ run", "(WSJ:PERSON:Hope) ^ run"

(Bracketing can also be dealt with.)
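A toy version of such a typed index can be sketched as ordinary posting lists in which semantic tags are posted at the same positions as the terms they annotate; this is only an illustration of the idea, not the group's actual C++/Java index:

```python
# Toy typed ("colored") inverted index: semantic tags are posted at the
# same positions as the surface terms they annotate, so term and tag
# postings can be intersected freely. Illustrative only.
from collections import defaultdict

index = defaultdict(set)  # key: term or tag, value: {(doc, position)}

def add(doc_id, position, term, tags=()):
    index[term].add((doc_id, position))
    for tag in tags:
        index[tag].add((doc_id, position))
        # also post a wildcard entry so "WSJ:CITY:*"-style queries work
        index[tag.rsplit(":", 1)[0] + ":*"].add((doc_id, position))

# Postings from the slide's two example documents:
add(5, 1, "Hope", ["WSJ:PERSON:Hope"])
add(5, 5, "1994", ["WNS:V_DATE:1994"])
add(5, 9, "Peter", ["WSJ:CITY:Peter_Town"])
add(5, 10, "Town", ["WSJ:CITY:Peter_Town"])
add(3, 1, "Peter", ["WSJ:PERSON:Peter"])

def docs(key):
    return {doc for doc, _ in index[key]}

# "(WSJ:CITY:*) ^ 1994": documents containing any city and the term 1994
print(docs("WSJ:CITY:*") & docs("1994"))  # {5}
```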


Entity containment graphs

Doc #3: The last time Peter exercised was in the XXth century.
Doc #5: Hope claims that in 1994 she ran to Peter Town.

[Graph: Doc #3 contains WSJ:PERSON "Peter" and WNS:DATE "XXth century"; Doc #5 contains WSJ:PERSON "Hope", WNS:DATE "1994" and WSJ:CITY "Peter Town".]

[Zaragoza et al., CIKM'08]


Entity Containment Graph

Query "Pablo Picasso and the Second World War" → Search Engine → sentences → sentence-to-entity map


Entity Ranking on the containment graph

• Indegree / weighted indegree (sentence weight: similarity; entity weight: idf; link: strength)
• Finite random walks
• (Personalised) PageRank
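As a rough illustration of the simplest of these rankers, weighted indegree over the bipartite sentence-entity graph might look as follows; the sentences, entities and weights are invented for the example:

```python
# Toy weighted-indegree entity ranking over a bipartite
# sentence-to-entity containment graph. The weights stand in for the
# slide's "sentence similarity", "entity idf" and "link strength".
from collections import defaultdict

# (sentence_id, entity, link_strength)
edges = [
    ("s1", "Guernica", 1.0),
    ("s1", "Second World War", 1.0),
    ("s2", "Guernica", 0.5),
    ("s3", "Dora Maar", 1.0),
]
sentence_weight = {"s1": 0.9, "s2": 0.6, "s3": 0.4}   # query similarity
entity_idf = {"Guernica": 2.0, "Second World War": 1.2, "Dora Maar": 1.5}

score = defaultdict(float)
for sent, entity, strength in edges:
    score[entity] += sentence_weight[sent] * strength * entity_idf[entity]

for entity, s in sorted(score.items(), key=lambda kv: -kv[1]):
    print(f"{entity}: {s:.2f}")
# Guernica ranks first: it is reached from two sentences,
# one of them highly similar to the query.
```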


Algorithm Evaluation

• Users judge the quality of entities on queries they know well: 50 queries (5x10 judges)
• SW0: 6.2K English Wikipedia entries; 28M unique entities, 5.5M occurrences


Algorithm Evaluation, examples: [screenshots]


Domain Adaptation

Are those NE models good for me?
• Type of text
• Type of named entities

Can I improve them for my particular task?
• Adding gazetteers
• Training a new model (needs training corpora)


Building Gazetteers

The same named entity can have different types depending on the context:
• e.g. Barcelona as PERSON or ORGANIZATION
• e.g. Aberdeen as lake or city

Thus, instead of imposing a unique type, we add the gazetteers as features.
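A minimal sketch of this idea, with invented gazetteer contents and feature names: membership becomes one feature per gazetteer rather than a hard type assignment, and the trained tagger weighs the evidence:

```python
# Gazetteer entries are added as features, not imposed as types: a
# token found in several gazetteers fires several features, and the
# trained sequence tagger decides. Contents and names are illustrative.
GAZETTEERS = {
    "gaz:PERSON":       {"barcelona", "hope"},
    "gaz:ORGANIZATION": {"barcelona"},
    "gaz:CITY":         {"barcelona", "aberdeen"},
    "gaz:LAKE":         {"aberdeen"},
}

def token_features(token):
    feats = {"lower": token.lower()}
    for name, entries in GAZETTEERS.items():
        if token.lower() in entries:
            feats[name] = True  # ambiguity is kept as multiple features
    return feats

print(token_features("Barcelona"))
# {'lower': 'barcelona', 'gaz:PERSON': True,
#  'gaz:ORGANIZATION': True, 'gaz:CITY': True}
```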


Learning to Tag, Tagging to Learn


Extending annotations

Pablo Picasso was born in Málaga, Spain.
PER: "Pablo Picasso" (artist:name); LOC: "Málaga" (artist:placeofbirth); LOC: "Spain" (artist:placeofbirth)

[RDF diagram: the page described by Pablo_Picasso uses the "artist" infobox template (wikiPageUsesTemplate); its artist_placeofbirth properties point to Málaga and Spain; the property ranges carry CONLL types (conll:PERSON for artist_name, conll:LOCATION for artist_placeofbirth) and finer types E:PERSON, GPE:CITY, GPE:COUNTRY.]

[Mika, Zaragoza, Ciaramita, Atserias. IEEE Intelligent Systems, 2008]


Adaptation and improved tagging

• Generate training data based on what is learned, i.e. if Paris appears as a place of birth, then it is also a place
• Helps deal with sparse data: "Paris is the capital city of France."
• Improvement: on news +6.1%, on Wikipedia +5.5%


(Property typing examples)

Wikipedia property        | CONLL | Examples of property values
infoboxNbaPlayer_name     | PER   | Alex Groza, Elgin Baylor, Jerry West, David Thompson, Glen Rice, Christian Laettner, Richard Hamilton, Juan Dixon, Sean May, Joakim Noah, ...
infoboxSerialKiller_alias | ORG   | Gray Man, the Werewolf of Wysteria, Brooklyn Vampire, Sister, Brian Stewart, Bloody Benders, The Prostishooter, The Rest Stop Killer, The Truck Stop Killer, The Sunset Strip Killer, Cincinnati Strangler, Son of Sam, Plainfield Ghoul, Ed "Psycho" Gein, The Co-ed Killer, ...
highlanderCharacter_born  | LOC   | Unknown, unknown, 1659, 1945, 802, 1887, 1950, , Glenfinnan, Scotland, (original birth date unknown) (''Highlander II''), 896 BC, Ancient Egypt (original birth date unknown) (''Highlander II''), California, ...
infoboxSuperbowl_stadium  | LOC   | Sun Devil Stadium, Georgia Dome, Miami Orange Bowl, Hubert H. Humphrey Metrodome, Dolphin Stadium, Raymond James Stadium, Louisiana Superdome, Joe Robbie Stadium, Ford Field, Los Angeles Memorial Coliseum, ...
infoboxWeapon_usedBy      | LOC   | USA, None, One, none, Italy, United States, Mexico, UK, Russia, Under development, ...
minorLeagueTeam_league    | MISC  | Eastern League (1923-37, 1940-63, 1967-68, 1992- ), Pacific Coast League, Arizona League, Texas League, South Atlantic League, California League, Midwest League, Northwest League, International League, Carolina League, ...
infoboxTea_teaOrigin      | LOC   | Nuwara Eliya, Sri Lanka near Adam's Peak between 2200 - 2500 metres, Japan, India, Vietnam, Taiwan, Turkey, China, Anhui, Guangdong, Jiangxi, ...
infoboxPrimeMinister_name | PER   | Abdallah El-Yafi, Umar al-Muntasir, Dr. Abdellatif Filali, Abderrahmane Youssoufi, Abdessalam Jalloud, Abdul Ati al-Obeidi, Abdul Hamid al-Bakkoush, Abdul Majid Kubar, Abdul Majid al-Qaud, Abd al-Qadir al-Badri, ...
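The mapping above can be turned into extra ("silver") training data roughly as follows; this is a sketch that assumes property values can be matched back to page text with naive string search, which is much cruder than the actual pipeline described in the paper:

```python
# Sketch: project infobox-property types onto raw sentences to
# generate extra NE training data. The property-to-CONLL mapping
# follows the table above; the naive string matching is for
# illustration only.
import re

PROPERTY_TO_CONLL = {
    "infoboxNbaPlayer_name": "PER",
    "infoboxSuperbowl_stadium": "LOC",
    "infoboxPrimeMinister_name": "PER",
}

def silver_annotations(sentence, property_values):
    """property_values: list of (property, value) pairs for one page."""
    annotations = []
    for prop, value in property_values:
        label = PROPERTY_TO_CONLL.get(prop)
        if label:
            for m in re.finditer(re.escape(value), sentence):
                annotations.append((m.start(), m.end(), label))
    return annotations

sent = "Elgin Baylor spent his whole career with the Lakers."
print(silver_annotations(sent, [("infoboxNbaPlayer_name", "Elgin Baylor")]))
# [(0, 12, 'PER')]
```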


Publicity Break ;)

Zaragoza et al. Semantically Annotated Snapshot of the English Wikipedia v.1 (SW1). [LREC 2008]
– 1,490,688 entries
– Available through the Webscope program
– Open source technology
– Distribution licensed under the GNU Free Documentation License

Atserias et al. Viquipèdia Catalana Semànticament Etiquetada 1.0 [Semantically Annotated Catalan Wikipedia]. Jornada del Processament Computacional del Català, 2009.
– 159,716 entries (dump of 23/01/2009)
– Available at http://www.glicom.upf.edu/publicacions/recursos
– Open source technology
– Distribution licensed under the GNU Free Documentation License

Future releases (we already have a new Wikipedia dump): improved tagging performance.


Two and a half Applications showing NLP for IR

Correlator (NEs for new ways of browsing)

Quest (browsing Q&As)

Financial News, Correlator predicting the future


Correlator

• Live at http://correlator.sandbox.yahoo.net
• Launched in Feb. 2009
• Receives 50K queries per month


• English Wikipedia, Sep. 2008: 53,408,510 sentences and 26,110,586 typed entities, automatically tagged
• Finds related people, organizations, locations, events, or general semantic concepts such as nationalities, materials, food, works of art, etc.
• Searches for concepts that are distributed across several entries, for instance: dinosaurs in Argentina, Picasso and peace


Synthetic Page [screenshot slides]


Yahoo! Quest

• Public demo on Yahoo! Sandbox (Nov. 2009)

• Helps in browsing Q&As

• Uses 4.5M question-answer pairs from the Yahoo! Answers site (distributed through Webscope)

• Try it at http://quest.sandbox.yahoo.net


Yahoo! Quest

• Preprocessing
  – NLP pipeline
  – Extract phrases
  – Build a fast index

• Online
  – Retrieve documents
  – Retrieve phrases
  – Rank phrases


Extracting Key-phrases

• Non-phraseness: be benefits; What would; benefits of; the

• Phraseness: benefits; free college education


Extracting Key-phrases

• Non-informative: be; free; (?) the benefits; (?) the benefits of free college education

• Informative: college education; free education


Phrase Extraction

Term extraction proceeds in two steps, plus clean-up:
• extract the entire phrase under a head
• then extract all target-head bigrams
• clean up (stopword list)

Can we group the phrases?
• We distinguish different "types" of phrases, e.g. nominal phrases or verbal ones
• We group phrases using the head's PoS and dependency label
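A toy rendering of the two extraction steps on a hand-built dependency parse (the parse encoding and stopword list are invented for the example):

```python
# Sketch of the two extraction steps on a toy dependency parse:
# (1) the entire phrase under a head, (2) all target-head bigrams,
# then stopword clean-up. The parse representation is illustrative.
STOPWORDS = {"the", "of", "a", "be", "what", "would"}

# (index, token, head_index) with 0 = root; a parse of
# "the benefits of free college education"
parse = [
    (1, "the", 2), (2, "benefits", 0), (3, "of", 2),
    (4, "free", 6), (5, "college", 6), (6, "education", 3),
]

def subtree(head_idx):
    """Step 1: all tokens under (and including) a head, in order."""
    keep = {head_idx}
    changed = True
    while changed:
        changed = False
        for idx, _, head in parse:
            if head in keep and idx not in keep:
                keep.add(idx)
                changed = True
    return [tok for idx, tok, _ in parse if idx in keep]

def bigrams():
    """Step 2: all target-head pairs, skipping stopwords."""
    tokens = {idx: tok for idx, tok, _ in parse}
    return [(tok, tokens[head]) for idx, tok, head in parse
            if head != 0 and tok not in STOPWORDS
            and tokens[head] not in STOPWORDS]

print(" ".join(subtree(6)))  # free college education
print(bigrams())  # [('free', 'education'), ('college', 'education')]
```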


Grouping Phrases

Dep label | PoS | Type
sbj       | NN  | N
root      | VBP | V
amod      | CD  | Q
adv       | RB  | M
nmod      | JJ  | K


Phrase Types

Classify phrases by "type":
– Noun (N): e.g. market_crash, my_next_baby
– Verb (V): e.g. plan, get_controller_instance
– Manner (M): e.g. ever, on_the_hill
– Question (Q): e.g. any_of_the_store, a, my
– Many (K): e.g. many, cute, very_good

New group: phrases containing parts of the query seem more relevant.


Some Figures

Type  | Phrases    | Unique phrases
V     | 11,850,760 | 691,406
N     | 12,364,631 | 3,583,684
M     | 9,446,031  | 1,908,610
Q     | 5,355,686  | 170,205
K     | 2,618,146  | 319,577
Total | 41,635,254 | 6,673,482


Many Open Issues

• Parsing adaptation for questions

• Key phrase extraction

• Key phrase ranking

• Evaluation Framework


Parsing Questions

• Verbose, complex and long questions

• Text is not always grammatical

• Text is noisy

• Few questions in the training data (the WSJ Penn Treebank contains 3,553 questions, about 0.75%)

LOTS OF ERRORS! (79% LAS)


Parsing Questions

• No resources to re-label/re-train

• Not interested in the complete parse tree

• Transform questions into statements, using a set of about 70 simple patterns:

  – what kind of => this kind of

  – Does anyone have => you have

• This resulted in significantly better parses and much better phrases.
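A sketch of this rewriting with the slide's two example patterns (the real system uses about 70):

```python
# Sketch of pattern-based question-to-statement rewriting. Only the
# two patterns shown on the slide are included here; the real system
# uses about 70 such patterns.
import re

PATTERNS = [
    (re.compile(r"^what kind of\b", re.I), "this kind of"),
    (re.compile(r"^does anyone have\b", re.I), "you have"),
]

def to_statement(question):
    for pattern, replacement in PATTERNS:
        if pattern.search(question):
            return pattern.sub(replacement, question).rstrip("?") + "."
    return question  # leave unmatched questions untouched

print(to_statement("Does anyone have a good recipe?"))
# you have a good recipe.
```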


Parsing adaptation

We are currently pursuing research in parser adaptation for questions, and have publicly released a corrected PoS and dependency data set of questions from Yahoo! Answers to support this research (Atserias et al., 2010).

With only about 1,000 questions we reach similar performance on questions (87% LAS) without hurting the general results (86% LAS).


Living Knowledge

• European Project (7th Framework)

• http://livingknowledge-project.eu/

• WP7: Predicting the future



Any questions?

Thank you for your attention!