Transcript of: Seminari UB 2010, "NLP Research at Yahoo! Barcelona", Jordi Atserias.
1 Seminari UB 2010
NLP Research at Yahoo! Barcelona
Jordi Atserias
http://labs.yahoo.com/Yahoo_Labs_Barcelona
2 Seminari UB 2010
Summary
1. Presentation of the Yahoo! Barcelona NLR group
2. Aims of the group
3. NLR Architecture
4. Domain Adaptation
5. Applications
1. Correlator
2. Quest
6. Living Knowledge (Ad)
3 Seminari UB 2010
Hugo Zaragoza
Ricardo Baeza
http://research.yahoo.com/Yahoo_Research_Barcelona
4 Seminari UB 2010
Natural Language Retrieval Group
Giuseppe Attardi (U. Pisa): Indexing (IXE), Dependency Parsing
Hugo Zaragoza: ML, IR
“Every time I fire a linguist my performance goes up…” (Fred Jelinek)
Great strategy until you’ve fired them all… but what then?
Michael Matthews: NLP
Jordi Atserias: NLP, Architecture
Massimiliano Ciaramita, Mihai Surdeanu
Roi Blanco: IR
Sebastiano Vigna (U. Milan), Paolo Boldi: Indexing (MG4J)
Peter Mika: Semantic Web
5 Seminari UB 2010
Summary
1. Presentation of the Yahoo! Barcelona NLR group
2. Aims of the group
3. NLR Architecture
4. Domain Adaptation
5. Applications
1. Correlator
2. Quest
6. Living Knowledge (Ad)
6 Seminari UB 2010
NLR: NLP for IR
• Shallow NLP processing as additional information for Information Retrieval tasks
• Explore new ways of browsing
• Support your answers
• Hypotheses/Requirements: linear extraction/parsing time (50K w x m x s); error-prone output (e.g. 60-90%); highly redundant information
7 Seminari UB 2010
Support your answers!
Humans need to “verify” unknown facts: multiple sources of evidence; common sense vs. contradictions (are you sure? is this spam? Interesting!)
Their tolerance to errors greatly increases if they can verify things fast: hence the importance of snippets and image search.
Often the context is as important as the fact, e.g. “S discovered the transistor in X in 19X.”
There are different kinds of errors: a ridiculous result (decreases overall confidence in the system), a reasonably wrong result (makes us feel good), a partial result, etc.
8 Seminari UB 2010
New ways of Browsing…
9 Seminari UB 2010
Open canvas for HTML
10 Seminari UB 2010
MicroSearch, example: ivan herman (query: ivan herman)
[Screenshot: related pages based on metadata; conferences he plans to attend and other events from his homepage, plus bio events from LinkedIn; geolocation]
©Peter Mika
11 Seminari UB 2010
MicroSearch, example: san francisco conference
[Screenshot: conferences in San Francisco, by date]
©Peter Mika
12 Seminari UB 2010
What can we do with automatic annotations?
Taxonomizing the world is… hopeless?
Closed domains (deep specialist tags): enzyme_name, battery_life, diagnosis, etc.
Open domains (shallow generalist tags): Person/Organization Name, Location, Time…
13 Seminari UB 2010
Summary
1. Presentation of the Yahoo! Barcelona NLR group
2. Aims of the group
3. NLR Architecture
4. Domain Adaptation
5. Applications
1. Correlator
2. Quest
6. Living Knowledge (Ad)
14 Seminari UB 2010
NLR Architecture in Action
[Architecture diagram] Corpus → NLP Pipeline → Forward Index, Tag Graph, Index → Search Engine and Graph Engine (Java, WebGraph) → NLR Search Engine, exposed through RMI & REST APIs → your killer application! (C++ / Java / REST APIs)
15 Seminari UB 2010
NLP Pipeline
Segmentation, Tokenisation; POS; Word Semantics; Named Entities; SuperSense Tagger; Dependency Parser; Anaphora Resolution; SRL; Sentiment Analysis; Gazetteers; Corpus Adaptation; Multilingual Support; Pipeline Service ("hadoopize"); Code Documentation & Support; Open Source
16 Seminari UB 2010
Example of NLP tech.
17 Seminari UB 2010
Semantic Tagging
18 Seminari UB 2010
Semantic Tagging
19 Seminari UB 2010
Dependency parsing
20 Seminari UB 2010
Data Structures
• Multiple overlapping bracketed strings (BIO format), e.g. for the sentence "The Tories won this election":

TERM   | The  | Tories        | won                | this | election
POS    | DT   | NNPS          | VBD                | DT   | NN
LEMMA  | the  | tory          | win                | this | election
WNSS   | 0    | B-noun.person | B-verb.competition | 0    | B-noun.act
HEAD   | 2    | 3             | 0                  | 5    | 3
DEPREL | NMOD | SUB           | VMOD               | NMOD | OBJ

More semantic tag sets: WSJ, CoNLL, …
Tags can then be used in retrieval, e.g. the query [ DEP/SUB:* LEMMA/win:* ] matches tokens by dependency label and lemma; a sketch of this token-aligned representation follows.
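In code, these overlapping layers are simply token-aligned arrays, one per annotation type. A minimal Java sketch using the sentence above (hypothetical class, not the production IXE/MG4J structures; the real query language is much richer than this conjunctive check):

```java
import java.util.*;

/** Token-aligned annotation layers for one sentence (BIO format). */
public class AnnotatedSentence {
    final String[] terms, pos, lemma, wnss, dep;
    final int[] head; // 1-based index of the syntactic head, 0 = root

    AnnotatedSentence(String[] terms, String[] pos, String[] lemma,
                      String[] wnss, int[] head, String[] dep) {
        this.terms = terms; this.pos = pos; this.lemma = lemma;
        this.wnss = wnss; this.head = head; this.dep = dep;
    }

    /** Does the sentence match a conjunctive pattern like [ DEP/SUB:* LEMMA/win:* ]? */
    boolean matches(String depLabel, String lemmaValue) {
        boolean hasDep = false, hasLemma = false;
        for (int i = 0; i < terms.length; i++) {
            if (dep[i].equals(depLabel)) hasDep = true;
            if (lemma[i].equals(lemmaValue)) hasLemma = true;
        }
        return hasDep && hasLemma;
    }

    public static void main(String[] args) {
        AnnotatedSentence s = new AnnotatedSentence(
            new String[]{"The", "Tories", "won", "this", "election"},
            new String[]{"DT", "NNPS", "VBD", "DT", "NN"},
            new String[]{"the", "tory", "win", "this", "election"},
            new String[]{"0", "B-noun.person", "B-verb.competition", "0", "B-noun.act"},
            new int[]{2, 3, 0, 5, 3},
            new String[]{"NMOD", "SUB", "VMOD", "NMOD", "OBJ"});
        System.out.println(s.matches("SUB", "win")); // true
    }
}
```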
21 Seminari UB 2010
Colored (Typed) Indices
Doc #3: The last time Peter exercised was in the XXth century.
Doc #5: Hope claims that in 1994 she run to Peter Town.
Term postings: Peter → D3:1, D5:9; Town → D5:10; Hope → D5:1; 1994 → D5:5; …
Type postings: WSJ:PERSON → D3:1, D5:1; WSJ:CITY → D5:9; WNS:N_DATE → D5:5
Possible queries: “Peter ^ run”, “Peter ^ WNS:N_DATE”, “(WSJ:CITY:*) ^ run”, “(WSJ:PERSON:Hope) ^ run”
(Bracketing can also be dealt with.)
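A toy version of such a typed index: plain terms and TYPE:value tags share one posting-list map, and a conjunctive query intersects document sets. This is only a sketch (the real engines are IXE and MG4J, and positions would also support proximity operators):

```java
import java.util.*;

/** Toy "colored" (typed) index: terms and TYPE:value tags share one posting map. */
public class TypedIndex {
    // key -> docId -> token positions
    final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    void add(String key, int doc, int pos) {
        postings.computeIfAbsent(key, k -> new HashMap<>())
                .computeIfAbsent(doc, d -> new ArrayList<>()).add(pos);
    }

    /** Documents containing both keys ("a ^ b"). */
    Set<Integer> and(String a, String b) {
        Set<Integer> docs = new HashSet<>(postings.getOrDefault(a, Map.of()).keySet());
        docs.retainAll(postings.getOrDefault(b, Map.of()).keySet());
        return docs;
    }

    public static void main(String[] args) {
        TypedIndex ix = new TypedIndex();
        // Postings from the slide (plus "run", implied by Doc #5's text)
        ix.add("Peter", 3, 1);      ix.add("Peter", 5, 9);
        ix.add("Town", 5, 10);      ix.add("Hope", 5, 1);
        ix.add("1994", 5, 5);       ix.add("run", 5, 7);
        ix.add("WSJ:PERSON", 3, 1); ix.add("WSJ:PERSON", 5, 1);
        ix.add("WSJ:CITY", 5, 9);   ix.add("WNS:N_DATE", 5, 5);
        System.out.println(ix.and("Peter", "run"));        // [5]
        System.out.println(ix.and("Peter", "WNS:N_DATE")); // [5]
    }
}
```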
22 Seminari UB 2010
Entity containment graphs
Doc #3: The last time Peter exercised was in the XXth century.
Doc #5: Hope claims that in 1994 she run to Peter Town.
[Graph: documents #3, #5, … linked to the entities they contain: WSJ:PERSON “Peter”; WSJ:PERSON “Hope”; WSJ:CITY “Peter Town”; WNS:DATE “XXth century”; WNS:DATE “1994”]
[Zaragoza et al. CIKM’08]
23 Seminari UB 2010
Entity Containment Graph
Example query: Pablo Picasso and the Second World War
Search Engine → Sentences → Sentence-to-Entity Map
24 Seminari UB 2010
Entity Ranking on the containment graph
• Indegree / Weighted Indegree (sentence weight: similarity; entity weight: idf; link: strength)
• Finite Random Walks
• (Personalised) PageRank
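A sketch of the weighted-indegree variant: every retrieved sentence votes for the entities it contains, with the vote scaled by the sentence's query similarity and the entity's idf. The numbers below are made up, and this is a simplified reading of the scheme above (with personalised PageRank over the same bipartite graph as the natural refinement):

```java
import java.util.*;

/** Weighted indegree over a bipartite sentence -> entity containment graph. */
public class EntityRanker {
    public static void main(String[] args) {
        // sentence weight = retrieval similarity to the query (toy values)
        Map<String, Double> simOfSentence = Map.of("s1", 0.9, "s2", 0.4);
        // entity weight = idf (toy values)
        Map<String, Double> idfOfEntity =
            Map.of("Hope", 2.0, "Peter Town", 3.5, "1994", 0.5);
        // containment links: sentence -> entities it mentions
        Map<String, List<String>> contains = Map.of(
            "s1", List.of("Hope", "1994", "Peter Town"),
            "s2", List.of("Hope"));

        // score(entity) = sum over containing sentences of sim(sentence) * idf(entity)
        Map<String, Double> score = new HashMap<>();
        for (var e : contains.entrySet())
            for (String entity : e.getValue())
                score.merge(entity,
                    simOfSentence.get(e.getKey()) * idfOfEntity.get(entity),
                    Double::sum);

        score.entrySet().stream()
             .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
             .forEach(en -> System.out.println(en.getKey() + "\t" + en.getValue()));
        // Peter Town 3.15, Hope 2.6, 1994 0.45
    }
}
```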
25 Seminari UB 2010
Algorithm Evaluation
Users judge the quality of entities on queries they know well: 50 queries (5 x 10 judges).
SW0: 6.2K English Wikipedia entries; 28M unique entities, 5.5M occurrences.
26 Seminari UB 2010
Algorithm Evaluation, examples:
27 Seminari UB 2010
Domain Adaptation
Are those NE models good for me? (type of text, type of Named Entities)
Can I improve them for my particular task? Adding gazetteers; training a new model (needs training corpora).
28 Seminari UB 2010
Building Gazetteers
The same Named Entity can have different types depending on the context, e.g. Barcelona as PERSON or ORGANIZATION, Aberdeen as Lake or City.
Thus, instead of imposing a unique type, we add the gazetteer entries as features.
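A minimal sketch of the idea (hypothetical gazetteer names and entries): the feature extractor emits one feature per gazetteer containing the token, and the trained NE model weighs these against contextual evidence instead of being forced into a single type:

```java
import java.util.*;

/** Emit gazetteer membership as features rather than as a forced type. */
public class GazetteerFeatures {
    static final Map<String, Set<String>> GAZETTEERS = Map.of(
        "GAZ:CITY",   Set.of("barcelona", "aberdeen"),
        "GAZ:PERSON", Set.of("barcelona"),   // e.g. the footballer
        "GAZ:LAKE",   Set.of("aberdeen"));

    static List<String> features(String token) {
        List<String> feats = new ArrayList<>();
        String t = token.toLowerCase();
        for (var g : GAZETTEERS.entrySet())
            if (g.getValue().contains(t)) feats.add(g.getKey());
        return feats; // the NE model combines these with contextual features
    }

    public static void main(String[] args) {
        // ambiguous tokens get several features (order may vary)
        System.out.println(features("Barcelona")); // e.g. [GAZ:CITY, GAZ:PERSON]
        System.out.println(features("Aberdeen"));  // e.g. [GAZ:CITY, GAZ:LAKE]
    }
}
```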
29 Seminari UB 2010
30 Seminari UB 2010
31 Seminari UB 2010
Learning to Tag, Tagging to Learn
32 Seminari UB 2010
Extending annotations
Pablo Picasso was born in Málaga, Spain. (PER: Pablo Picasso; LOC: Málaga; LOC: Spain)
(artist:name; artist:placeofbirth; artist:placeofbirth)
[Diagram: the page describes Pablo_Picasso, which uses the artist template (wikiPageUsesTemplate); the artist_placeofbirth property links Pablo_Picasso to Málaga and Spain and has range conll:LOCATION; the name property has type conll:PERSON; finer-grained types: E:PERSON, GPE:CITY, GPE:COUNTRY]
[Mika, Zaragoza, Ciaramita, Atserias. IEEE AI 2008]
33 Seminari UB 2010
Adaptation and improved tagging
Generate training data based on what is learned, i.e. if Paris is a place of birth, it is also a place.
Dealing with sparse data: “Paris is the capital city of France.”
Improvement: on news +6.1%, on Wikipedia +5.5%
34 Seminari UB 2010
(property typing examples)

Wikipedia property        | CoNLL | Examples of property values
infoboxNbaPlayer_name     | PER   | Alex Groza, Elgin Baylor, Jerry West, David Thompson, Glen Rice, Christian Laettner, Richard Hamilton, Juan Dixon, Sean May, Joakim Noah, ...
infoboxSerialKiller_alias | ORG   | Gray Man, the Werewolf of Wysteria, Brooklyn Vampire, Sister, Brian Stewart, Bloody Benders, The Prostishooter, The Rest Stop Killer, The Truck Stop Killer, The Sunset Strip Killer, Cincinnati Strangler, Son of Sam, Plainfield Ghoul, Ed "Psycho" Gein, The Co-ed Killer, ...
highlanderCharacter_born  | LOC   | Unknown, unknown, 1659, 1945, 802, 1887, 1950, , Glenfinnan, Scotland, (original birth date unknown) (''Highlander II''), 896 BC, Ancient Egypt (original birth date unknown) (''Highlander II''), California, ...
infoboxSuperbowl_stadium  | LOC   | Sun Devil Stadium, Georgia Dome, Miami Orange Bowl, Hubert H. Humphrey Metrodome, Dolphin Stadium, Raymond James Stadium, Louisiana Superdome, Joe Robbie Stadium, Ford Field, Los Angeles Memorial Coliseum, ...
infoboxWeapon_usedBy      | LOC   | USA, None, One, none, Italy, United States, Mexico, UK, Russia, Under development, ...
minorLeagueTeam_league    | MISC  | Eastern League (1923-37, 1940-63, 1967-68, 1992- ), Pacific Coast League, Arizona League, Texas League, South Atlantic League, California League, Midwest League, Northwest League, International League, Carolina League, ...
infoboxTea_teaOrigin      | LOC   | Nuwara Eliya, Sri Lanka near Adam's Peak between 2200 - 2500 metres, Japan, India, Vietnam, Taiwan, Turkey, China, Anhui, Guangdong, Jiangxi, ...
infoboxPrimeMinister_name | PER   | Abdallah El-Yafi, Umar al-Muntasir, Dr. Abdellatif Filali, Abderrahmane Youssoufi, Abdessalam Jalloud, Abdul Ati al-Obeidi, Abdul Hamid al-Bakkoush, Abdul Majid Kubar, Abdul Majid al-Qaud, Abd al-Qadir al-Badri, ...
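A sketch of how such a mapping can generate silver-standard training data: each known property value is projected back onto text as a BIO-labelled named entity for retraining. This is a hypothetical simplification of the actual pipeline:

```java
import java.util.*;

/** Project infobox property values back onto text as silver NE labels. */
public class SilverAnnotator {
    static final Map<String, String> PROPERTY_TO_CONLL = Map.of(
        "infoboxPrimeMinister_name", "PER",
        "infoboxSuperbowl_stadium",  "LOC");

    /** BIO-label every occurrence of a known property value in the sentence. */
    static String[] label(String[] tokens, String value, String property) {
        String type = PROPERTY_TO_CONLL.get(property);
        String[] bio = new String[tokens.length];
        Arrays.fill(bio, "O");
        String[] v = value.split(" ");
        for (int i = 0; i + v.length <= tokens.length; i++) {
            boolean hit = true;
            for (int j = 0; j < v.length; j++)
                if (!tokens[i + j].equals(v[j])) { hit = false; break; }
            if (hit) {
                bio[i] = "B-" + type;
                for (int j = 1; j < v.length; j++) bio[i + j] = "I-" + type;
            }
        }
        return bio;
    }

    public static void main(String[] args) {
        String[] toks = {"The", "game", "was", "played", "at", "Sun", "Devil", "Stadium"};
        System.out.println(Arrays.toString(
            label(toks, "Sun Devil Stadium", "infoboxSuperbowl_stadium")));
        // [O, O, O, O, O, B-LOC, I-LOC, I-LOC]
    }
}
```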
35 Seminari UB 2010
Publicity Break ;)
Zaragoza et al. Semantically Annotated Snapshot of the English Wikipedia v.1 (SW1) [LREC 2008]
– 1,490,688 entries
– Available through the Webscope Program
– Open Source Technology
– Distribution licensed under the GNU Free Documentation License
Atserias et al. Viquipèdia Catalana Semànticament Etiquetada 1.0 (Semantically Annotated Catalan Wikipedia 1.0). Jornada del Processament Computacional del Català, 2009
– 159,716 entries (dump of 23/01/2009)
– Available at http://www.glicom.upf.edu/publicacions/recursos
– Open Source Technology
– Distribution licensed under the GNU Free Documentation License
Future releases (we already have a new dump of the Wikipedia): improved tagging performance
36 Seminari UB 2010
Summary
1. Presentation of the Yahoo! Barcelona NLR group
2. Aims of the group
3. NLR Architecture
4. Domain Adaptation
5. Applications
1. Correlator
2. Quest
6. Living Knowledge (Ad)
37 Seminari UB 2010
Two and a half Applications showing NLP for IR
• Correlator (NEs for new ways of browsing)
• Quest (browsing Q&As)
• Financial News: Correlator predicting the future
38 Seminari UB 2010
Correlator
Live at http://correlator.sandbox.yahoo.net. Launched in Feb. 2009. Receives 50K queries per month.
39 Seminari UB 2010
Correlator
English Wikipedia, Sep. 2008: 53,408,510 sentences and 26,110,586 typed entities automatically tagged.
Finds related people, organizations, locations, events, or general semantic concepts such as nationalities, materials, food, works of art, etc.
Supports searching for concepts that are distributed across several entries, for instance: Dinosaurs in Argentina, Picasso and Peace.
40 Seminari UB 2010
Synthetic Page
41 Seminari UB 2010
42 Seminari UB 2010
43 Seminari UB 2010
44 Seminari UB 2010
45 Seminari UB 2010
Yahoo! Quest
• Public demo on the Yahoo! Sandbox (Nov. 2009)
• Helps browsing Q&As
• Uses 4.5M question-answer pairs from the Yahoo! Answers site (distributed through Webscope)
• Try it at http://quest.sandbox.yahoo.net
46 Seminari UB 2010
Yahoo! Quest
47 Seminari UB 2010
Yahoo! Quest
• Preprocessing
– NLP pipeline
– Extract phrases
– Build a fast index
• Online
– Retrieve documents
– Retrieve phrases
– Rank phrases
48 Seminari UB 2010
Extracting Key-phrases
• Non-phraseness:
– be benefits
– What would
– benefits of
– the
• Phraseness:
– benefits
– free college education
49 Seminari UB 2010
Extracting Key-phrases
• Non-informative:
– be
– free
– (?) the benefits
– (?) the benefits of free college education
• Informative:
– college education
– free education
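One common way to score these two properties (not necessarily what Quest uses) is pointwise mutual information for phraseness and a foreground-vs-background frequency contrast for informativeness. A toy sketch with made-up probabilities:

```java
/** Toy phraseness / informativeness scores for a candidate bigram. */
public class PhraseScores {
    /** Phraseness: PMI of the bigram vs. its unigrams (probabilities assumed given). */
    static double phraseness(double pBigram, double pW1, double pW2) {
        return Math.log(pBigram / (pW1 * pW2));
    }

    /** Informativeness: how much more frequent in the Q&A corpus than in background text. */
    static double informativeness(double pForeground, double pBackground) {
        return pForeground * Math.log(pForeground / pBackground);
    }

    public static void main(String[] args) {
        // "college education" glues together well AND is domain-typical...
        System.out.printf("college education: phr=%.2f inf=%.5f%n",
            phraseness(1e-5, 4e-4, 2e-4), informativeness(1e-5, 1e-6));
        // ..."what would" glues well too, but is no more frequent than in background text
        System.out.printf("what would:        phr=%.2f inf=%.5f%n",
            phraseness(2e-4, 5e-3, 4e-3), informativeness(2e-4, 2e-4));
    }
}
```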
50 Seminari UB 2010
Phrase Extraction
Term extraction proceeds in two steps:
• extract the entire phrase under a head
• then extract all target-head bigrams
• clean up (stopword list)
Can we group the phrases?
• We distinguish different “types” of phrases, e.g. nominal phrases vs. verbal ones
• We group phrases using the head’s PoS and dependency label (see the table and sketch below)
51 Seminari UB 2010
Grouping Phrases

Dep Label | PoS | Type
sbj       | NN  | N
root      | VBP | V
amod      | CD  | Q
adv       | RB  | M
nmod      | JJ  | K
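A sketch of the bigram target-head step over a dependency-parsed sentence, with stopword cleanup and typing. The token representation is hypothetical, and for brevity the mapping here keys on the head's PoS only, whereas the table above combines the dependency label and the PoS:

```java
import java.util.*;

/** Extract head+modifier bigram phrases from a dependency parse and type them. */
public class PhraseExtractor {
    record Token(String form, String pos, String depLabel, int head) {} // head: 1-based, 0 = root

    // simplified PoS -> phrase-type mapping (the full system uses (dep label, PoS) pairs)
    static final Map<String, String> POS_TO_TYPE = Map.of(
        "NN", "N", "NNS", "N", "VBP", "V", "VBD", "V",
        "CD", "Q", "RB", "M", "JJ", "K");

    static final Set<String> STOPWORDS = Set.of("the", "a", "of", "be");

    /** One "modifier_head" bigram per dependency edge, typed by the head's PoS. */
    static List<String> bigrams(List<Token> sent) {
        List<String> out = new ArrayList<>();
        for (Token t : sent) {
            if (t.head() == 0 || STOPWORDS.contains(t.form().toLowerCase())) continue;
            Token head = sent.get(t.head() - 1);
            String type = POS_TO_TYPE.getOrDefault(head.pos(), "?");
            out.add(type + ":" + t.form() + "_" + head.form());
        }
        return out;
    }

    public static void main(String[] args) {
        // "free college education": both modifiers attach to the head "education"
        List<Token> sent = List.of(
            new Token("free", "JJ", "nmod", 3),
            new Token("college", "NN", "nmod", 3),
            new Token("education", "NN", "root", 0));
        System.out.println(bigrams(sent)); // [N:free_education, N:college_education]
    }
}
```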
52 Seminari UB 2010
Phrase Types
Classify phrases by “type”:
– Noun (N): e.g. market_crash, my_next_baby
– Verb (V): e.g. plan, get_controller_instance
– Manner (M): e.g. ever, on_the_hill
– Question (Q): e.g. any_of_the_store, a, my
– Many (K): e.g. many, cute, very_good
New group: phrases containing parts of the query seem more relevant
53 Seminari UB 2010
Some Figures

Type  | Phrases    | Unique phrases
V     | 11,850,760 |   691,406
N     | 12,364,631 | 3,583,684
M     |  9,446,031 | 1,908,610
Q     |  5,355,686 |   170,205
K     |  2,618,146 |   319,577
Total | 41,635,254 | 6,673,482
54 Seminari UB 2010
Many Open Issues
• Parsing adaptation for questions
• Key phrase extraction
• Key phrase ranking
• Evaluation Framework
55 Seminari UB 2010
Parsing Questions
• Verbose, complex and long questions
• Text is not always grammatical
• Text is noisy
• Few questions in the training data (the WSJ Penn Treebank contains 3,553 questions, about 0.75%)
LOTS OF ERRORS! (79% LAS)
56 Seminari UB 2010
Parsing Questions
• No resources to re-label/re-train
• Not interested in the complete parse tree
• Transform questions into statements
• Using a set of about 70 simple patterns:
– what kind of => this kind of
– Does anyone have => you have
• This resulted in significantly better parses and much better phrases.
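A sketch of this pattern-based rewriting, using only the two patterns quoted above (the real set of ~70 is not reproduced here):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Rewrite questions into statements with simple surface patterns before parsing. */
public class QuestionRewriter {
    // Two of the ~70 patterns mentioned in the talk
    static final Map<String, String> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("(?i)^what kind of", "this kind of");
        PATTERNS.put("(?i)^does anyone have", "you have");
    }

    static String rewrite(String question) {
        String s = question.replaceAll("\\?$", ".");   // drop the question mark
        for (var p : PATTERNS.entrySet())
            s = s.replaceFirst(p.getKey(), p.getValue());
        return s;
    }

    public static void main(String[] args) {
        System.out.println(rewrite("What kind of car should I buy?"));
        // -> "this kind of car should I buy."
        System.out.println(rewrite("Does anyone have a good recipe?"));
        // -> "you have a good recipe."
    }
}
```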
57 Seminari UB 2010
Parsing Adaptation
We are currently pursuing research in parser adaptation for questions, and have publicly released a corrected PoS and dependency data set of questions from Yahoo! Answers to support this research (Atserias et al., 2010).
With only about 1,000 questions we reach similar performance on questions (87% LAS) without hurting the general results (86% LAS).
58 Seminari UB 2010
Summary
1. Presentation of the Yahoo! Barcelona NLR group
2. Aims of the group
3. NLR Architecture
4. Domain Adaptation
5. Applications
1. Correlator
2. Quest
6. Living Knowledge (Ad)
59 Seminari UB 2010
Living Knowledge
• European Project (7th Framework)
• http://livingknowledge-project.eu/
• WP7: Predicting the future
60 Seminari UB 2010
Living Knowledge
http://livingknowledge-project.eu/
61 Seminari UB 2010
62 Seminari UB 2010
63 Seminari UB 2010
Summary
1. Presentation of the Yahoo! Barcelona NLR group
2. Aims of the group
3. NLR Architecture
4. Domain Adaptation
5. Applications
1. Correlator
2. Quest
6. Living Knowledge (Ad)
64 Seminari UB 2010
Any questions?
Thank you for your attention!