Challenges in NLP

Zareen Syed

Presentation given at PyLadies meetup Montreal

Page 1: Challenges in nlp

Challenges in NLP

Zareen Syed


Page 2: Challenges in nlp

Ambiguity

• Natural language is highly ambiguous and must be disambiguated.

Page 3: Challenges in nlp

Ambiguity in Speech

• Speech Recognition – “recognize speech” vs. “wreck a nice beach”

– “youth in Asia” vs. “euthanasia”

Page 4: Challenges in nlp

Ambiguity in Preposition Attachment

I saw the man on the hill with a telescope.

1. I saw the man. The man was on the hill. I was using a telescope.
2. I saw the man. I was on the hill. I was using a telescope.
3. I saw the man. The man was on the hill. The hill had a telescope.
4. I saw the man. I was on the hill. The hill had a telescope.
5. I saw the man. The man was on the hill. I saw him using a telescope.
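
The five readings of the telescope sentence can be reproduced mechanically. The sketch below is a minimal illustration, not something from the slides: it assumes a toy grammar and lexicon (both made up here) in which a prepositional phrase may attach either to a noun phrase or to a verb phrase, and it counts parse trees with a CKY-style chart.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (an assumption for illustration only).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),  # the PP modifies the verb phrase
    ("NP", "NP", "PP"),  # the PP modifies the noun phrase
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]
# Toy lexicon: one category per word, enough for the example sentence.
LEXICON = {
    "I": "NP", "saw": "V", "the": "Det", "a": "Det",
    "man": "N", "hill": "N", "telescope": "N",
    "on": "P", "with": "P",
}

def count_parses(sentence, root="S"):
    """Count the distinct parse trees of `sentence` under the toy grammar."""
    tokens = sentence.rstrip(".").split()
    n = len(tokens)
    # chart[(start, end, label)] = number of trees with that label over the span
    chart = defaultdict(int)
    for i, token in enumerate(tokens):
        chart[(i, i + 1, LEXICON[token])] += 1
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for parent, left, right in BINARY_RULES:
                    chart[(start, end, parent)] += (
                        chart[(start, mid, left)] * chart[(mid, end, right)]
                    )
    return chart[(0, n, root)]

print(count_parses("I saw the man."))                               # 1
print(count_parses("I saw the man on the hill."))                   # 2
print(count_parses("I saw the man on the hill with a telescope."))  # 5
```

Each extra prepositional phrase multiplies the options, which is why attachment ambiguity grows quickly with sentence length.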

Page 5: Challenges in nlp

Humor and Ambiguity

• Many jokes rely on the ambiguity of language:

– One morning I shot an elephant in my pajamas. How he got into my pajamas, I’ll never know.

– She criticized my apartment, so I knocked her flat.

Page 6: Challenges in nlp


Polysemy Word Sense Disambiguation (WSD)

• Words in natural language usually have a fair number of different possible meanings.

– Ellen has a strong interest in computational linguistics.
– Ellen pays a large amount of interest on her credit card.
– The dog is in the pen.
– The ink is in the pen.
– I put the plant in the window.
– Ford put the plant in Mexico.

• For many tasks (question answering, translation), the proper sense of each ambiguous word in a sentence must be determined.
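
One classic way to pick a sense is to compare the words around the ambiguous word with a short definition of each candidate sense. The sketch below is a simplified Lesk-style overlap count, offered only as an illustration: the two-sense inventory for "pen", its glosses, and the stopword list are all made up for this example.

```python
import re

# Hypothetical sense inventory; the glosses are written for this example
# and are not taken from any real dictionary.
SENSES = {
    "pen": {
        "writing_instrument": "a tool for writing or drawing with ink on paper",
        "enclosure": "a small fenced enclosure for animals such as a dog or sheep",
    }
}
STOPWORDS = {"a", "an", "the", "is", "in", "on", "for", "or", "with", "as", "such"}

def content_words(text):
    """Lowercased words of `text`, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def simplified_lesk(word, sentence):
    """Return the sense whose gloss shares the most words with the sentence."""
    context = content_words(sentence) - {word}
    scores = {
        sense: len(context & content_words(gloss))
        for sense, gloss in SENSES[word].items()
    }
    return max(scores, key=scores.get)

print(simplified_lesk("pen", "The dog is in the pen."))  # enclosure
print(simplified_lesk("pen", "The ink is in the pen."))  # writing_instrument
```

The approach breaks down as soon as the context shares no words with any gloss, which is one reason word sense disambiguation remains hard.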

Page 7: Challenges in nlp

Some more examples of Polysemy

– A world record.
– A record of the conversation.
– Record it!

– He left the bank five minutes ago.
– He left the bank five years ago.
– He caught a fish at the bank.

– I need some paper.
– I wrote a paper.
– I read the paper.

Page 8: Challenges in nlp
Page 9: Challenges in nlp

Computers are no better than your dog.

But we can teach them “how-to” by coding our knowledge of the language comprehension process.

Page 10: Challenges in nlp

Co-Reference Resolution

• Determine which phrases in a document refer to the same underlying entity.

– John put the carrot on the plate and ate it.

– Bush started the war in Iraq. But the president needed the consent of Congress.
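
To see why the carrot example is hard, consider a naive baseline that resolves "it" to the most recently mentioned inanimate noun phrase. The sketch below is only an illustration of that heuristic: the mention list and the animacy labels are hand-written assumptions, where a real system would obtain them from a parser or tagger.

```python
# Hand-labelled mentions in sentence order (an assumption for illustration).
MENTIONS = [
    {"text": "John", "animate": True},
    {"text": "the carrot", "animate": False},
    {"text": "the plate", "animate": False},
]

def resolve_it(mentions):
    """Pick the most recent inanimate mention as the antecedent of 'it'."""
    for mention in reversed(mentions):
        if not mention["animate"]:
            return mention["text"]
    return None

print(resolve_it(MENTIONS))  # the plate
```

The heuristic answers "the plate", while a human reader picks "the carrot" because plates are not eaten. Getting this right needs world knowledge, not just word order.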

Page 11: Challenges in nlp

Ellipsis Resolution

• Frequently words and phrases are omitted from sentences when they can be inferred from context.

"Wise men talk because they have something to say; fools, because they have to say something." (Plato)

With the elided verb restored: "Wise men talk because they have something to say; fools talk because they have to say something."

Page 12: Challenges in nlp


Information Extraction (IE)

• Identify phrases in language that refer to specific types of entities and relations in text.

• Named entity recognition is the task of identifying names of people, places, organizations, etc. in text.

– Michael Dell [person] is the CEO of Dell Computer Corporation [organization] and lives in Austin, Texas [place].

• Relation extraction identifies specific relations between entities.

– Michael Dell is the CEO of Dell Computer Corporation and lives in Austin, Texas.
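
For the Michael Dell sentence, both steps can be sketched with a single hand-written pattern. This is a deliberately brittle illustration, not how production extractors work: the regular expression, the entity labels, and the relation names `ceo_of` and `lives_in` are assumptions made up for this example.

```python
import re

# A run of capitalized words, optionally joined by a comma ("Austin, Texas").
NAME = r"[A-Z]\w+(?:,? [A-Z]\w+)*"
PATTERN = re.compile(
    rf"(?P<person>{NAME}) is the CEO of (?P<org>{NAME}) "
    rf"and lives in (?P<place>{NAME})"
)

def extract(sentence):
    """Return (entities, relations) found by the single hand-written pattern."""
    match = PATTERN.search(sentence)
    if match is None:
        return [], []
    person, org, place = match["person"], match["org"], match["place"]
    entities = [(person, "PERSON"), (org, "ORGANIZATION"), (place, "PLACE")]
    # Assumption: the subject of "lives in" is the same person as the CEO.
    relations = [(person, "ceo_of", org), (person, "lives_in", place)]
    return entities, relations

entities, relations = extract(
    "Michael Dell is the CEO of Dell Computer Corporation and lives in Austin, Texas."
)
print(entities)
print(relations)
```

Any rewording of the sentence defeats the pattern, which is why practical systems learn extractors from annotated text instead.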

Page 13: Challenges in nlp

Question Answering

• Directly answer natural language questions based on information presented in a corpus of textual documents (e.g. the web).

– When was Barack Obama born? (factoid)

• August 4, 1961

– Who was president when Barack Obama was born?

• John F. Kennedy

– How many presidents have there been since Barack Obama was born?

• 9
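
The three questions get harder in order: the first is a lookup, while the second and third need the looked-up date to be combined with other facts. The sketch below illustrates that composition over a small hand-built fact table. The function names are hypothetical, and the list of inaugurations stops at Obama, an assumption that matches the answer of 9 given on the slide.

```python
from datetime import date

BIRTHS = {"Barack Obama": date(1961, 8, 4)}

# (president, date of taking office) in chronological order. The list stops
# at Obama, which is an assumption matching the slide's answer.
TERMS = [
    ("Dwight D. Eisenhower", date(1953, 1, 20)),
    ("John F. Kennedy", date(1961, 1, 20)),
    ("Lyndon B. Johnson", date(1963, 11, 22)),
    ("Richard Nixon", date(1969, 1, 20)),
    ("Gerald Ford", date(1974, 8, 9)),
    ("Jimmy Carter", date(1977, 1, 20)),
    ("Ronald Reagan", date(1981, 1, 20)),
    ("George H. W. Bush", date(1989, 1, 20)),
    ("Bill Clinton", date(1993, 1, 20)),
    ("George W. Bush", date(2001, 1, 20)),
    ("Barack Obama", date(2009, 1, 20)),
]

def born(person):
    """Factoid lookup."""
    return BIRTHS[person]

def president_on(day):
    """Latest president who had taken office on or before `day`."""
    current = None
    for name, start in TERMS:
        if start <= day:
            current = name
    return current

def presidents_since(day):
    """Number of presidents who took office after `day`."""
    return sum(1 for _, start in TERMS if start > day)

birthday = born("Barack Obama")
print(birthday)                    # 1961-08-04
print(president_on(birthday))      # John F. Kennedy
print(presidents_since(birthday))  # 9
```

The code is trivial once the question has been mapped to the right function and arguments. That mapping, from free text to a structured query, is the hard part of question answering.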

Page 14: Challenges in nlp

Projects & Research

Page 15: Challenges in nlp

Wikitology: A Novel Hybrid Knowledge Base Derived from Wikipedia

Zareen Syed


Page 16: Challenges in nlp

Introduction and Motivation

[Figure: overlapping copies of a free-text message ("Our plans are going ahead, Heylighen is getting tickets, so let's put that in in hard pencil, but keep your eraser handy all the same. ..."), shown as an example of unstructured text]

• The human mind is capable of understanding and reasoning over knowledge in different forms, and is influenced by contextual factors.

• World knowledge is available in different forms.

• Context is important in understanding the semantics of data, and may itself be available in different forms.

Page 17: Challenges in nlp

Related Work


Michael Jackson

Michael Joseph Jackson (August 29, 1958 – June 25, 2009) was an American singer-songwriter, dancer, actor, choreographer, businessman, philanthropist and record producer.

Contents: Life and Career, Death, ..., See Also, References, External Links

Typical Wikipedia Article

Linked Open Data

Wikipedia Derived Knowledge Resources

Support Structured Queries

Supports Natural Language Queries

Wikitology

Supports Hybrid Queries

Page 18: Challenges in nlp

Wikitology

• Linked to the LOD (Linked Open Data) cloud of over 295 datasets

Page 19: Challenges in nlp

Wikitology Document Concept Prediction

• Identifying the topics and concepts associated with a document or collection of documents is a common task for many applications such as:

– Annotation and categorization of documents in a corpus

– Modelling user interests

– Business intelligence

– Selecting Advertisements


Page 20: Challenges in nlp

Experiments

Prediction for Single Test Document

– Test document title: Weather Prediction of thunder storms (CNN)
– Method 1 (Ranking Categories Directly): “Weather_Hazards”
– Method 2 (Spreading Activation, Pulses=3): “Meteorology”
– More pulses -> More Generalized Concepts

Prediction for a Set of Documents

– Data set: 10 articles related to Organic Farming
– Method 1 (Ranking Categories Directly): Agriculture (Rank 1)
– Method 2 (2 pulses, Spreading Activation on Category Links Graph): Agriculture (in Top 5)
– Method 3 (2 pulses, Spreading Activation on Article Links Graph): Organic_farming (Rank 1)
– Concept not in the Category Hierarchy
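
The note that more pulses give more generalized concepts follows from how spreading activation works: on every pulse, each newly activated node passes a share of its activation one step further along the graph. The sketch below is a generic version under stated assumptions, not the exact algorithm used in Wikitology: the category graph, the decay factor of 0.5, and the function name are all hypothetical.

```python
# Hypothetical category graph: each node points to its broader categories.
BROADER = {
    "Thunderstorm": ["Weather_hazards"],
    "Weather_hazards": ["Meteorology"],
    "Meteorology": ["Atmospheric_sciences"],
}

def spread_activation(graph, seeds, pulses, decay=0.5):
    """Propagate activation from `seeds` along graph edges for `pulses` steps."""
    activation = dict(seeds)  # total activation collected per node
    frontier = dict(seeds)    # nodes that fire on the next pulse
    for _ in range(pulses):
        fired = {}
        for node, value in frontier.items():
            for neighbor in graph.get(node, []):
                fired[neighbor] = fired.get(neighbor, 0.0) + value * decay
        for node, value in fired.items():
            activation[node] = activation.get(node, 0.0) + value
        frontier = fired
    return activation

for pulses in (1, 2, 3):
    reached = spread_activation(BROADER, {"Thunderstorm": 1.0}, pulses)
    print(pulses, sorted(reached, key=reached.get, reverse=True))
```

With one pulse only the immediate parent category is reached. Each further pulse activates a broader category, at a lower score because of the decay.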

Page 21: Challenges in nlp

Wikitology Cross Document Co-reference Resolution

• Problem:

– Determine whether various named entities in different documents refer to the same object in the world.

• Are two documents that talk about “George Bush” talking about the same George Bush?

– Defined as a task in ACE

Page 22: Challenges in nlp

Wikitology Entity Linking

• Research Problem:

– Given an entity mention string and an article with that entity mention, find the link to the right Wikipedia entity if one exists.

– Defined as a task in TAC KBP Track

Mention: John Williams

Article: Richard Kaufman goes a long way back with John Williams. Trained as a classical violinist, Californian Kaufman started doing session work in the Hollywood studios in the 1970s. One of his movies was Jaws, with Williams conducting his score in recording sessions in 1975...

Knowledge Base

– John Williams, author, 1922-1994
– J. Lloyd Williams, botanist, 1854-1945
– John Williams, politician, 1955-
– John J. Williams, US Senator, 1904-1988
– John Williams, Archbishop, 1582-1650
– John Williams, composer, 1932-
– Jonathan Williams, poet, 1929-

Identify matching entry, or determine that entity is missing from KB
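
The John Williams example can be sketched as a two-step filter: discard knowledge-base entries that are implausible for the article, then score the rest by word overlap with the context, returning nothing when no entry scores well enough. This is a toy illustration, not the Wikitology approach: names, roles and life spans are from the slide, while the keyword sets, the `year` argument, and the overlap threshold are assumptions.

```python
import re

# Names, roles and life spans are from the slide. The keyword sets are
# hypothetical stand-ins for the text of each knowledge-base entry.
KB = [
    ("John Williams", "author", 1922, 1994, {"novel", "book", "writer"}),
    ("J. Lloyd Williams", "botanist", 1854, 1945, {"plants", "botany"}),
    ("John Williams", "politician", 1955, None, {"election", "party"}),
    ("John J. Williams", "US Senator", 1904, 1988, {"senate", "election"}),
    ("John Williams", "Archbishop", 1582, 1650, {"church", "bishop"}),
    ("John Williams", "composer", 1932, None, {"score", "conducting", "orchestra"}),
    ("Jonathan Williams", "poet", 1929, None, {"poem", "verse"}),
]

ARTICLE = (
    "Richard Kaufman goes a long way back with John Williams. Trained as a "
    "classical violinist, Californian Kaufman started doing session work in "
    "the Hollywood studios in the 1970s. One of his movies was Jaws, with "
    "Williams conducting his score in recording sessions in 1975..."
)

def link_entity(mention, context, year, min_overlap=1):
    """Return the best (name, role) entry for `mention`, or None if no match."""
    words = set(re.findall(r"[a-z]+", context.lower()))
    surname = mention.split()[-1]
    best, best_score = None, 0
    for name, role, born, died, keywords in KB:
        if surname not in name:
            continue  # name does not match the mention
        if born > year or (died is not None and died < year):
            continue  # not alive in the year the article talks about
        score = len(words & keywords)
        if score > best_score:
            best, best_score = (name, role), score
    return best if best_score >= min_overlap else None

print(link_entity("John Williams", ARTICLE, year=1975))
print(link_entity("John Williams", "John Williams spoke yesterday.", year=1975))
```

The first call links the mention to the composer. The second returns `None`, the "missing from KB" outcome, because the context gives no evidence for any candidate.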

Page 23: Challenges in nlp

Automatic Discovery of Slots and Fillers

Slot               Score  Example Fillers
Musician           1.00   ray_charles, sam_cooke, ...
Album              0.99   bad_(album), ...
Location           0.97   gary,_indiana, chicago, ...
Music_genre        0.90   pop_music, soul_music, ...
Label              0.79   a&m_records, epic_records, ...
Phonograph_record  0.67   give_in_to_me, this_place_hotel
Act                0.59   singing
Movie              0.46   moonwalker, ...
Company            0.43   war_child_(charity), ...
Actor              0.41   stan_winston, eddie_murphy, ...
Singer             0.40   britney_spears, ...
Magazine           0.29   entertainment_weekly, ...
Writing_style      0.27   hip_hop_music
Group              0.21   'n_sync, RIAA
Song               0.20   d.s._(song), ...

New Slots: Album, Movie, Phonograph_record/songs, Musician (related Musicians), Act

Page 24: Challenges in nlp

Wikitology Architecture and API


Page 25: Challenges in nlp

A Broader Unified Framework for Automatically Enriching Wikitology


Page 26: Challenges in nlp

CONCEPT PREDICTION
INFORMATION EXTRACTION
PART OF SPEECH TAGGING
CLUSTERING
CLASSIFICATION
SENTIMENT ANALYSIS
TAXONOMY MANAGEMENT
ENTITY LINKS GRAPH

Page 27: Challenges in nlp

Input text:

"Sixteen hours ago an American airplane dropped one bomb on Hiroshima, Japan, and destroyed its usefulness to the enemy. That bomb had more power than 20,000 tons of T.N.T. It had more than two thousand times the blast power of the British Grand Slam, which is the largest bomb ever yet used in the history of warfare".

These fateful words of the President on August 6th, 1945, marked the first public announcement of the greatest scientific achievement in history. The atomic bomb, first tested in New Mexico on July 16, 1945, had just been used against a military target.

On August 6th, 1945, at 8:15 A.M., Japanese time, a B-29 heavy bomber flying at high altitude dropped the first atomic bomb on Hiroshima. More than 4 square miles of the city were instantly and completely devastated. 66,000 people were killed, and 69,000 injured.

On August 9th, three days later, at 11:02 A.M., another B-29 dropped the second bomb on the industrial section of the city of Nagasaki, totally destroying 1 1/2 square miles of the city, killing 39,000 persons, and injuring 25,000 more.

On August 10, the day after the atomic bombing of Nagasaki, the Japanese government requested that it be permitted to surrender under the terms of the Potsdam declaration of July 26th which it had previously ignored.

Predicted Concepts (Title)

– Atomic_bombings_of_Hiroshima_and_Nagasaki
– Enola_Gay: Enola Gay was the name of the aircraft
– George_Weller: Weller's reports from Nagasaki after the nuclear bombing were censored by the United States military but appeared in a book in 2002.
– Little_Boy: "Little Boy" was the codename of the atomic bomb dropped on Hiroshima

None of this information is present as words in the given text!

Page 28: Challenges in nlp

Little Boy – Keyword Search

Example: 1

Query: Little Boy

More than 100,000 Results

Keyword search retrieves irrelevant documents in results as well

Page 29: Challenges in nlp

Field: wikiconceptref

Query: Little Boy

A conceptual search only retrieves relevant articles related to the “little boy” concept

100,000 results vs. 26 Relevant Results
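
The difference between the two searches can be sketched on a three-document toy index. The field name `wikiconceptref` appears on the slide. Everything else here is an assumption for illustration: the documents, the idea that the field stores a list of predicted concept titles, and both search functions.

```python
import re

# Hypothetical mini-index. Only the field name "wikiconceptref" comes from
# the slide; the documents and field contents are made up.
DOCS = [
    {"id": 1, "text": "The little boy played in the park.",
     "wikiconceptref": []},
    {"id": 2, "text": "A B-29 dropped the first atomic bomb on Hiroshima.",
     "wikiconceptref": ["Little_Boy", "Enola_Gay"]},
    {"id": 3, "text": "Little Boy was the codename of the bomb.",
     "wikiconceptref": ["Little_Boy"]},
]

def tokens(text):
    """Lowercased alphanumeric tokens of `text`."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_search(query):
    """Ids of documents whose text contains every query word."""
    terms = tokens(query)
    return [doc["id"] for doc in DOCS if terms <= tokens(doc["text"])]

def concept_search(concept):
    """Ids of documents tagged with the given concept."""
    return [doc["id"] for doc in DOCS if concept in doc["wikiconceptref"]]

print(keyword_search("Little Boy"))  # [1, 3]
print(concept_search("Little_Boy"))  # [2, 3]
```

Keyword search returns the irrelevant playground document and misses the Hiroshima one. Concept search does the opposite, because it matches on what a document is about, not on the words it happens to contain.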

Page 30: Challenges in nlp

BotColony

Page 31: Challenges in nlp

BotColony

Page 32: Challenges in nlp

20Q game

https://www.botcolony.com/ppSD2/custom/get_started/register-trial.php

Page 33: Challenges in nlp

BotColony 3D Game

Page 34: Challenges in nlp

Thank you

Page 35: Challenges in nlp

Wikitology Related Publications

1. Z. Syed and T. Finin. "Creating and Exploiting a Hybrid Knowledge Base for Linked Data", LNCS, Springer-Verlag. 2010. (submitted)

2. Z. Syed and T. Finin. “Approaches for Enriching Wikipedia”. In Proc. of the AAAI-2010 Workshop on Collaboratively-built Knowledge Sources and Artificial Intelligence. 2010.

3. Z. Syed and T. Finin. “Unsupervised techniques for discovering Ontology elements from Wikipedia article links”. International Workshop on Formalisms and Methodology for Learning by Reading (FAM-LbR). 2010.

4. Z. Syed, T. Finin and V. Mulwad. “Exploiting a Web of Semantic Data for Interpreting Tables”. In Proc. of Web Science Conference, WebSci’2010.

5. T. Finin and Z. Syed. "Creating and Exploiting a Web of Semantic Data", In Proc. of the Second International Conference on Agents and Artificial Intelligence. Jan. 2010.

6. T. Finin, Z. Syed, J. Mayfield, P. McNamee, and C. Piatko, "Using Wikitology for Cross-Document Entity Coreference Resolution", Proceedings of the AAAI Spring Symposium on Learning by Reading and Learning to Read, March 2009.


Page 36: Challenges in nlp

Wikitology Related Publications

7. J. Mayfield, D. Alexander, B. Dorr, J. Eisner, T. Elsayed, T. Finin, C. Fink, M. Freedman, N. Garera, P. McNamee, S. Mohammad, D. Oard, C. Piatko, A. Sayeed, Z. Syed, R. Weischedel, “Cross-Document Coreference Resolution: A Key Technology for Learning by Reading”, AAAI 2009 Spring Symposium on Learning by Reading and Learning to Read, March 2009.

8. Z. Syed and T. Finin. "Wikitology: A Novel Hybrid Knowledge Base derived from Wikipedia", In Proc. of the Grace Hopper Celebration of Women in Computing Conference, October 2009. (Abstract)

9. Z. Syed and T. Finin. "Wikitology: Wikipedia as an ontology", In Proc. of the Grace Hopper Celebration of Women in Computing Conference, October 2008. (Abstract)

10. Z. Syed, T. Finin and A. Joshi. 2008. “Wikipedia as an Ontology for Describing Documents”. In Proc. of the International Conference on Weblogs and Social Media. 2008.

11. Mark Dredze, Paul McNamee, Delip Rao, Adam Gerber, and Tim Finin, “Entity Disambiguation for Knowledge Base Population”, Proceedings of the 23rd International Conference on Computational Linguistics. 2010.
