26. Applied IR and Natural Language Processing [1]...
Transcript of 26. Applied IR and Natural Language Processing [1]...
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
1 of 37 11/25/2006 3:30 PM
26. Applied IR and Natural Language Processing [1]
IS 202 - 28 November 2008
Bob Glushko
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
2 of 37 11/25/2006 3:30 PM
Plan for Today's Class
Plan for the "Home Stretch"
Overview and Examples of Applied IR and Natural Language Processing
Machine Translation
Speech Processing
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
3 of 37 11/25/2006 3:30 PM
Plan for the Home Stretch
We are grading your last two assignments
This Week: 2 Lectures on applied information retrieval and natural language processing
Next Tuesday: 202 Alumni Day -- IO & IR from the perspective of recent I-School grads
Thursday 12/8 - last day of class -- course review
Tuesday 12/12 -- final exam (9-12)
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
4 of 37 11/25/2006 3:30 PM
Natural Language Processing
What Dave says
Hal's reply
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
5 of 37 11/25/2006 3:30 PM
Natural Language Processing
NLP has the goal of creating computers and machines that can use natural language -- i.e., the language used by people -- as their inputs and outputs
The field is broad, and involves computer science, linguistics, cognitive psychology, statistics
We're including this "taste" of NLP in this course because it illustrates many IR techniques and in some cases illustrates the tradeoffs between IO and IR
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
6 of 37 11/25/2006 3:30 PM
Real World Applications [1]
Machine Translation
Spelling Suggestions/Corrections
Grammar Checking
Speech Processing
Text Categorization and Clustering
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
7 of 37 11/25/2006 3:30 PM
Real World Applications [2]
Summarization
Synonym Generation
Text Data Mining
Information Extraction
Automated Customer Service
Question Answering
Automated Metadata Assignment
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
8 of 37 11/25/2006 3:30 PM
Two Radically Different Approaches to NLP
Linguistic Approach:
Linguistic models of grammar, morphology, and phonology are essential prerequisites for NLP
We also need to develop models of the "human language processor" and combine them with the linguistic models
Statistical Approach:
Statistical analysis of language samples reveals its structure and patterns
This extracted knowledge -- represented as the occurrence or co-occurrence probabilities of specific things -- can answer many of the questionsthat supposedly require "understanding" or more abstract "rules"
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
9 of 37 11/25/2006 3:30 PM
An Important Motivation for Data-Driven Language Study
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
10 of 37 11/25/2006 3:30 PM
Early Data-driven Perspectives on Language [1]
Just as in the last 10-15 years, as cognitive science has emerged in the intersection of cognitive psychology, linguistics, and computer science, if we look back to when computers were first being invented around 1940 the study of language reflected the prevailing psychology theories of behaviorism and structuralism
These theories held that language was learned through empirical learning mechanisms of conditioning, association, practice in exercising skills
These stimulus -> response notions do not postulate any internal knowledge representation.This perspective on language suggests a statisticalapproach to NLP
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
11 of 37 11/25/2006 3:30 PM
Early Data-driven Perspectives on Language [2]
GK Zipf first expressed "Zipf's Law" in a 1935 book titled Psycho-Biology of Language
All of the work in WWII on codebreaking and cryptography emphasizes the empirical study of word and language patterns (Turing)
Claude Shannon on information theory (1948) saysthat the information in a message is not defined by its content but by its probability of being chosen among several alternatives
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
12 of 37 11/25/2006 3:30 PM
Chomsky's Argument for "Deep" Language Analysis [1]
In late 1950s the statistical approaches fell out of favor, mostly because of the work of Noam Chomsky (Syntactic Structures, 1957)
Colorless green ideas sleep furiously
Furiously sleep ideas green colorless
Neither of these sentences is natural language andwon't occur in language samples, but the former is grammatical and the second isn't
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
13 of 37 11/25/2006 3:30 PM
Chomsky's Argument for "Deep" Language Analysis [2]
So language knowledge can't be based on learned behavior -- because the relevant data is sparse (many reasonable sentences never appear); instead, it is generative and based on rules and representations
This view of language as a formal, mathematically-describable system became the dominant view in linguistics and computer science
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
14 of 37 11/25/2006 3:30 PM
Probabilistic Models: The Data Strikes Back
These spoken phrases can be acoustically identical:
Your lie cured mother
You like your mother
Starting in the late 1980s, speech recognition systems used huge sets of speech data to build probabilistic models
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
15 of 37 11/25/2006 3:30 PM
NLP by Computers != NLP by People
The "rules of language" are a theory of the knowledge that fluent speakers possess (competence), not a theory of how they generate and understand language (performance)
We have no conscious awareness of how we process language, and while we can sometimes explicitly apply the "rules" it certainly doesn't seem that we use that kind of abstract information in "normal" NLP
But likewise, it doesn't seem that we explicitly use information about the likelihood of various language structures occurring or co-occurring, which is what statistical NLP does
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
16 of 37 11/25/2006 3:30 PM
"Learning the statistics" != "Statistically-driven learning"
What people seem to do as they learn language is"statistically-driven learning," not "learning the statistics"
That is, they use statistical evidence to build knowledge about the language into internal representations and language processing mechanisms that are more general than the specific data they were built with
Human language processing has been successfullymodeled using neural nets and other representations in which the "statistics" are encoded in the pattern of activations in a distributed way
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
17 of 37 11/25/2006 3:30 PM
Combined Approaches
Today both the linguistic and data-driven approaches are seen as integral and complementary parts of an NLP application
Systems employ sophisticated techniques for dictionaries and grammars to identify parts of speech and do morphological analysis
The web is such a huge corpus that empirical approaches can be surprisingly informative
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
18 of 37 11/25/2006 3:30 PM
Text Corpora
Computational linguists, computer scientists, experimental psychologists and others rely on text corpora for their research
Prominent pre-web examples include the Brown corpus (Kucera and Francis, 1967) that includes a million words of contemporary American English...
... and the British National Corpus (http://www.natcorp.ox.ac.uk/) that contains 100 million words of contemporary British English
But as large as the BNC is, because of Zipf's Law most words occur fewer than 50 times in 100,000,000 words -- not frequent enough to draw statistical conclusions
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
19 of 37 11/25/2006 3:30 PM
The Web as Corpus
The web is not representative of anything other than itself, but then neither are other text corpora
But the web dwarfs any other (possible?) corpus -- Google probably indexes a few trillion words, making it orders of magnitude larger than any othertext collection
And most of it is freely available
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
20 of 37 11/25/2006 3:30 PM
Phrases in BNC and Google
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
21 of 37 11/25/2006 3:30 PM
Simplistic Computational Linguistics Using Web Corpus
Spelling correction weird or wierd? (compare the body and head titles here)
Word selection in translation:French phrase groupe de travail
groupe translates to cluster, group, grouping, concern, collective
travail translates to work, labor, labour
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
22 of 37 11/25/2006 3:30 PM
Machine Translation: An Apocryphal but Important Example
A story often told about the early days of machine translation research (1950s) is that the English sentence:
The spirit is willing, but the flesh is weak
when translated into Russian, and then back to English became:
The vodka is strong but the meat is rotten
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
23 of 37 11/25/2006 3:30 PM
Machine Translation: A BriefHistory [1]
Great optimism in the 1950s was followed by extreme pessimism
In 1966 a report by a the Automatic Language Processing Advisory Committee (ALPAC) concluded that "there is no immediate or predictable prospect of useful machine translation" and instead recommended the development of computer aids for human translators
Fortunately, ALPAC also recommended the continued support of basic research in computational linguistics
In the 1970s and 1980s MT systems continued to develop, strongly driven by the emergence of microcomputers and text processing. The dominanttechnical strategy relied on hand-crafted syntactic parsers, morphological analyzers, and dictionaries - intensely semantic and rule-based approaches.
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
24 of 37 11/25/2006 3:30 PM
Machine Translation: A BriefHistory [2]
The beginning of the 1990s was a major turning point. An IBM research group developed a systemcalled Candide that relied purely on statistical analysis and "example-based" methods for phrase matching and translation
Candide used a very large corpus of English and French documents that had extremely reliable translations (in both directions)
This approach has really taken off with the emergence of the Web for obvious reasons...
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
25 of 37 11/25/2006 3:30 PM
How Good is Machine Translation? [1]
Google Translate
Original Text: Microsoft's release of its Xbox 360 video-games console begins a new phase in the battle to remove Sony's PlayStation from the top spot. If it succeeds, the software giant may be tempted to make more incursions into the competitive market for home-entertainment hardware.
Roundtrip through French (Nov 2005):The release of Microsoft of its console the videoone of Xbox 360 begins a new phase in the battle to remove PlayStation of Sony of the higher spot. If itsucceeds, the giant of software can be tried to transform more incursions into the competitive market for the material of house-entertainment.
Roundtrip through French (Nov 2006):The release of Microsoft of its console the videoone of Xbox 360 begins a new phase in the battle to remove PlayStation of Sony of the higher spot. If it
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
26 of 37 11/25/2006 3:30 PM
succeeds, the giant of software can be tried to transform more incursions into the competitive market for the material of house-entertainment
How Good is Machine Translation? [2]
Original Text: Microsoft's release of its Xbox 360 video-games console begins a new phase in the battle to remove Sony's PlayStation from the top spot. If it succeeds, the software giant may be tempted to make more incursions into the competitive market for home-entertainment hardware.
Roundtrip through German (Nov 2005): ReleaseMicrosofts of its video game console Xbox 360 begins a new phase in the battle for removing from PlayStation Sonys from the upper point. If itfollows, the software giant can be provoked, in order to form more ideas into the free market for house maintenance small articles.
Roundtrip through German (Nov 2006): Release of Microsoft of its Xbox 360 video game console begins a new phase in the battle to remove to of PlayStation Sony from the upper point to. If it follows, the software giant can be provoked, in
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
27 of 37 11/25/2006 3:30 PM
order to form more ideas into the free market for house maintenance of small articlesHow Good is Machine
Translation? [3]
Original Text: Microsoft's release of its Xbox 360 video-games console begins a new phase in the battle to remove Sony's PlayStation from the top spot. If it succeeds, the software giant may be tempted to make more incursions into the competitive market for home-entertainment hardware.
Roundtrip through Chinese (Nov 2005): Its Xbox 360 video-games control bench Microsoft. The srelease starts one new stage removes Sony in thisbattle; s PlayStation from this top spot. If itsucceeds, perhaps the software giant does invadesinto the competitive market for the familyentertainment hardware
Roundtrip through Chinese (Nov 2006): Microsoft releases xbox360 video game consoles beginning of a new stage in the battle to remove Sony's games from the top. If successful,The software giant may be more competitive intrusion into family
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
28 of 37 11/25/2006 3:30 PM
entertainment hardware market
Word Sense Disambiguation using Lexical Co-Occurrences
The co-occurrences of words in a text collection can tell us what the documents are about and distinguish different senses of polysemous words
Co-occurrences of frequent words are uninteresting:
"doctor" co-occurs with "with," "a," and "is"
Co-occurrences between two less frequent words help up understand what a text is about:
"doctor" co-occurs with "honorary," "dentist," "nurse," "examine, "treat," etc.
"drug" co-occurs with "price," "prescription" and "patient" in medical contexts and with "abuse," "paraphernalia," and "illicit" in non-medical contexts
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
29 of 37 11/25/2006 3:30 PM
Pre-Nominal Adjective Ordering
In translation and text generation we need to arrange nouns and adjectives:
My big fat Greek wedding is grammatical
My fat Greek big wedding is not
Some NLP approaches attempt this using rules
But it can also be done using data-intensive approaches; compare frequency of {a, b} with {b, a}
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
30 of 37 11/25/2006 3:30 PM
Speech Recognition
As we saw in the 2001: A Space Odyssey exampleI started with, speech recognition has great potential to enhance how people interact with machines
Like machine translation, speech recognition has a long history in NLP, but it raises many additional technical challenges because of the acoustical and signal processing challenges
And like machine translation, speech processing had a rule-based phase but it is now much more statistical in character
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
31 of 37 11/25/2006 3:30 PM
Speech Recognition - SimplifiedProcess Depiction
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
32 of 37 11/25/2006 3:30 PM
Speech Recognition -- Uses Transition Probabilities
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
33 of 37 11/25/2006 3:30 PM
Speech Recognition - Hidden Markov Models
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
34 of 37 11/25/2006 3:30 PM
Speech Recognition for Consumers
Continuous speech recognition systems for dictation can be remarkably accurate if the system has been trained to recognize the speaker's voice
Nuance Dragon Naturally Speaking(www.nuance.com/naturallyspeaking)
IBM Via Voice(www-306.ibm.com/software/voice/viavoice)
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
35 of 37 11/25/2006 3:30 PM
Speech Recognition in IVR Applications
Integrated Voice Response (IVR) systems commonly used in automated customer support and self-service applications
They need to be speaker-independent, and achieving this is easier to the extent that the domain and vocabulary are constrained
This technology is NOT for consumers; it is often sold by telecom OEMs and VARs (EXAMPLE)
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
36 of 37 11/25/2006 3:30 PM
Speech Synthesis
Text-to-speech synthesis (thanks to ATT labsdemo)
Being able to synthesize speech from computer-readable text is useful in applications that require spontaneous interaction ...
... or where reading isn't practical, like when you're driving and need directions.
Speech synthesis enables accessibility for people who are sight- or voice-impaired
When TTS is aimed at the general public, "naturalness" is essential to acceptance
TTS seems like the reverse of speech recognition, but there is limited technology overlap and neither involves much language understanding
26. Applied IR and Natural Language Processing [1] (1) file:///C:/Documents%20and%20Settings/glushko/My%20Documents/L...
37 of 37 11/25/2006 3:30 PM
Readings for IO & IR Lecture #27
"A Plan for Spam" Paul Graham
"Web question answering: Is more always better? (SIGIR '02)" Susan Dumais, Michele Banko, Eric Brill, Jimmy Lin, Andrew Ng
"Mining and Summarizing Customer Reviews"Minquing Hu and Bing Liu. Proceedings of thetenth ACM SIGKDD (2004)