Natural Language Processing and Machine Learning
Aarhus Universitet, 2016
Leon Derczynski
Course structure
● Four parts:
– Starting NLP and information retrieval
– Machine learning and sentiment analysis
– Structured prediction, feature extraction
– Entity extraction, social media, unsupervised
● Weeks 6, 7, 9, 11
Course structure
● Two lectures per week
– Wednesday afternoon: Theory
– Friday: Practical exercises
● Assessment via hand-ins
– One per week; two weeks to complete
– Mixture of coding and analysis
– Final one is free choice
– Submission via e-mail
● Also brief oral exam on topic of your choice
Course Goals
● Understand a natural language processing pipeline
● Build a small search engine
● Code, use and evaluate a statistical machine learning tool
● Describe many types of machine learning
● Describe biases present in ML approaches
● Choose the right approach for an NLP problem
● Do fundamental NLP tasks
● Program in Python & NLTK
Natural Language Processing
“Human knowledge is expressed in language. So computational linguistics is very important.”
● Mark Steedman
Natural Language Processing
● Basic AI task
– Language presumed unique
– Still a sign of intellect
● Replicating language comprehension and production is difficult
Natural Language Processing
● What is language?
– Physiological
– Vocal apparatus: velar
– Arose in humans 2M-300k years ago
Natural Language Processing
● Written language much newer
– ~5000 years old
– Not every language has it
– Transiency of spoken vs. intransiency of written
– Allows communication without speaker's presence
Natural Language Processing
● Huge scale of machines and also written data
● Writing is a way of mapping speech to text
– Not necessarily a rational mapping
– And speech isn't rational either!
● E.g., kamsteg
● Thought, though, through, tough… ouch!
Grand Challenges
● Machine translation
– Sequence to sequence
● I like your hat
● J'aime ton chapeau
● Jeg kan godt lide din hat
● Wǒ xǐhuān nǐ de màozi
– Humans seem able to teach languages to each other reasonably well
Grand Challenges
● Machine translation
– Old test of AI: John Searle's “Chinese Room”
– Problematic: Are Chinese people still intelligent if they also speak English?
– Point: symbol manipulation ≠ consciousness
Grand Challenges
● Machine translation
– When will we finish?
● 1950: Two years!
● 1960: Five years!
● 1970: Twenty years!
● 1985: Never!
● 2015: “NLP is kind of like a rabbit in the headlights of the Deep Learning machine, waiting to be flattened.”
– Neil Lawrence (U. Sheffield) @ ICML panel
● “Every time I fire a linguist, accuracy goes up!”
– Fred Jelinek @ IBM (allegedly)
Grand Challenges
● Dialogue
– This is how we prefer to communicate
– Easy, intuitive mode to engage in
Grand Challenges
● Dialogue
– Early systems: SHRDLU
Grand Challenges
● Dialogue
– Early systems: SHRDLU, ELIZA
– ELIZA: Please tell me what’s been bothering you.
– ME: Oh, hello, ELIZA. It’s been a long time.
– ELIZA: Hi. What seems to be your problem ?
– ME: Lately, I’ve been feeling a little depressed.
– ELIZA: And how do you feel right now ?
– ME: I feel sad, but relieved to be writing.
– ELIZA: Of what does feeling sad remind you ?
– ME: Well, I’m an alcoholic, and I’ve been sober for a long time, but I drank yesterday. And then I decided to stop drinking again. But still; that makes me feel sad.
– ELIZA: Is it because you are an alcoholic that you came to me ?
– ME: I suppose that that’s one of the reasons.
– ELIZA: Can you elaborate on that ?
– ME: I mean, it’s not the only reason, but alcoholism is a primary thing that has messed up my life.
– http://www.masswerk.at/elizabot/eliza.html
Grand Challenges
● Dialogue
– Many nuances
● Timing, context
– Turing test
● Tough problem – trick questions
– What does your wife do?
– How do you prefer your steaks?
– Which route did you take to get here?
● Silly hacks
– Eugene Goostman: pose as young & foreign
– Lowers audience expectations
Grand Challenges
● Semantic extraction
– Textual entailment
– “Frames” (Minsky)
● World knowledge problem
● Break the world down into scripts
– Legal understanding
– Inference
● All women are humans
● If X is a human, are they a woman?
Grand Challenges
● Language Generation
– Summarisation
– Descriptions
Grand Challenges
● Language Generation
– Journalism
[Figure: automatically generated news article, annotated with: similar/related past event, article creation timestamp, precise sub-event mentions, event summary/description, prior context]
Grand Challenges
● Language Generation
– Information filtering: what to include
● Not all information is relevant
● “5 months, 6 days, 11 hours, 2 minutes-”
– Legal issues
● Who owns the content?
● Who's responsible?
– “Russia launched a nuclear missile at Chicago”– “Lars Løkke is not very good at chess”
Grand Challenges
● Question Answering
– Take: question; return: answer
– Natural interrogation mode
● Where are you going?
● How tall is the Eiffel tower?
● What... is the air-speed velocity of an unladen swallow?
– Knowledge-base population
Grand Challenges
● Google is getting better!
● Problems:
– Searching habits
– How to capture questions?
Big problem
● Hard to tackle
● Some cases can be cast in computational terms
– MT: sequence-to-sequence
– Hard to represent entire world knowledge
● Language provides theoretical background
● Decompose!
– Let's build pipelines
Smaller challenges
● Letters in words
● Words
● Words in a sentence
Phonology and morphology
● Letters in words can describe their sounds
● Grouped to form phonemes
– p/b in pit / bit
● Letters also group to form semantic units
● Smallest groups are morphemes
– [disagree]/d/ment/s
– [run]ning/ner/s
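The morpheme examples above can be sketched as a toy suffix-stripper in Python. The suffix list and function name here are illustrative, and consonant doubling (the [run]ning case) is deliberately not handled — real morphological analysis needs much more.

```python
# Toy morpheme splitter: repeatedly strip known suffixes from the right.
# The suffix inventory is illustrative, not a real morphological lexicon.
SUFFIXES = ["ment", "ing", "er", "s"]

def split_morphemes(word):
    parts = []
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            # keep a reasonably long stem, so e.g. "is" isn't split
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                parts.insert(0, suffix)
                word = word[: -len(suffix)]
                stripped = True
                break
    return [word] + parts

print(split_morphemes("disagreements"))  # → ['disagree', 'ment', 's']
```

Note the greedy right-to-left stripping already fails on “running” (it would leave the stem “runn”), which is exactly why morphology is harder than it first looks.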
Parts of speech
● Categories of word
– Verb, noun, adjective
● Each word can belong to more than one category
– The river bank
– Bank this money
– I work at Google
– I google at work
Grammar
● Chunking
– Find sequences of words that represent a concept
– I [ran quickly] towards [the blue bus]
– Useful for getting key phrases
● Parsing
– Build tree structure of sentence
But first…
● We need words!
First step: tokenisation
● Text commonly comes as a sequence of bytes
– Difficult to process and design processes for!
● Convert to a sequence of tokens
– Mostly, these are like words
● Tokenisation: converting bytes to words
– Simple: split by spaces
– [“Simple:”, “split”, “by”, “spaces”]
– Oh dear...
Tokenisation edge cases
● Pretty good, in the general case, isn't it?
● What problems can you think of?
– Punctuation: doesn't → doesn, ', t
– Abbreviations: Mr. Gates → Mr, ., Gates
– There's always a long-tail effect:
● More effort, less reward
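The split-by-spaces approach and one common improvement can be sketched in a few lines of Python. The regex tokeniser here is an illustrative sketch, not NLTK's tokeniser; note that it still mishandles the “Mr.” abbreviation, as predicted above.

```python
import re

def split_tokenise(text):
    """Naive tokenisation: split on whitespace. Punctuation sticks to words."""
    return text.split()

def regex_tokenise(text):
    """Slightly better: keep contractions together, split off punctuation."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(split_tokenise("Simple: split by spaces"))
# → ['Simple:', 'split', 'by', 'spaces'] — the colon sticks to 'Simple'

print(regex_tokenise("Mr. Gates doesn't mind."))
# → ['Mr', '.', 'Gates', "doesn't", 'mind', '.'] — "doesn't" survives,
#   but "Mr." is still wrongly split: the long tail strikes again
```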
What if there are no spaces?
● Whatiftherearenospaces
● #nowthatcherisdead
● Or… what if we're speaking Chinese?
– 848M native speakers
– 1197M speakers total
– cf. ~1162M people in USA + Europe
What if there are no spaces?
– I like your hat
– Wǒ xǐhuān nǐ de màozi
– 我喜欢你的帽子
● Sentences can become long!
– 丹麦后卫处理以保持英国在欧盟
– Denmark backs deal to keep Britain in the EU
● How can we deal with this?
– Denmark = 丹麦
– Britain = 英国
– EU = 欧洲联盟 or 欧盟
Gazetteer lookup
● A gazetteer is a list of words
– A dictionary is a big one
● Its entries may help segment the sentence
● Some long words could be many other words
– Hit: dǎ 打
– Fire: huǒ 火
– Machine: jī 机
– Lighter: 打火机
● How can we handle this?
Greedy methods
● Find the longest match first
– This is a typical “greedy” search method
#nowthatcherisdead
– #
– # now
– # now thatcher
– # now thatcher is
– # now thatcher is dead
– # nowt
– # nowt hat ?
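The greedy longest-match search above (often called “maxmatch”) can be sketched in Python. The tiny vocabulary is illustrative; note how the output reproduces exactly the “# nowt hat” misalignment problem the slide warns about.

```python
# Greedy longest-match ("maxmatch") segmentation against a word list.
VOCAB = {"#", "now", "nowt", "that", "thatcher", "is", "dead", "hat"}

def max_match(text, vocab, longest=8):
    """Repeatedly take the longest prefix found in the vocabulary;
    fall back to a single character for out-of-vocabulary material."""
    tokens = []
    while text:
        for end in range(min(longest, len(text)), 0, -1):
            if text[:end] in vocab:
                tokens.append(text[:end])
                text = text[end:]
                break
        else:  # OOV recovery: emit one character and carry on
            tokens.append(text[0])
            text = text[1:]
    return tokens

print(max_match("#nowthatcherisdead", VOCAB))
# → ['#', 'nowt', 'hat', 'c', 'h', 'e', 'r', 'is', 'dead']
```

Because “nowt” is longer than “now”, the greedy search commits to it and the rest of the string misaligns — greedy methods cannot backtrack out of an early bad choice.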
Greedy methods
● What problems can you think of?
– OOV: out-of-vocabulary words
● 扑热息痛可能有毒 (paracetamol can be poisonous)
● 扑热息痛 – 可能 – 有毒
● New words are guaranteed to arise
– Misalignment● Now that vs. nowt hat
● What recovery strategies?
– One word at a time:
● 扑 – 热 – 息 – 痛 – 可能 – 有毒
Now we have words!
● What do they mean?
● Many senses per word:
– Bank
● How to separate these?
Word Sense Disambiguation
● Goal: to determine a word's intended sense
● WordNet
Overview of noun bank
The noun bank has 10 senses (first 4 from tagged texts)
1. (25) bank -- (sloping land (especially the slope beside a body of water); "they pulled the canoe up on the bank"; "he sat on the bank of the river and watched the currents")
2. (20) depository financial institution, bank, banking concern, banking company -- (a financial institution that accepts deposits and channels the money into lending activities; "he cashed a check at the bank"; "that bank holds the mortgage on my home")
3. (2) bank -- (a long ridge or pile; "a huge bank of earth")
4. (1) bank -- (an arrangement of similar objects in a row or in tiers; "he operated a bank of switches")
The verb bank has 8 senses (first 2 from tagged texts)
● Polysemy is everywhere!
Sense and Distributionality
● “You shall know a word by the company it keeps” (Firth)
● Context gives clues to meaning
– The words before or after help determine the sense
– This is the distributional hypothesis: a word's meaning is reflected in where it is distributed
● Allows adding semantics to unknown words
– Wow, that Blart is faster than my Volvo and half the price!
– One way of addressing language acquisition problem
● Not always super-simple
– Arabic has relatively free word order (FWO)
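One classic way to use context for sense disambiguation is gloss overlap (simplified Lesk): pick the sense whose dictionary gloss shares the most words with the sentence. This is a toy sketch, not NLTK's implementation; the two glosses are paraphrased from the WordNet entries for bank shown earlier.

```python
# Toy Lesk-style word sense disambiguation via gloss/context word overlap.
# Sense labels and gloss word sets are illustrative, adapted from WordNet.
GLOSSES = {
    "bank/river": {"sloping", "land", "slope", "beside", "body",
                   "water", "river", "canoe"},
    "bank/finance": {"financial", "institution", "deposits", "money",
                     "lending", "check", "mortgage"},
}

def lesk(context_words):
    """Return the sense whose gloss overlaps most with the context."""
    overlap = lambda sense: len(GLOSSES[sense] & set(context_words))
    return max(GLOSSES, key=overlap)

print(lesk("he cashed a check at the bank".split()))
# → bank/finance ("check" matches the finance gloss)
print(lesk("they pulled the canoe up on the bank by the river".split()))
# → bank/river ("canoe" and "river" match the river gloss)
```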
Information Retrieval
● How can we satisfy information needs?
● Related to NLP:
– Questions expressed in language
– Information stored linguistically
History
● We do retrieval every day
– Finding oranges
– Finding a clean pair of socks
● Finding information is harder
– Reading a book every time a question arises
– Linear search!
● Libraries
– Rely on indexing…
Explosion
● Now, we do advanced IR dozens of times daily
– First Altavista, Google et al.
– Now ubiquitous
● In your Facebook search box, on your phone, …
Performance
● Simple method:
– Store every document we need to search over
– When we get a query, look through the documents and pick out useful ones
● Problems?
– Slow! Bigger collections take longer to search
Invert it!
● Typical research approach
● Instead of looking up documents one by one…
● Build a list of words and the documents that contain them
● Faster access: what drives the speed of lookup?
– Not the number of documents
– The number of words seen
Invert it!
● Toy index – six documents
– ham: [1,4,6]
– cheap: [2,3,4]
– Aarhus: [1,5,6]
● Sample queries
– cheap ham?
● Document 4
– Aarhus ham?
● Documents 1 & 6
● Will this scale for the user?
Not all words are created equal
● Very common words are useless
– e.g. search for “the”
– Stopwords: words we stop looking for
– Metric: DF – how many documents contain a term
– Any problems?
● Mentioning a term more than once is important
– A passing mention suggests less relevance than regular mentions
– Metric: TF – how many times a term is mentioned
TFIDF
● Term Frequency × Inverse Document Frequency
● Basic ranking metric
– Rewards terms frequent in a document
– Rewards terms that are rare in the dataset
● Definition:
– D: set of documents d; t: search term (token)
– TF(t,d) = number of times t occurs in d
– IDF(t,D) = log10( |D| / |{d ∈ D : d contains t}| )
● Variants: +1 smoothing; log normalisation
Retrieval with TFIDF

Term        TF    DF     |D|    IDF    TF·IDF
the         312   28799  30000  0.018  5.54
in          179   26452  30000  0.055  9.78
cheap       136   179    30000  2.224  302.50
ham         131   231    30000  2.114  276.87
Aarhus      63    98     30000  2.486  156.61
vegetarian  45    142    30000  2.325  104.62
heaven      37    227    30000  2.121  78.48
For the term “the”:
IDF(the) = log10(30000 / 28799) = 0.018
TF(the) = 312
TFIDF = 0.018 × 312 = 5.54
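The definition and worked example can be checked directly in Python (using log10, as in the example; other variants swap in natural log or add smoothing):

```python
import math

def tf_idf(tf, df, n_docs):
    """TF·IDF with base-10 log IDF, as in the worked example."""
    idf = math.log10(n_docs / df)
    return tf * idf

# Reproduce the table rows for "the" and "cheap":
print(round(tf_idf(312, 28799, 30000), 2))  # → 5.54
print(round(tf_idf(136, 179, 30000), 2))    # → 302.5
```

Even though “the” has the highest raw TF, its near-ubiquity (DF = 28799 of 30000) crushes its score; the rare, focused term “cheap” wins by two orders of magnitude.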
TFIDF hacks
● Short, focused document vs. long general description
– Who wins?
– Long document; TF is higher
● What's the speed like?
– Slow: lots of computation per term for every document
● Will an article on goldfish rank well for a fish query?
– Not unless it mentions fish as well as goldfish!
– TFIDF is agnostic to semantics
Vector Space Model
● VSM:
– Each word is a dimension
– Plot each document according to its word frequencies along those dimensions
– Cosine similarity, or Euclidean distance, can give a similarity measure
Vector Space Model
● Imagine dimensions: bird, birds, house, already
● Problems?
● Not all concepts/words are orthogonal
● Poor representation of underlying semantics
● Advantage: fast
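Cosine similarity over term-frequency vectors is a few lines of Python. The toy vocabulary and counts are illustrative, using the dimensions imagined above:

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0

# Dimensions: [bird, birds, house, already]
doc_a = [3, 1, 0, 0]   # mostly about a bird
doc_b = [0, 4, 0, 0]   # only uses "birds"
doc_c = [0, 0, 2, 1]   # about a house

print(round(cosine(doc_a, doc_b), 3))  # low-ish: "bird" and "birds" are
                                       # treated as unrelated dimensions
print(cosine(doc_a, doc_c))            # → 0.0: no shared terms at all
```

This makes the orthogonality problem concrete: two documents about birds look only weakly similar because “bird” and “birds” occupy separate, supposedly independent axes.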
Summary
● NLP: definition
● Challenges
● Pipeline, tokenisation and segmentation
● WSD
● First application of tokens: IR
● Indexing and term weighting
Practical requirements
● Software: Python 3, NLTK, Scikit-learn
● Reading:
– Jurafsky & Martin: “Weighted automata and segmentation” section
– Manning & Schütze: “Corpus-based work”, first two subsections
– Salton et al.: “Extended Boolean Information Retrieval”
– Brin & Page: “The anatomy of a large-scale hypertextual web search engine”
(this is just an old rejected SIGIR paper from the 90s)
http://www.deepsky.com/~merovech/voynich/voynich_manchu_reference_materials/PDFs/jurafsky_martin.pdf
http://ics.upjs.sk/~pero/web/documents/pillar/Manning_Schuetze_StatisticalNLP.pdf