DAY 2: NLU PIPELINE AND TOOLS
Mark Granroth-Wilding
1
NLP PIPELINE: REMINDER
• Complete NLU too complex
• Break into manageable subtasks
• Develop separate tools
• Different applications, different combinations
• Reuse effort on individual tools
• Lot of research effort on subtasks
• Often, tools & data public
[Pipeline diagram: Web text → Low-level processing → Annotated text → Abstract processing → Further annotated text → ... → Structured knowledge base]
4
NLP PIPELINE: REMINDER
• Advantages:
  • Reuse of tools
  • Common work on subtasks, e.g. parsing
  • Evaluation of components
  • Easy combinations for many applications
• Disadvantages of pipeline:
  • Discrete stages: no feedback
  • Improvements on sub-tasks might not benefit pipelines
5
NLG PIPELINE
• Natural Language Generation can also use pipeline
• Same reasons, same drawbacks
• Not so standardized
• Far fewer tools for sub-tasks
More on day 6 about NLG pipeline
6
REUSABLE COMPONENTS
• Defining standard sub-tasks → can reuse models and tools across pipelines
• Improvements on sub-tasks benefit many applications
• Publicly available code/tools/models. E.g.:
Does: tokenization, POS tagging, named-entity recognition, ...
Used by: Adam (question answering), Dragonfire (virtual assistant), EpiTator (infectious disease tracker)
7
DATA IN THE PIPELINE
• Data passed between components varies greatly
• Components perform analysis of input
• Output annotations
  • word
  • sentence
  • discourse
  • document
8
LEVELS OF ANALYSIS
• Sub-word
  • Character (grapheme): A l i c e  w a l k s
  • Phoneme, linguistic sound unit: æ l i s  w ɔ k s
  • Morpheme, smallest meaningful unit: alice walk -s
• Word: alice walks quickly
• Phrase, clause: alice walks, then she runs
• Sentence, utterance
• Paragraph, section, discourse turn, ...
• Document
• Document collection, corpus
9
TOOLS AND TOOLKITS
• Pipeline allows component reuse
• Tools for subtasks can be shared
• Many toolkits provide standard components
• Compare:
  • accuracy
  • speed
  • pre-trained models (domain, language)
11
EXAMPLE: STANFORD CORENLP
Languages: en, de, fr, ar, zh
Components: tokenization, POS tagging, lemmatization, NER, parsing, dependency parsing, coreference, sentiment, ...
Java / command line / API. Open source. (Demo coming up...)
12
EXAMPLE: SPACY
Languages: en, de, fr, es, it
Components: tokenization, POS tagging, NER, dependency parsing
Fewer tools, different languages. Python. Open source. Very fast. (See assignments)
13
EXAMPLE: GENSIM
• Specialized tool
• Topic modelling(see later in course)
• Language independent
• Late in pipeline: abstract analysis
• Use other tools/toolkits for earlier stages:
  • tokenization
  • lemmatization
  • etc.
14
EXAMPLE: GENSIM
Text corpus (documents) → [Sentence split] → Sentences → [Tokenize] → Tokens → [Lemmatize] → Lemmas → [Count document words] → Bags of words → [Train topic model] → Trained model parameters
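The counting stage of this pipeline can be sketched in a few lines of plain Python. This is an illustrative toy, not Gensim itself: the documents are invented, and the lemmatizer is a tiny hand-written lookup standing in for a real tool from an earlier pipeline stage.

```python
from collections import Counter

# Toy corpus: in a real pipeline, earlier tools would have already
# segmented, tokenized and lemmatized this text before Gensim sees it.
documents = [
    "The robots walk. The robot walked quickly.",
    "A robot talks to the robots.",
]

# Stub lemmatizer: a tiny hand-written lookup, purely for illustration
LEMMAS = {"robots": "robot", "walked": "walk", "walks": "walk", "talks": "talk"}

def tokenize(text):
    # Crude tokenization: lowercase, split on whitespace, strip punctuation
    return [tok.strip(".,") for tok in text.lower().split()]

def lemmatize(token):
    return LEMMAS.get(token, token)

# "Count document words" stage: one bag of words per document
bags = [Counter(lemmatize(t) for t in tokenize(doc)) for doc in documents]
print(bags[0]["robot"])  # 'robots' + 'robot' both count as 'robot' -> 2
```

The bags of words are exactly what a topic-model trainer consumes: word identity and frequency per document, with word order discarded.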
15
NLU SUB-TASKS
• Some typical sub-tasks
• Mostly early in pipeline: “low-level”
• Brief intro: more on some later
1. Speech recognition
2. Text recognition
3. Morphology
4. POS-tagging
5. Parsing
6. Named-entity recognition
7. Semantic role labelling
8. Pragmatics
[Pipeline diagram: Language data → Low-level processing (low-level) → Annotated text → Abstract processing (more abstract) → Further annotated text → ... → Structured knowledge repr. (abstract)]
18
SPEECH RECOGNITION
• Understanding human speech
Speech (audio signal) → Text:
"Finally a small settlement loomed ahead. It was of the familiar style of toy-building-block architecture affected by the ant-men, and..."
• NL interfaces
• Noisy: challenges for NLP further on
• Components:
  • Acoustic model: audio → text
  • Language model: expectations about text
More later...
19
TEXT RECOGNITION
Optical character recognition (OCR)
• Understanding printed/written text
Scanned document
Finally a small settlement loomed
ahead. It was of the familiar style of
toy-building-block architecture...
Text
• E.g. digitizing libraries
• Huge variation in how letters appear:
• Earlier methods: image → character classification, one character at a time
• Recent methods: take context into account
20
TOKENIZATION
• Many methods use word-based analysis
• What is a word?
• Often split text → word (token) sequence: tokenization
First approximation: split on spaces
Arkilu pursed her lips in thought.
Arkilu / pursed / her / lips / in / thought / .
21
TOKENIZATION
First approximation: split on spaces
Often not good enough:
“Really meaning,” Arkilu interposed, . . .
“ / Really / meaning / , / ” / Arkilu / interposed / , . . .
Some other tricky cases:
black-furred, to-day, N.Y.U., 5,000
Language-specific
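The difference between the first approximation and a slightly smarter tokenizer can be shown with a short regular expression. This is only a sketch: the pattern below is hand-written for these examples and, as the slide says, real tokenizers are language-specific and handle many more cases.

```python
import re

# First approximation: split on spaces
text = '"Really meaning," Arkilu interposed, ...'
print(text.split())   # punctuation stays glued to the words

# Slightly better: runs of letters/digits (with internal hyphens or
# apostrophes) are tokens; each other punctuation mark is its own token.
# A rough sketch only: abbreviations like N.Y.U. and numbers like 5,000
# would still need extra rules.
TOKEN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize(text))
# ['"', 'Really', 'meaning', ',', '"', 'Arkilu', 'interposed', ',', '.', '.', '.']
print(tokenize("black-furred, to-day"))
# ['black-furred', ',', 'to-day']
```

Note how the hyphen clause keeps "black-furred" and "to-day" whole, while the comma is split off, reproducing the behaviour the slide asks for.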
22
MORPHOLOGY
• Analysis of internal structure of words
unhelpfulness → un+help+ful+ness
• Splitting words into stems, affixes, compounds
thunderstorm → thunder+storm
• Useful for:
  • categorization using morphological features: -s → plural
  • text normalization: robot, robots, robot’s → robot
  • generation: robot+plur → robots; man+plur → men
More on specific methods later today and tomorrow.
23
POS TAGGING
• Part of speech: ancient form of shallow grammatical analysis
• Token-level categories
• Distinguish syntactic function of words in broad classes
For the present (noun), we are. . . vs. The present (adjective) situation. . .
Other ambiguity remains: He gave a present (also a noun)
Common NLP subtask: part-of-speech (POS) tagging
More on POS-tagging methods and statistical models tomorrow
24
PARSING
• Syntax: models of structures underlying NL sentences
• Captures dependencies between words
• Analysis required to interpret meaning (semantics)
Alice who saw the man, who pushed Bob, who ate the apple, walks quickly
• Parsing: inference of structure underlying sentence
25
PARSING
• Parsing: inference of structure underlying sentence
• Useful for:
  • Modelling human language processing
  • Disambiguation of sentence meaning
  • Structuring statistical models
  • Splitting sentences into meaningful units (chunks)
  • Much more!
More on syntax, parsing and statistical models on day 4.
26
NAMED ENTITY RECOGNITION
• Named entities: references to people, organisations, products, etc.
• Identification important for extracting structured data
• Can link to known entities in knowledge base
• Many practical uses
Example
DARPA hopes the ALIAS programme will produce an automated system that will be cost effective. In addition to the 737 simulator and the Cessna light aircraft, Aurora has also flown a Diamond DA42 light aircraft and a Bell UH-1 helicopter.
NER comes up again on day 8.
27
SEMANTIC ROLE LABELLING
• Semantic roles capture key parts of the structure of meaning in a sentence
• Who did what to whom?
Alice saw the man that Bob pushed
Alice is agent of saw; man is patient of saw
man is patient of pushed
• Semantic role labelling: inference of these relationships
• More abstract than syntax
• Less structured than formal semantics
28
PRAGMATICS
• Pragmatics concerns meaning in broader context
• Includes questions of e.g.:
  • conversational context
  • speaker’s intent
  • metaphorical meaning
  • background knowledge
• Abstract analysis
• Depends heavily on other types of analysis seen here
• Many unsolved problems
• Some tasks tackled in NLP: late in pipelines
Some aspects of pragmatics covered on day 10.
29
PIPELINE EXAMPLE REVISITED
In small groups
• Repeat exercise from yesterday
• Assume you:
  • are a computer
  • have database of logical/factual world knowledge
  • have lots of rules/statistics about English
• What is involved in:
  • extracting & encoding relevant information?
  • answering the question?
• Formulate as pipeline
• Don’t worry about correct component names!
31
PIPELINE EXAMPLE REVISITED
In small groups
A robotic co-pilot developed under DARPA’s ALIAS programme has already flown a light aircraft.
What agency has created a computer that can pilot a plane?
32
PIPELINE EXAMPLE REVISITED
In small groups
• Let’s see some pipelines!
33
EXAMPLE PIPELINE
Information Extraction
Raw text → [Sentence segmentation] → Sentences → [Tokenization] → Tokenized sentences → [Part-of-speech tagging] → POS-tagged sentences → [Entity detection] → Chunked sentences → [Relation detection] → Relations
Text input:
Facebook chairman Mark Zuckerberg was summoned to testify in front of EU lawmakers.
Relation output:
(Mark Zuckerberg, chairman-of, Facebook)
. . .
More on later stages on day 8.
Example from NLTK Book: https://www.nltk.org/book/ch07.html
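A drastically simplified end-to-end sketch of this pipeline fits in a few lines: a single hand-written pattern stands in for the entity-detection and relation-detection stages. The pattern is invented for this one example; real systems (like the NLTK chapter linked above) use trained components for each stage.

```python
import re

# Toy relation extraction: one hand-written pattern in place of real
# entity and relation detection. Matches '<Org> chairman <Name>' where
# both sides are capitalized, and the name is followed by 'was'/'is'/','.
text = ("Facebook chairman Mark Zuckerberg was summoned to testify "
        "in front of EU lawmakers.")

PATTERN = re.compile(r"([A-Z]\w+) chairman ((?:[A-Z]\w+ ?)+?)\b(?= was| is|,)")

def extract_relations(text):
    return [(m.group(2).strip(), "chairman-of", m.group(1))
            for m in PATTERN.finditer(text)]

print(extract_relations(text))
# [('Mark Zuckerberg', 'chairman-of', 'Facebook')]
```

The output triple matches the slide; the fragility of the pattern (it fails on "chairman of Facebook, Mark Zuckerberg") is precisely why the real pipeline runs tokenization, tagging and chunking first.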
34
SOME OTHER TOOLKITS
OpenNLP
Tokenization, POS tagging, lemmatization, parsing, NER, ...
Pretrained models (some components): en, de, es, nl, pt, se.
Java / command line
NLTK
Tokenization, POS tagging, lemmatization, parsing, NER, language modelling, WSD, ...
Some pretrained models, mostly en.
Python. Primarily developed for teaching
36
SOME OTHER TOOLKITS
TextBlob
Tokenization, POS tagging, lemmatization, simple parsing
Models: en, fr, de.
Python. Easy to use
Flair
Tokenization, POS tagging, NER, ...
Pretrained models: mostly en. Some de, fr.
Python. Fast. SOTA for many tasks
37
LIVE DEMO
Stanford CoreNLP
• Online demo with visualization: http://corenlp.run/
• Many tools can be selected
• Some run whole pipelines: e.g. OpenIE (information extraction), sentiment
• Let’s try some examples. . .
38
SUB-TASKS COMING UP
More on some sub-tasks later in course:
• Morphology: today and tomorrow
• POS tagging: day 3
• Syntactic parsing: day 4
• NLG sub-tasks / components: day 6
• Word-level (lexical) meaning: day 7
• Document-level meaning: day 7
• Named-entity recognition: day 8
• Discourse, pragmatics: day 10
39
REGULAR EXPRESSIONS AND FSAs
Coming up today:
• Brief reminder of regular expressions
• Theory and notation for finite-state automata
• Introduction to morphological analysis
Groundwork for tomorrow, when we put these things together
42
REGULAR EXPRESSIONS
Brief reminder
Pattern              Matches
/radio/              ‘It is the radio. Know then, O Queen
/[Rr]adio/           late that night, the Radio Man
/[sbt]ack/           Crota was already back in the fray
/[0-9]/              showed the time to be 1025;
/radio sets?/        powerful radio sets invented by the beast
                     returning to the hidden radio set, whence
/([Dd]it|[Dd]ah)/    ‘Dah-dit-dah-dit dah-dah-dit-dah.
43
REGULAR EXPRESSIONS
Brief reminder
Pattern                  Matches
/final*/                 we finally restored it,
/final+/                 we finally restored it,
/radio./                 conventional radioese, I repeated
/.(it|ah)(-.(it|ah))*/   ‘Dah-dit-dah-dit dah-dah-dit-dah.
• Need more of a reminder? Jurafsky & Martin 3, 2.1
• Some common regex features are not technically regular, e.g. memory (backreferences)
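The patterns in the tables above can be tried directly with Python's `re` module, which supports this regex syntax:

```python
import re

# Character classes and optional characters, as in the first table
assert re.search(r"radio", "It is the radio.")
assert re.search(r"[Rr]adio", "late that night, the Radio Man")
assert re.search(r"[0-9]", "showed the time to be 1025;")
assert re.search(r"radio sets?", "the hidden radio set, whence")

# * is zero-or-more, + is one-or-more of the preceding item
assert re.search(r"final*", "we finally restored it")  # 'final' in 'finally'
assert re.search(r"final+", "we finally restored it")

# . matches any single character
print(re.search(r"radio.", "conventional radioese").group())  # 'radioe'

# Alternation, as in /([Dd]it|[Dd]ah)/
print(re.findall(r"[Dd]it|[Dd]ah", "Dah-dit-dah-dit"))
# ['Dah', 'dit', 'dah', 'dit']
```

`re.search` finds a match anywhere in the string (or returns `None`), while `re.findall` returns every non-overlapping match.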
44
REGULAR LANGUAGES
• Regular expression defines a regular language
• Set of strings accepted by regex
L(r) = { s | accept(r, s) }
• Language is regular iff ∃ regex for it
• Regular languages describe some aspects of NL
• Useful tool for some NLP
• Particularly: preprocessing & early stages of analysis
• See limitations tomorrow
45
FINITE-STATE AUTOMATA
• Another string-testing formalism:finite-state automaton (FSA)
• Process string by transitioning between states
State 1 → state 2 on ‘a’; state 2 → state 2 on ‘a’; state 2 → state 3 on ‘h’ (state 3 accepting)
{ ‘ah’, ‘aah’, ‘aaah’, ... }
• Equivalent regex: /a+h/
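This FSA can be simulated directly: transitions as a dictionary keyed by (state, symbol), with any missing entry meaning the automaton rejects. A minimal sketch of the /a+h/ machine:

```python
# Direct simulation of the FSA for /a+h/: states 1-3, where 3 accepts.
TRANSITIONS = {
    (1, "a"): 2,   # first 'a' moves to state 2
    (2, "a"): 2,   # further 'a's loop in state 2
    (2, "h"): 3,   # 'h' moves to the accepting state
}
ACCEPTING = {3}

def accepts(string, start=1):
    state = start
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:
            return False          # no transition defined: reject
    return state in ACCEPTING     # accept only if we END in state 3

print([s for s in ["ah", "aah", "aaah", "h", "aha"] if accepts(s)])
# ['ah', 'aah', 'aaah']
```

Note that 'aha' is rejected even though it reaches the accepting state midway: acceptance depends on the state when the whole input is consumed.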
46
ANOTHER FSA
State 1 → state 2 on ‘a’; state 2 → state 2 on ‘a’; state 2 → state 3 on ‘h’ (accepting); state 3 → state 1 on ‘, ’
{ ‘ah, ah’, ‘ah, aah’, ‘aaaaah, ah, aah’, ... }
• Equivalent regex: /a+h(, a+h)*/
47
FSAs & REGEXES
• Acceptance by FSA ≡ acceptance by regex
• Every FSA has equivalent regex
Regular languages ≡ FSAs ≡ Regular expressions (≡ Regular grammars)
48
FINITE STATE TRANSDUCERS
• Extend FSA to output something: not just YES/NO
• Each accepting edge also outputs
• Finite-state transducer
• Same strings/language as FSA
• Output may be:
  • translation
  • analysis
  • spelling transformation
  • . . .
49
FSA → FST
State 1 → state 2 on ‘dit’ or ‘dah’; state 2 → state 1 on ‘-’ or ‘, ’ (state 2 accepting)
{ ‘dit-dah-dah, dah-dit’, ... }
/(dit|dah)(-(dit|dah))*(, (dit|dah)(-(dit|dah))*)*/
50
FSA → FST
Transitions: state 1 → state 2 on dit : 0 and dah : 1; state 2 → state 1 on - : ε and , : /
• Each edge is labelled input : output
• Translates
dit-dah-dah, dah-dit, dit ⇒ 0-1-1/1-0/1
• Common use in NLP: analysis of internal word structure → morphology
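The transducer is a small extension of the FSA simulation: each transition returns the next state plus an output string. The sketch below follows the edge labels above (dit : 0, dah : 1), except that the '-' separator is passed through unchanged rather than as ε, so the separators survive in the output as in the worked translation.

```python
# Sketch of a finite-state transducer: same machinery as the FSA,
# but each edge also emits an output string ('' plays the role of ε).
FST = {
    (1, "dit"): (2, "0"),
    (1, "dah"): (2, "1"),
    (2, "-"):   (1, "-"),   # separator kept in the output
    (2, ", "):  (1, "/"),   # ', ' is translated to '/'
}

def transduce(symbols, start=1):
    state, output = start, []
    for sym in symbols:
        state, out = FST[(state, sym)]   # KeyError = input rejected
        output.append(out)
    return "".join(output)

# 'dit-dah-dah, dah-dit, dit' as a pre-split symbol sequence:
symbols = ["dit", "-", "dah", "-", "dah", ", ", "dah", "-", "dit", ", ", "dit"]
print(transduce(symbols))  # 0-1-1/1-0/0
```

Swap the output strings for morphological analyses instead of digits and this is the core mechanism behind finite-state morphology, tomorrow's topic.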
51
MORPHOLOGY: SOME CONCEPTS
• Morpheme: smallest grammatical unit in language
• Affix: morpheme that occurs only together with others
• Word = stem [ + affixes ]
radios → radio + s
• Compound: word with multiple stems
thunderstorms → thunder + storm + s
52
MORPHOLOGY: SOME CONCEPTS
Types of affix:
• prefix: un+help+ful
• suffix: taste+ful, taste+ful+ness
• infix: internal affix (e.g. Arabic)
• circumfix: prefix & suffix, e.g. German: ge+kauf+t
53
MORPHOLOGY: SOME CONCEPTS
Two types of morphology:
1. Inflectional: regular patterns for word classes
  • Changes grammatical roles
  • E.g. noun cases: kauppa, kauppa+a, kaupa+n, ...
2. Derivational: creates new words
  • Changes meaning
  • E.g. diminutive suffix: tuuli → tuulo+nen
54
MORPHOLOGICAL AMBIGUITY
• Morpheme-level:
  • Morphemes can have multiple interpretations/uses
  • table: noun or verb
  • +s: plural noun or 3rd-person singular verb
• Structural:
  • Words may be split in multiple ways
  • unionised → union+ise+ed / un+ion+ise+ed
• Bracketing:
  • Same split, different bracketing structures
  • unlockable → (un+lock)+able / un+(lock+able)
55
USES IN NLP
A few uses of morphology in NLP:
• Morphosyntactic categorization (rough POS tagging)
• Morphological features
• Stemming/lemmatization
• Generation: apply syntax, features, agreement to base forms
the/det loyal/adj princes/noun occupied/verb the/det throne/noun in/prep his/prn absence/noun
56
USES IN NLP
the/def-sg|def-pl loyal princes/pl occupied/past|pst-prt the/def-sg|def-pl throne/sg in his/m-sg absence/sg
56
USES IN NLP
the loyal princes occupied the throne in his absence
→ the loyal prince occupy the throne in his absence
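The lemmatization step shown here can be sketched naively: an exception list for irregular forms plus crude suffix-stripping rules. This is purely illustrative; real lemmatizers use morphological analysis and large lexicons rather than a hand-picked dictionary like this one.

```python
# Naive, purely illustrative lemmatizer: irregular-form lookup first,
# then crude suffix stripping. Real tools do proper morphological analysis.
IRREGULAR = {"men": "man", "occupied": "occupy", "princes": "prince"}

def lemmatize(word):
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, repl in [("ies", "y"), ("s", ""), ("ed", "")]:
        # length guard keeps short words like 'his' intact
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + repl
    return word

sentence = "the loyal princes occupied the throne in his absence"
print(" ".join(lemmatize(w) for w in sentence.split()))
# the loyal prince occupy the throne in his absence
```

Note that 'occupied' has to go in the exception list: blind suffix stripping would yield 'occupi', which is exactly the kind of error finite-state morphology avoids.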
56
USES IN NLP
the+pl loyal prince+pl occupy+past the+sg throne+sg in prn+pos+masc+sg absence+sg
→ the loyal princes occupied the throne in his absence
56
SUMMARY
• NLU typically broken into subtasks
• Pipeline of components
• Complex applications reuse standard tools, models, datasets
• Components annotate linguistic data
• Many levels of analysis
• Comparing ready-made tools for subtasks
• Some typical sub-tasks / pipeline components
• Example pipelines
• Some available toolkits
• Refresher on regular expressions and FSAs
• Intro to morphology
57
READING MATERIAL
Introductions to:
• Speech recognition: J&M2 p319-21
• Morphology: J&M3 2.4.4 (J&M2 p79-80)
• POS tagging: J&M3 8.1, 8.3 (J&M2 p157-8, 167-9)
• Syntax & parsing: J&M3 11.1 (J&M2 p419-20, 461)
• NER: J&M3 17.1 (J&M2 p761-6)
Online introduction to Stanford CoreNLP toolkit
58
NEXT UP
After lunch: Practical assignments in BK107
9:15 – 12:00    Lectures
12:00 – 13:15   Lunch
13:15 – ∼14:00  Introduction
14:00 – 16:00   Practical assignments
• Building an NLP pipeline
• Errors propagating through pipeline
• Comparison of tools
• A complete application
59