Overview of text mining and NLP (+software)
Text mining and natural language processing
Florian Leitner, Technical University of Madrid (UPM), Spain
Tyba, Madrid, ES, 12th of June, 2015
Is language understanding & generation key to artificial intelligence?
• “Her” (Samantha), movie, 2013
• “The Singularity: ~2030”, Ray Kurzweil, Google’s director of engineering
• “Watson” & “CRUSH”, IBM’s bet on the future: data streams, mainframes & AI
CRUSH (“Criminal Reduction Utilizing Statistical History”; IBM, reality) aims to “predict crimes before they happen”, echoing the Precogs of Minority Report (movie). If? When?
Cognitive computing: “processing information more like a human than a machine”
Examples of text mining and natural language processing applications.
• Spam filtering
• Document classification
• Social media/brand monitoring
• Opinion mining (& text classification)
• Search engines
• Information retrieval
• Plagiarism detection
• Content-based recommendation systems
• Watson (Jeopardy!, IBM)
• Question answering
• Spelling correction
• Language modeling
• Website translation (Google)
• Machine translation
• Digital assistants (MS’ Clippy)
• Dialog systems (“Turing test”)
• Siri (Apple) and Google Now
• Speech recognition & language understanding
• Event detection (in e-mails)
• Information extraction
Text Mining ⬌ Language Processing
Relevant FOSS (only!) libraries will be down here… (MIT, ALv2, GPL, BSD, …)
Concepts & Terminology
Document and text classification/clustering
[Figure: documents as points in the space of the first two principal components; left, pairwise document distances; right, clusters with their centroids.]
Supervised (“Learning to classify from examples”, e.g., spam filtering)
vs. Unsupervised (“Exploratory grouping”, e.g., topic modeling)
LIBSVM
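A minimal sketch of both settings using scikit-learn (whose LinearSVC builds on liblinear, a sibling of the LIBSVM library noted above); the toy documents and labels are invented purely for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.cluster import KMeans

docs = ["cheap pills buy now", "meeting at noon tomorrow",
        "win a free prize now", "quarterly report attached"]
labels = ["spam", "ham", "spam", "ham"]  # toy labels for the supervised case

X = TfidfVectorizer().fit_transform(docs)  # texts -> sparse word vectors

# Supervised: learn to classify from labeled examples (e.g., spam filtering).
clf = LinearSVC().fit(X, labels)
print(clf.predict(X[:1]))  # should recover 'spam' for the first document

# Unsupervised: exploratory grouping of documents around cluster centroids.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)  # one cluster assignment per document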
Words, Tokens, and N-Grams/Shingles
Original sentence: “This is a sentence.”
Token n-grams (“k-shingling”):
  tokens/shingles (unigrams): [This] [is] [a] [sentence] [.]
  2-shingles (bigrams): [This is] [is a] [a sentence] [sentence .]
  3-shingles (trigrams): [This is a] [is a sentence] [a sentence .]
Character n-grams: e.g., all trigrams of the word “sentence”: [sen, ent, nte, ten, enc, nce]
NB: “tokenization”, i.e. splitting, can be character-based, use regular expressions, be probabilistic, …
Snag: the terms “shingle”, “token”, and “n-gram” are not used consistently… but “n-gram” and “token” are far more common!
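A minimal sketch of both n-gram flavors in plain Python (the ngrams helper is mine, not from the slide; slicing makes it work for token lists and character strings alike):

def ngrams(seq, n):
    """All consecutive runs of n items: token n-grams or character n-grams."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

tokens = "This is a sentence .".split()
print(ngrams(tokens, 2))      # token bigrams: [['This', 'is'], ['is', 'a'], ...]
print(ngrams("sentence", 3))  # character trigrams: ['sen', 'ent', 'nte', ...]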
Lemmatization, Part-of-Speech (PoS) tagging, and Named Entity Recognition (NER)
Token Lemma PoS NER
Constitutive constitutive JJ O
binding binding NN O
to to TO O
the the DT O
peri-κ peri-kappa NN B-DNA
B B NN I-DNA
site site NN I-DNA
is be VBZ O
seen see VBN O
in in IN O
monocytes monocyte NNS B-cell
. . . O
Linguistic annotations of tokens (used to train automated classifiers).
Penn Treebank: the de facto standard PoS tagset {NN, JJ, DT, VBZ, …}
B-I-O chunk encoding: a Begin token followed by Inside tokens forms a (relevant) chunk; Outside tokens lie outside any chunk.
Common alternatives: I-O, I-E-O, B-I-E-W-O (E = End token, W = single-Word (unigram) chunk)
Stanford CoreNLP, FreeLing, FACTORIE, and many more…
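To make the B-I-O encoding concrete, here is a small decoder (my own helper, not from the slide; it assumes well-formed tag sequences) that collects tagged tokens back into chunks:

def bio_chunks(tokens, tags):
    """Collect maximal B-I runs into (entity type, token list) chunks."""
    chunks = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            chunks.append((tag[2:], [token]))  # open a new chunk
        elif tag.startswith("I-") and chunks:
            chunks[-1][1].append(token)        # extend the open chunk
        # "O" tokens are outside any chunk and simply skipped
    return chunks

tokens = ["peri-κ", "B", "site", "is", "seen", "in", "monocytes"]
tags = ["B-DNA", "I-DNA", "I-DNA", "O", "O", "O", "B-cell"]
print(bio_chunks(tokens, tags))
# [('DNA', ['peri-κ', 'B', 'site']), ('cell', ['monocytes'])]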
Word vectors and inverted indices
[Figure: two texts plotted as word-count vectors; each token/word is a dimension. Similarity(T1, T2) := cos(T1, T2)]
Comparing text vectors: e.g., cosine similarity.
Text vectorization: the inverted index.
Text 1: He that not wills to the end neither wills to the means.
Text 2: If the mountain will not go to Moses, then Moses must go to the mountain.
tokens Text 1 Text 2
end 1 0
go 0 2
he 1 0
if 0 1
means 1 0
Moses 0 2
mountain 0 2
must 0 1
not 1 1
that 1 0
the 2 2
then 0 1
to 2 2
will 2 1
INDRI: “Search engine basics”
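A hand-rolled sketch of that comparison over the two example texts (mine, not from the slide; note that it skips the lemmatization the table above applies, so “wills” and “will” stay distinct):

import math
from collections import Counter

t1 = Counter("he that not wills to the end neither wills to the means".split())
t2 = Counter("if the mountain will not go to moses then moses must go to the mountain".split())

def cosine(a, b):
    """cos(T1, T2): dot product of the count vectors over their norms."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb)

print(cosine(t1, t2))  # shared dimensions here: 'not', 'the', 'to'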
Inverted indices and the central dogma of machine learning
y = h_θ(X): the transposed inverted index Xᵀ (n “texts” × p n-grams, i.e., instances/observations × variables/features), multiplied by the parameters θ (one per feature), yields y (one value per instance): a rank, class, expectation, probability, descriptor*, …
(Hyperparameters are settings that control the learning algorithm.)
“Nonparametric” models instead grow their number of parameters per instance rather than per feature.
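A toy numeric sketch of that picture (the numbers are mine) for a linear hypothesis h_θ(X) = Xθ:

import numpy as np

# 3 "texts" (n instances) x 4 n-grams (p features): the transposed inverted index.
X = np.array([[1, 0, 2, 0],
              [0, 1, 0, 2],
              [1, 1, 1, 0]])
theta = np.array([0.5, -0.2, 0.1, 0.3])  # one parameter per feature

y = X @ theta  # one output (rank, class score, probability, ...) per instance
print(y)       # [0.7 0.4 0.4]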
The curse of dimensionality (R.E. Bellman, 1961) [the inventor of dynamic programming]
• p ≫ n (far more tokens/features than texts/instances)
• Inverted indices (X) are (discrete) sparse matrices.
• Even with millions of training examples, unseen tokens will keep popping up during evaluation or in production.
‣ In such a high-dimensional hypercube, most instances are closer to a face of the cube (“nothing”, the outside) than to any other instance.
✓ Remedy: (feature) dimensionality reduction, the “blessing of non-uniformity.”
• feature extraction (compression): PCA/LSA (projection), factor analysis (regression), compression, auto-encoders & deep learning (compression & embedding), … (see the sketch after this list)
• feature selection (elimination): LASSO (regularization), SVM (support vectors), Bayesian nets (structure learning), locality-sensitive hashing, random projections, …
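A minimal feature-extraction sketch using LSA (truncated SVD) in scikit-learn; the toy corpus and the choice of two latent dimensions are my own:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat", "the dog sat on the log",
        "stocks fell on weak earnings", "markets rose on strong earnings"]
X = CountVectorizer().fit_transform(docs)  # sparse inverted index, n x p

lsa = TruncatedSVD(n_components=2)  # project p token dimensions down to 2
Z = lsa.fit_transform(X)            # dense n x 2 matrix of latent features
print(Z.shape)                      # (4, 2): similar documents end up nearby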
Applications
Google’s review summaries: Opinion mining (“sentiment” analysis).
Don’t do it, please… ;-) (If you must: see document and text classification software.)
Polarity of sentiment keywords in IMDB.
Christopher Potts. On the negativity of negation. 2011.
“not good”
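One widely used trick from this line of work is Potts-style negation marking (sketched below with my own helper): suffix every token between a negation word and the next punctuation mark with _NEG, so that “not good” contributes the negative-leaning feature good_NEG instead of the misleadingly positive good.

import re

NEGATORS = {"not", "no", "never", "n't", "cannot"}

def mark_negation(tokens):
    """Append _NEG to tokens inside a negation's scope (until punctuation)."""
    out, in_scope = [], False
    for tok in tokens:
        if re.fullmatch(r"[.,:;!?]", tok):
            in_scope = False  # punctuation closes the negation scope
            out.append(tok)
        else:
            out.append(tok + "_NEG" if in_scope else tok)
            if tok.lower() in NEGATORS:
                in_scope = True
    return out

print(mark_negation("this movie is not good at all .".split()))
# ['this', 'movie', 'is', 'not', 'good_NEG', 'at_NEG', 'all_NEG', '.']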
Language understanding: Parsing and semantic analysis.
Disambiguation, everywhere: coreference (anaphora) resolution, named entity recognition, and entity grounding. (Apple Siri)
(Parsing theory: L. Tesnière, N. Chomsky)
Stanford, BLLIP (C-J), Malt, LinkGrammar, RedShift, and many more…
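A minimal parse-and-NER sketch with spaCy (an MIT-licensed FOSS library, though not one of the tools named on the slide); the small English model is an assumption:

import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline (PoS, parser, NER)
doc = nlp("Is Princeton really good for you?")

for tok in doc:
    # each token, its dependency relation, and its syntactic head
    print(tok.text, tok.dep_, tok.head.text)

for ent in doc.ents:
    # entity recognition; grounding "Princeton" (university? town?) is harder
    print(ent.text, ent.label_)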
Automatic text summarization is very difficult, because of…
• Variance/human agreement: When is a summary “correct”?
• Coherence: providing discourse structure (text flow) to the summary.
• Paraphrasing: important sentences are repeated, but with different wordings.
• Implied messages: “the Dow Jones index rose 10 points” → “the economy is thriving”.
• Anaphora (coreference) resolution: very hard, but crucial.
Image Source: www.lexalytics.com
Lex[Page]Rank (JUNG), sumy, TextTeaser (the author got hired by Google…)
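A crude extractive sketch (my own simplification; LexRank/TextRank proper build a sentence similarity graph): score each sentence by cosine similarity to the whole document and keep the top ones.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The economy is slumping.",
    "Unemployment is on the rise.",
    "A local cat show drew large crowds.",
]
vec = TfidfVectorizer()
S = vec.fit_transform(sentences)            # one tf-idf vector per sentence
doc = vec.transform([" ".join(sentences)])  # the whole document as one vector

scores = cosine_similarity(S, doc).ravel()
top = sorted(scores.argsort()[::-1][:2])    # two best sentences, original order
print([sentences[i] for i in top])          # keeps text flow (coherence!)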
Machine translation: Deep learning with auto-encoders.
Different languages…
‣ have only one gender (en) or use opposing genders (es vs. de: el/die, la/der, …/das);
‣ have different verb placements (es⬌de);
‣ have different concepts of verbs (Latin, Arabic, CJK);
‣ use different tenses (en⬌de);
‣ have different word orders (Latin, Arabic, CJK).
DL4J
Question answering: The Champions League of TM & NLP.
Biggest issue: statistical inference
IBM Watson, WolframAlpha
Category: Oscar-Winning Movies
Hint: Its final scene includes the line “I do wish we could chat longer, but I’m having an old friend for dinner.”
Answer: The Silence of the Lambs
All men are mortal. Socrates probably is a man…
…therefore, Socrates might be mortal. (cognitive computing)
Information extraction: Knowledge mining for molecular biology.
[Figure: a knowledge-mining pipeline. Articles from the WWW and biological repositories feed article classification, then named entity recognition, entity mapping (grounding: e.g., Cdk5 in rat, TaxID 10116, UniProt Q03114), entity associations, and relationship extraction; the results (binary interactions, experimental methods, relationship annotations) flow back into biological repositories, supported by ontologies & thesauri, biological models, and short factoid question answering.]
MITIE, OpenDMAP, ClearTK
Text mining and language processing is all about resolving ambiguities.
Anaphora resolution: Carl and Bob were fighting: “You should shut up,” Carl told him.
Part-of-Speech tagging: The robot wheels out the iron.
Paraphrasing: Unemployment is on the rise. vs. The economy is slumping.
Entity recognition & grounding: Is Princeton really good for you?