Overview of text mining and NLP (+software)
Text mining and natural language processing
Florian Leitner, Technical University of Madrid (UPM), Spain
Tyba, Madrid, ES, 12th of June, 2015
Is language understanding & generation key to artificial intelligence?
• “Her” (Samantha), movie, 2013
• “The Singularity: ~2030”, Ray Kurzweil, Google’s director of engineering
• “Watson” & “CRUSH”, IBM’s bet on the future: data streams, mainframes & AI
CRUSH (“Criminal Reduction Utilizing Statistical History”; IBM, reality) aims to “predict crimes before they happen”, echoing the Precogs of Minority Report (movie). If? When?
Cognitive computing: “processing information more like a human than a machine”
Examples of text mining and natural language processing applications.
• Spam filtering
• Document classification
• Social media/brand monitoring
• Opinion mining (& text classification)
• Search engines
• Information retrieval
• Plagiarism detection
• Content-based recommendation systems
• Watson (Jeopardy!, IBM)
• Question answering
• Spelling correction
• Language modeling
• Website translation (Google)
• Machine translation
• Digital assistants (MS’ Clippy)
• Dialog systems (“Turing test”)
• Siri (Apple) and Google Now
• Speech recognition & language understanding
• Event detection (in e-mails)
• Information extraction
Text Mining ⬌ Language Processing
Relevant FOSS (only!) libraries will be down here… (MIT, ALv2, GPL, BSD, …)
Concepts & Terminology
Document and text classification/clustering
[Figure: documents as points in the space of the first two principal components; left, pairwise document distances; right, clusters with their centroids.]
Supervised (“Learning to classify from examples”, e.g., spam filtering)
vs. Unsupervised (“Exploratory grouping”, e.g., topic modeling)
LIBSVM
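A minimal sketch of both settings using scikit-learn (whose LinearSVC builds on liblinear, a sibling of the LIBSVM library noted above); the toy documents and labels are invented purely for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.cluster import KMeans

docs = ["cheap pills buy now", "meeting at noon tomorrow",
        "win a free prize now", "quarterly report attached"]
labels = ["spam", "ham", "spam", "ham"]  # toy labels for the supervised case

X = TfidfVectorizer().fit_transform(docs)  # texts -> sparse word vectors

# Supervised: learn to classify from labeled examples (e.g., spam filtering).
clf = LinearSVC().fit(X, labels)
print(clf.predict(X[:1]))  # should recover 'spam' for the first document

# Unsupervised: exploratory grouping of documents around cluster centroids.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)  # one cluster assignment per document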
Words, Tokens, and N-Grams/Shingles
Original sentence: “This is a sentence.”
Token n-grams (“k-shingling”):
  tokens/shingles (unigrams): [This] [is] [a] [sentence] [.]
  2-shingles (bigrams): [This is] [is a] [a sentence] [sentence .]
  3-shingles (trigrams): [This is a] [is a sentence] [a sentence .]
Character n-grams: e.g., all trigrams of the word “sentence”: [sen, ent, nte, ten, enc, nce]
NB: “tokenization”, i.e. splitting, can be character-based, use regular expressions, be probabilistic, …
Snag: the terms “shingle”, “token”, and “n-gram” are not used consistently… but “n-gram” and “token” are far more common!
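A minimal sketch of both n-gram flavors in plain Python (the ngrams helper is mine, not from the slide; slicing makes it work for token lists and character strings alike):

def ngrams(seq, n):
    """All consecutive runs of n items: token n-grams or character n-grams."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

tokens = "This is a sentence .".split()
print(ngrams(tokens, 2))      # token bigrams: [['This', 'is'], ['is', 'a'], ...]
print(ngrams("sentence", 3))  # character trigrams: ['sen', 'ent', 'nte', ...]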
Lemmatization, Part-of-Speech (PoS) tagging, and Named Entity Recognition (NER)
Token Lemma PoS NER
Constitutive constitutive JJ O
binding binding NN O
to to TO O
the the DT O
peri-κ peri-kappa NN B-DNA
B B NN I-DNA
site site NN I-DNA
is be VBZ O
seen see VBN O
in in IN O
monocytes monocyte NNS B-cell
. . . O
Linguistic annotations of tokens (used to train automated classifiers).
Penn Treebank: the de facto standard PoS tagset {NN, JJ, DT, VBZ, …}
B-I-O chunk encoding: a Begin token followed by Inside tokens forms a (relevant) chunk; Outside tokens lie outside any chunk.
Common alternatives: I-O, I-E-O, B-I-E-W-O (E = End token, W = single-Word (unigram) chunk)
Stanford CoreNLP, FreeLing, FACTORIE, and many more…
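To make the B-I-O encoding concrete, here is a small decoder (my own helper, not from the slide; it assumes well-formed tag sequences) that collects tagged tokens back into chunks:

def bio_chunks(tokens, tags):
    """Collect maximal B-I runs into (entity type, token list) chunks."""
    chunks = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            chunks.append((tag[2:], [token]))  # open a new chunk
        elif tag.startswith("I-") and chunks:
            chunks[-1][1].append(token)        # extend the open chunk
        # "O" tokens are outside any chunk and simply skipped
    return chunks

tokens = ["peri-κ", "B", "site", "is", "seen", "in", "monocytes"]
tags = ["B-DNA", "I-DNA", "I-DNA", "O", "O", "O", "B-cell"]
print(bio_chunks(tokens, tags))
# [('DNA', ['peri-κ', 'B', 'site']), ('cell', ['monocytes'])]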
Word vectors and inverted indices
[Figure: two texts plotted as word-count vectors; each token/word is a dimension. Similarity(T1, T2) := cos(T1, T2)]
Comparing text vectors: e.g., cosine similarity.
Text vectorization: the inverted index.
Text 1: He that not wills to the end neither wills to the means.
Text 2: If the mountain will not go to Moses, then Moses must go to the mountain.
tokens Text 1 Text 2
end 1 0
go 0 2
he 1 0
if 0 1
means 1 0
Moses 0 2
mountain 0 2
must 0 1
not 1 1
that 1 0
the 2 2
then 0 1
to 2 2
will 2 1
INDRI: “Search engine basics”
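A hand-rolled sketch of that comparison over the two example texts (mine, not from the slide; note that it skips the lemmatization the table above applies, so “wills” and “will” stay distinct):

import math
from collections import Counter

t1 = Counter("he that not wills to the end neither wills to the means".split())
t2 = Counter("if the mountain will not go to moses then moses must go to the mountain".split())

def cosine(a, b):
    """cos(T1, T2): dot product of the count vectors over their norms."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb)

print(cosine(t1, t2))  # shared dimensions here: 'not', 'the', 'to'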
Inverted indices and the central dogma of machine learning
y = h_θ(X): the transposed inverted index Xᵀ (n “texts” × p n-grams, i.e., instances/observations × variables/features), multiplied by the parameters θ (one per feature), yields y (one value per instance): a rank, class, expectation, probability, descriptor*, …
(Hyperparameters are settings that control the learning algorithm.)
“Nonparametric” models instead grow their number of parameters per instance rather than per feature.
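A toy numeric sketch of that picture (the numbers are mine) for a linear hypothesis h_θ(X) = Xθ:

import numpy as np

# 3 "texts" (n instances) x 4 n-grams (p features): the transposed inverted index.
X = np.array([[1, 0, 2, 0],
              [0, 1, 0, 2],
              [1, 1, 1, 0]])
theta = np.array([0.5, -0.2, 0.1, 0.3])  # one parameter per feature

y = X @ theta  # one output (rank, class score, probability, ...) per instance
print(y)       # [0.7 0.4 0.4]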
The curse of dimensionality (R.E. Bellman, 1961) [the inventor of dynamic programming]
• p ≫ n (far more tokens/features than texts/instances)
• Inverted indices (X) are (discrete) sparse matrices.
• Even with millions of training examples, unseen tokens will keep popping up during evaluation or in production.
‣ In such a high-dimensional hypercube, most instances are closer to a face of the cube (“nothing”, the outside) than to any other instance.
✓ Remedy: (feature) dimensionality reduction, the “blessing of non-uniformity.”
• feature extraction (compression): PCA/LSA (projection), factor analysis (regression), compression, auto-encoders & deep learning (compression & embedding), … (see the sketch after this list)
• feature selection (elimination): LASSO (regularization), SVM (support vectors), Bayesian nets (structure learning), locality-sensitive hashing, random projections, …
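A minimal feature-extraction sketch using LSA (truncated SVD) in scikit-learn; the toy corpus and the choice of two latent dimensions are my own:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat", "the dog sat on the log",
        "stocks fell on weak earnings", "markets rose on strong earnings"]
X = CountVectorizer().fit_transform(docs)  # sparse inverted index, n x p

lsa = TruncatedSVD(n_components=2)  # project p token dimensions down to 2
Z = lsa.fit_transform(X)            # dense n x 2 matrix of latent features
print(Z.shape)                      # (4, 2): similar documents end up nearby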
Applications
Google’s review summaries: Opinion mining (“sentiment” analysis).
Don’t do it, please… ;-) (If you must: see document and text classification software.)
Polarity of sentiment keywords in IMDB.
Christopher Potts. On the negativity of negation. 2011.
“not good”
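One widely used trick from this line of work is Potts-style negation marking (sketched below with my own helper): suffix every token between a negation word and the next punctuation mark with _NEG, so that “not good” contributes the negative-leaning feature good_NEG instead of the misleadingly positive good.

import re

NEGATORS = {"not", "no", "never", "n't", "cannot"}

def mark_negation(tokens):
    """Append _NEG to tokens inside a negation's scope (until punctuation)."""
    out, in_scope = [], False
    for tok in tokens:
        if re.fullmatch(r"[.,:;!?]", tok):
            in_scope = False  # punctuation closes the negation scope
            out.append(tok)
        else:
            out.append(tok + "_NEG" if in_scope else tok)
            if tok.lower() in NEGATORS:
                in_scope = True
    return out

print(mark_negation("this movie is not good at all .".split()))
# ['this', 'movie', 'is', 'not', 'good_NEG', 'at_NEG', 'all_NEG', '.']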
Language understanding: Parsing and semantic analysis.
Disambiguation, everywhere: coreference (anaphora) resolution, named entity recognition, and entity grounding. (Apple Siri)
(Parsing theory: L. Tesnière, N. Chomsky)
Stanford, BLLIP (C-J), Malt, LinkGrammar, RedShift, and many more…
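A minimal parse-and-NER sketch with spaCy (an MIT-licensed FOSS library, though not one of the tools named on the slide); the small English model is an assumption:

import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline (PoS, parser, NER)
doc = nlp("Is Princeton really good for you?")

for tok in doc:
    # each token, its dependency relation, and its syntactic head
    print(tok.text, tok.dep_, tok.head.text)

for ent in doc.ents:
    # entity recognition; grounding "Princeton" (university? town?) is harder
    print(ent.text, ent.label_)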
Automatic text summarization is very difficult, because of…
• Variance/human agreement: When is a summary “correct”?
• Coherence: providing discourse structure (text flow) to the summary.
• Paraphrasing: important sentences are repeated, but with different wordings.
• Implied messages: “the Dow Jones index rose 10 points” → “the economy is thriving”.
• Anaphora (coreference) resolution: very hard, but crucial.
Image Source: www.lexalytics.com
Lex[Page]Rank (JUNG), sumy, TextTeaser (the author got hired by Google…)
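A crude extractive sketch (my own simplification; LexRank/TextRank proper build a sentence similarity graph): score each sentence by cosine similarity to the whole document and keep the top ones.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The economy is slumping.",
    "Unemployment is on the rise.",
    "A local cat show drew large crowds.",
]
vec = TfidfVectorizer()
S = vec.fit_transform(sentences)            # one tf-idf vector per sentence
doc = vec.transform([" ".join(sentences)])  # the whole document as one vector

scores = cosine_similarity(S, doc).ravel()
top = sorted(scores.argsort()[::-1][:2])    # two best sentences, original order
print([sentences[i] for i in top])          # keeps text flow (coherence!)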
Machine translation: Deep learning with auto-encoders.
Different languages…
‣ have only one gender (en) or use opposing genders (es vs. de: el/die, la/der, …/das);
‣ have different verb placements (es⬌de);
‣ have different concepts of verbs (Latin, Arabic, CJK);
‣ use different tenses (en⬌de);
‣ have different word orders (Latin, Arabic, CJK).
DL4J
Question answering: The Champions League of TM & NLP.
Biggest issue: statistical inference
IBM Watson, WolframAlpha
Category: Oscar-Winning Movies
Hint: Its final scene includes the line “I do wish we could chat longer, but I’m having an old friend for dinner.”
Answer: The Silence of the Lambs
All men are mortal. Socrates probably is a man…
…therefore, Socrates might be mortal. (cognitive computing)
Information extraction: Knowledge mining for molecular biology.
[Figure: a knowledge-mining pipeline. Articles from the WWW and biological repositories feed article classification, then named entity recognition, entity mapping (grounding: e.g., Cdk5 in rat, TaxID 10116, UniProt Q03114), entity associations, and relationship extraction; the results (binary interactions, experimental methods, relationship annotations) flow back into biological repositories, supported by ontologies & thesauri, biological models, and short factoid question answering.]
MITIE, OpenDMAP, ClearTK
Text mining and language processing is all about resolving ambiguities.
Anaphora resolution: Carl and Bob were fighting: “You should shut up,” Carl told him.
Part-of-Speech tagging: The robot wheels out the iron.
Paraphrasing: Unemployment is on the rise. vs. The economy is slumping.
Entity recognition & grounding: Is Princeton really good for you?