
Text Mining

Machine Learning

Vienna University

May 30, 2014

1


Text Mining

Text mining (also called text data mining or text analytics) refers to the process of deriving high-quality information from text:

1 structuring the input text
  - parsing
  - addition of some linguistic features
  - removal of some linguistic features
  - insertion into a database

2 deriving patterns within the structured data

3 evaluation and interpretation of the output.

Text mining tasks:

1 text categorization
2 text clustering
3 concept/entity extraction
4 production of granular taxonomies
5 sentiment analysis
6 document summarization
7 entity relation modeling (i.e., learning relations between named entities)

2


Information Retrieval

Corpus: $D$ units of text $d_j$, $j = 1, \dots, D$ — documents, sub-documents, paragraphs, sentences, or windows of a fixed number of words.

Dictionary $W$: the $N$ different words $\{word_1, word_2, \dots, word_N\}$ that appear in the corpus.

Information Retrieval (IR) is the task of finding documents from the corpus that satisfy an information need.

Examples: Web search, E-mail search, searching your computer, corporate knowledge

bases, legal information retrieval

You need a query!

3


Fundamental Property of Text (Zipf's Law)

Zipf's law states that, given some corpus of natural language, a word's frequency is inversely proportional to its frequency rank, so very few words account for the largest proportion of a written text.

Illustration:

1242 tour descriptions with about 1.5 million words and about 30 thousand different words. The frequencies of the fifty most frequent words are shown in the left figure; the right figure shows word frequency in relation to the number of words.
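As a minimal sketch of this relation (using a toy corpus rather than the lecture's tour-description data), one can count word frequencies and inspect the product of frequency and rank, which Zipf's law predicts to stay roughly constant:

```python
from collections import Counter

# Minimal sketch (not the lecture's tour-description data): count word
# frequencies in a small toy corpus and inspect the rank-frequency relation
# that Zipf's law describes (frequency roughly proportional to 1/rank).
corpus = [
    "the tour starts in the city and the guide meets the group",
    "the city tour includes a visit to the old town",
    "a free day in the city to explore the town",
]

counts = Counter(word for doc in corpus for word in doc.lower().split())

for rank, (word, freq) in enumerate(sorted(counts.items(),
                                           key=lambda kv: kv[1],
                                           reverse=True)[:10], start=1):
    # Under Zipf's law, freq * rank stays roughly constant.
    print(f"rank {rank:2d}  freq {freq:3d}  freq*rank {freq * rank:3d}  {word}")
```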

4


Pre-processing of Texts

Main components of pre-processing (in the flowchart, blue blocks are obligatory, yellow blocks are applied only if required):

START

Multiple Documents (Corpus)

Removal of irrelevant information

Removal of empty documents

Removal of identical documents

Filtering of English texts

Tokenization

Spellchecking

Removal of repeating characters

Replacement of contractions

Replacement with synonyms

Replacement with antonyms

Change uppercase to lowercase

Removal of stopwords

Lemmatization

Stemming

Replacement with hyperonyms

Removal of unique tokens

Use of SentiWordNet

END
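A simplified sketch of a few of these steps is given below; the tiny stopword list and the crude suffix stripping are stand-ins for a real stopword list and a real stemmer or lemmatizer (e.g. from NLTK), and the function name is illustrative only:

```python
import re

# Simplified sketch of a few obligatory steps from the flowchart above
# (tokenization, lowercasing, stopword removal); the stopword list and the
# crude suffix stripping merely stand in for a real stopword list and a real
# stemmer/lemmatizer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "we"}

def preprocess(document: str) -> list[str]:
    tokens = re.findall(r"[a-z']+", document.lower())        # tokenize + lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]        # remove stopwords
    tokens = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]  # toy "stemming"
    return [t for t in tokens if t]                           # drop emptied tokens

print(preprocess("We visited the old town and enjoyed the local markets."))
# ['visit', 'old', 'town', 'enjoy', 'local', 'market']
```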

5


Term-Document Matrices

Element $td_{i,j}$ of a term-document matrix $td$ is 1 if word $i$ occurs in document $j$, and 0 otherwise.

The incidence matrix td is a huge matrix.

An inverted index is used instead of the incidence matrix to store this information: each document is identified by a docID (a document serial number), and for each word $i$ we store the list of all documents that contain this word.

From matrix td we can construct:

the document-term matrix: $dt = td^T$

the term-term matrix: $tt = td \cdot dt$

the document-document matrix: $dd = dt \cdot td$

Quite often the corpus size D is smaller than the dictionary size N, so the document

representation can be more efficient.

The dual description corresponds to the document-representation view of the problem, and the primal to the term-representation view.

In the dual, a document is represented as the counts of terms that appear in it. In the

primal, a term is represented as the counts of the documents in which it appears.

$tt_{i,j}$, $i = 1, \dots, N$, $j = 1, \dots, N$, is a symmetric matrix that indicates in how many documents the words $i$ and $j$ co-occur;

$tt_{i,i}$ indicates in how many documents word $i$ occurs.

$dd_{i,j}$ indicates how many words documents $i$ and $j$ have in common.

$dd_{i,i}$ indicates how many distinct words document $i$ contains.
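A small sketch of these constructions on a hypothetical three-document corpus (not the lecture's data):

```python
import numpy as np

# Sketch of the incidence matrix and the derived matrices, using a tiny
# hypothetical corpus.
corpus = ["day trip to the city", "free day in the city", "mountain trip"]
vocab = sorted({w for doc in corpus for w in doc.split()})

# td is N x D: td[i, j] = 1 if word i occurs in document j.
td = np.array([[1 if word in doc.split() else 0 for doc in corpus]
               for word in vocab])

dt = td.T            # document-term matrix
tt = td @ dt         # tt[i, j]: number of documents in which words i and j co-occur
dd = dt @ td         # dd[i, j]: number of words documents i and j share

print(vocab)
print("documents containing 'day':", tt[vocab.index("day"), vocab.index("day")])
print("words shared by documents 0 and 1:", dd[0, 1])
```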

6


The single word probability model (unigram model)

Maximum likelihood probability estimates for each word in the vocabulary:

$$p(w_i) = \frac{\mathrm{count}(w_i)}{\sum_j \mathrm{count}(w_j)}$$

where $\mathrm{count}(w_i)$ is the frequency of word $w_i$ (the word-frequency vector) and $\sum_j \mathrm{count}(w_j)$ is the total number of words in the corpus.
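A minimal sketch of this estimate on a toy token list:

```python
from collections import Counter

# Minimal sketch of the maximum-likelihood unigram estimate
# p(w_i) = count(w_i) / sum_j count(w_j) over a toy token list.
tokens = "day city day visit tour day city".split()
counts = Counter(tokens)
total = sum(counts.values())
p = {w: c / total for w, c in counts.items()}
print(p["day"])   # 3/7 ≈ 0.4286
```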

The most frequent words from the tour descriptions, with stop words removed (the slide also shows a word cloud):

Word Frequency

day 15541

city 7590

time 5557

visit 4966

tour 4554

one 4311

take 3914

town 3877

local 3587

enjoy 3306

7


Word Co-Occurrences

$p_{i,j}$ denotes the joint probability of co-occurrence of words $i$ and $j$, and $p_i$ denotes the individual probability of occurrence of word $i$:

$$p_{i,j} = tt_{i,j}/D, \qquad p_i = tt_{i,i}/D$$

where $D$ is the number of documents.

Pointwise mutual information (PMI):

$$I_{i,j} = \log\frac{p_{i,j}}{p_i\, p_j} = \log\frac{D\, tt_{i,j}}{tt_{i,i}\, tt_{j,j}}$$

$I_{i,j} = 0$: words $i$ and $j$ are statistically independent;

$I_{i,j} > 0$: positive affinity effect, i.e. the words tend to co-occur more often than expected;

$I_{i,j} < 0$: negative affinity or rejection effect, i.e. they tend to co-occur less often than expected in the case of statistical independence.
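A small sketch of the PMI computation from a hypothetical term-term matrix (the counts and the number of documents D are made up for illustration):

```python
import numpy as np

# Sketch of pointwise mutual information from a term-term matrix tt
# (hypothetical counts; D is the number of documents in the corpus).
D = 100
tt = np.array([[30, 12],
               [12, 20]])          # tt[i, i]: docs containing word i; tt[i, j]: co-occurrences

p_i = np.diag(tt) / D              # individual occurrence probabilities
p_ij = tt / D                      # joint co-occurrence probabilities

with np.errstate(divide="ignore"):
    pmi = np.log(p_ij / np.outer(p_i, p_i))

print(pmi[0, 1])   # > 0: the two words co-occur more often than under independence
```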

8


Basic n-Gram Models

Previously co-occurrences were computed by disregarding the relative position in each

word pair under consideration. Now we will take the position of words into consideration.

As a consequence, the resulting co-occurrence matrix will not be a symmetric one.

We want to compute probabilities for larger conventional units of text such as sentences,

paragraphs and documents.

$$p(w) = p(w_1, w_2, \dots, w_m) = p(w_1)\, p(w_2|w_1)\, p(w_3|w_1, w_2) \cdots p(w_m|w_1, w_2, \dots, w_{m-1})$$

Assuming the Markov property:

1-gram: $p(w) \approx p(w_1)\, p(w_2)\, p(w_3) \cdots p(w_m)$

2-gram: $p(w) \approx p(w_2|w_1)\, p(w_3|w_2)\, p(w_4|w_3) \cdots p(w_m|w_{m-1})$

3-gram: $p(w) \approx p(w_3|w_2, w_1)\, p(w_4|w_3, w_2) \cdots p(w_m|w_{m-1}, w_{m-2})$

n-gram: $p(w) \approx \prod_i p(w_i|w_{i-1}, w_{i-2}, \dots, w_{i-n+1})$

Maximum likelihood estimates can be easily computed for probabilities by using a

training corpus.
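As a minimal sketch, bigram probabilities can be estimated from a toy corpus as relative counts, $p(w_2|w_1) = \mathrm{count}(w_1 w_2) / \mathrm{count}(w_1)$:

```python
from collections import Counter

# Minimal sketch of maximum-likelihood bigram estimates
# p(w2 | w1) = count(w1 w2) / count(w1) on a toy token list.
tokens = "free day in the city free day trip".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_bigram(w1: str, w2: str) -> float:
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_bigram("free", "day"))   # 2/2 = 1.0
print(p_bigram("day", "in"))     # 1/2 = 0.5
```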

9


Accounting for Order

Left table: bigrams from the tour descriptions, scored by raw frequency. Right table: trigrams ranked by PMI for the tour descriptions.

Nr. First Word Second Word

1 national park

2 travel time

3 free day

4 nestimated travel

5 machu picchu

6 make way

7 time explore

8 topdeck trip

9 world heritag

10 time hours

First Word Second Word Third Word

day trek time

small village day

free day take

free time drive

relax day best

afternoon tour local

along route day

city take time

day hanoi tour

years day trip

10


Collocations on Basis of Likelihood Ratios

Likelihood ratio is a number that tells us how much more likely one hypothesis is than the

other.

We examine the following two alternative explanations for the occurrence frequency of a

bigram w1w2:

Hypothesis 1: $P(w_2|w_1) = p = P(w_2|\neg w_1)$

Hypothesis 2: $P(w_2|w_1) = p_1 \neq p_2 = P(w_2|\neg w_1)$

Hypothesis 1 is a formalization of independence (the occurrence of w2 is independent of

the previous occurrence of w1);

Hypothesis 2 is a formalization of dependence which is good evidence for an interesting

collocation.

Assuming a binomial distribution, we obtain the likelihoods for the counts of the bigram $w_1 w_2$ and of the individual words $w_1$ and $w_2$.
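A sketch of the resulting log-likelihood-ratio score under the two binomial hypotheses above; the counts are hypothetical, and the constant binomial coefficients are omitted since they cancel in the ratio:

```python
from math import log

# Sketch of the log-likelihood-ratio collocation score for a bigram w1 w2,
# following the two binomial hypotheses above (hypothetical counts):
#   c1 = count(w1), c2 = count(w2), c12 = count(w1 w2), N = number of tokens.
def log_L(k: int, n: int, x: float) -> float:
    # log of the binomial likelihood x^k (1 - x)^(n - k), ignoring the
    # binomial coefficient, which cancels in the ratio.
    return k * log(x) + (n - k) * log(1 - x)

def llr(c1: int, c2: int, c12: int, N: int) -> float:
    p  = c2 / N                 # H1: p(w2 | w1) = p(w2 | not w1)
    p1 = c12 / c1               # H2: p(w2 | w1)
    p2 = (c2 - c12) / (N - c1)  # H2: p(w2 | not w1)
    return -2 * (log_L(c12, c1, p) + log_L(c2 - c12, N - c1, p)
                 - log_L(c12, c1, p1) - log_L(c2 - c12, N - c1, p2))

# "national park": w1 and w2 almost always occur together -> large score.
print(llr(c1=120, c2=110, c12=100, N=100_000))
```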

11


Term-Frequency Matrices

We define the term-frequency matrix $tf$ by the number of occurrences $tf_{i,j}$ of word $i$ in document $j$:

Terms \ Docs   127  144  191  194  211

buy 0 0 0 0 0

buyer 0 0 0 0 0

buyers 0 2 0 0 0

calendar 0 0 0 0 0

called 0 0 0 0 0

cambridge 0 1 0 0 0

The matrix $tf$ does not consider the ordering of words in a document, which is why this model is also called the bag-of-words model.

12


Term-Document Count Matrices

The relevance of a word does not increase proportionally with its term frequency, so we use log-frequency weighting:

$$w_{i,j} = \begin{cases} 1 + \log_{10}\big(tf(t_i, d_j)\big), & \text{if } tf(t_i, d_j) > 0 \\ 0, & \text{otherwise} \end{cases}$$

The score for a document-query pair is the sum over the words $i$ that appear in both the query $q$ and the document $d_j$:

$$\mathrm{Score}(q, d_j) = \sum_{i \in q \cap d_j} \big(1 + \log_{10} tf(t_i, d_j)\big)$$

13


Term Frequency-Inverse Document Frequency (tf-idf)

Frequent terms are less informative than rare terms, and the document frequency $tt_{i,i}$ (the number of documents containing word $i$) grows with how common the word is, so it serves as an inverse measure of the informativeness of word $i$.

We define the idf (inverse document frequency) of word $i$ by $idf_i = \log_{10}(D / tt_{i,i})$.

We use $\log_{10}(D / df_i)$ instead of $D / df_i$ to dampen ("smooth") the effect of the document frequency $df_i = tt_{i,i}$.

The tf-idf weight of a word $i$ in document $j$ is the product of its tf weight and its idf weight:

$$tf\_idf_{i,j} = (1 + \log_{10} tf_{i,j}) \cdot \log_{10}(D / tt_{i,i})$$

It is the best-known weighting scheme in information retrieval.

The tf-idf weights of words are good indicators of importance, and they are easy and fast

to compute.

For the final ranking of documents for a query $q$ we compute the score:

$$\mathrm{Score}(q, d_j) = \sum_{i \in q \cap d_j} tf\_idf_{i,j}$$
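A small sketch of tf-idf weighting and this query score on a toy corpus (the documents and the query are made up for illustration):

```python
import math
from collections import Counter

# Sketch of tf-idf weighting and the query score above, on a toy corpus.
corpus = ["free day in the city",
          "city tour and local market",
          "day trip to the national park"]
docs = [Counter(doc.split()) for doc in corpus]
D = len(docs)

def df(word: str) -> int:
    return sum(1 for d in docs if word in d)          # document frequency

def tf_idf(word: str, d: Counter) -> float:
    if d[word] == 0:
        return 0.0
    return (1 + math.log10(d[word])) * math.log10(D / df(word))

def score(query: str, d: Counter) -> float:
    return sum(tf_idf(w, d) for w in query.split())

for j, d in enumerate(docs):
    print(j, round(score("city day", d), 3))
```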

14


TF-IDF Cosine Score

How similar are our documents to each other? Once we operate in the vector space model, we can compute different types of distances between word vectors: Euclidean, Hamming, Jaccard, cosine, and so forth.

Since we have represented documents as word vectors, we can compute the cosine of the angle between two documents' tf-idf weighted word vectors and treat it as an estimate of their similarity.

The tf-idf cosine similarity between two documents $d_i$ and $d_j$ is calculated as

$$\mathrm{CosineSimilarity}(d_i, d_j) = \frac{(d_i, d_j)}{\|d_i\|\, \|d_j\|}$$

with

$(d_i, d_j) = tf\_idf_{1,i}\, tf\_idf_{1,j} + tf\_idf_{2,i}\, tf\_idf_{2,j} + \cdots + tf\_idf_{N,i}\, tf\_idf_{N,j}$

$\|d_i\| = \sqrt{tf\_idf_{1,i}^2 + tf\_idf_{2,i}^2 + \cdots + tf\_idf_{N,i}^2}$

$\|d_j\| = \sqrt{tf\_idf_{1,j}^2 + tf\_idf_{2,j}^2 + \cdots + tf\_idf_{N,j}^2}$

With tf-idf Cosine Score we can find documents that are similar to each other.
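A minimal sketch of the cosine similarity between two tf-idf weighted document vectors (the weights are hypothetical):

```python
import numpy as np

# Sketch of the cosine similarity between two tf-idf weighted document
# vectors (hypothetical weights over a vocabulary of 4 terms).
d_i = np.array([0.30, 0.00, 0.17, 0.00])
d_j = np.array([0.25, 0.10, 0.00, 0.00])

cosine = d_i @ d_j / (np.linalg.norm(d_i) * np.linalg.norm(d_j))
print(cosine)   # 1.0 means identical direction, 0.0 means no shared terms
```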

15


Clustering

In some pattern recognition problems, the training data consists of a set of input vectors x without any corresponding target values. The goal in such unsupervised learning problems may be to discover groups of similar examples within the data, in which case the task is called clustering.


16


Classification

In contrast to clustering, where groups are unknown at the beginning, classification tries

to put specific documents into groups known in advance.

Typical real-world examples are spam classification of e-mails or classifying news

articles into topics.

In the following, we give two examples:

1 a very simple classifier: k-Nearest Neighbors (k-NN)

In k-NN classification an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small); see the sketch below.

2 a more advanced method: Support Vector Machines.
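A minimal k-NN sketch over tf-idf document vectors; the training vectors, labels, and the cosine-similarity choice are illustrative assumptions, not part of the lecture:

```python
import numpy as np
from collections import Counter

# Minimal k-NN sketch over tf-idf document vectors (hypothetical data):
# a new document gets the majority label of its k most similar training documents.
X_train = np.array([[0.9, 0.1, 0.0],    # "travel"
                    [0.8, 0.2, 0.1],    # "travel"
                    [0.0, 0.1, 0.9],    # "finance"
                    [0.1, 0.0, 0.8]])   # "finance"
y_train = ["travel", "travel", "finance", "finance"]

def knn_predict(x: np.ndarray, k: int = 3) -> str:
    sims = X_train @ x / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x))
    nearest = np.argsort(sims)[-k:]                      # indices of the k most similar docs
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

print(knn_predict(np.array([0.7, 0.3, 0.1])))   # expected: "travel"
```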

17


Non-negative Matrix Factorization (NMF)

A term-document matrix $tf$ is factored into a term-feature matrix and a feature-document matrix:

$$W H \approx tf$$

The features are derived from the contents of the documents, and the feature-document

matrix describes data clusters of related documents.

Matrix multiplication can be viewed as forming linear combinations of the column vectors of W with coefficients supplied by the cell values of H. Each column of tf can be computed as

$$tf_i = \sum_{j} H_{j,i}\, w_j$$

where

$tf_i$ is the $i$-th column vector of the product matrix $tf$,

$H_{j,i}$ is the cell value in the $j$-th row and $i$-th column of the matrix $H$,

$w_j$ is the $j$-th column of the matrix $W$,

and the sum runs over the features (the columns of $W$).

When multiplying matrices, the dimensions of the factor matrices may be significantly

lower than those of the product matrix and it’s this property that forms the basis of NMF.
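A small sketch of NMF on a toy term-document count matrix, assuming scikit-learn is available; the counts and the choice of two features are illustrative:

```python
import numpy as np
from sklearn.decomposition import NMF

# Sketch of NMF on a small term-document count matrix tf (hypothetical counts):
# tf ≈ W H with W the term-feature matrix and H the feature-document matrix.
tf = np.array([[2, 1, 0, 0],
               [3, 2, 0, 1],
               [0, 0, 4, 2],
               [0, 1, 3, 3]], dtype=float)   # rows: terms, columns: documents

model = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = model.fit_transform(tf)   # term-feature matrix (4 terms x 2 features)
H = model.components_         # feature-document matrix (2 features x 4 documents)

print(np.round(W @ H, 2))     # approximately reconstructs tf
```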

18


Statistical Bag-of-Words

We assume statistical independence among word occurrences.

A general class of bag-of-words models is known as topic models. The basic idea of topic models is that texts have a higher-order (latent semantic) structure which, however, is obscured by word usage (e.g. through the use of synonyms or polysemy). By using conceptual indices that are derived statistically via a truncated singular value decomposition (a two-mode factor analysis) of a given document-term matrix, this variability problem can be overcome.

We use statistical inference methods to train this type of model. Consider, for instance, the following decomposition of the probability of a given document d:

$$p(d) = \sum_z p(d, z) = \sum_z p(z)\, p(d|z)$$

We represent the document probability as the result of marginalizing the joint probability

distribution p(d, z) over a hidden discrete variable z. This hidden or latent variable is

commonly referred to as the topic variable.

$$p(d) \approx \sum_z p(z) \prod_n p(w_n|z)$$

19


Statistical Bag-of-Words

We determine p(z) and p(w|z) from a given set of data with the EM algorithm.

Initial step: Choose a number of topics and randomly generate initial values for p(z) and p(w|z).

Step 1 (E-step): Compute p(z|d) from the topic probabilities p(z) and the conditional probabilities of the words given the topics, p(w|z):

$$p(z|d) = \frac{1}{\gamma}\, p(d|z)\, p(z) \approx \frac{1}{\gamma}\, p(z) \prod_n p(w_n|z)$$

where $1/\gamma$ is a normalization factor enforcing the condition $\sum_z p(z|d) = 1$.

Step 2 (M-step): Estimate new values for both p(z) and p(w|z) from the p(z|d) probabilities obtained above and the word-per-document occurrences counted over the dataset:

$$p(w|z) = \frac{1}{\xi} \sum_d c(d, w)\, p(z|d), \qquad p(z) = \frac{1}{\rho} \sum_d \sum_w c(d, w)\, p(z|d)$$

where $c(d, w)$ is the number of times vocabulary word w occurs in document d of the collection, and $1/\xi$ and $1/\rho$ are normalization factors enforcing the conditions $\sum_w p(w|z) = 1$ and $\sum_z p(z) = 1$.
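A compact sketch of this EM procedure for the mixture-of-unigrams model above, on a hypothetical document-word count matrix; the tiny smoothing constant is an implementation detail to avoid log(0):

```python
import numpy as np

# Sketch of the EM procedure above for a mixture-of-unigrams topic model.
# c is a hypothetical document-word count matrix (documents x vocabulary).
rng = np.random.default_rng(0)
c = np.array([[5, 3, 0, 0],
              [4, 2, 1, 0],
              [0, 0, 6, 4],
              [0, 1, 5, 3]], dtype=float)
n_docs, n_words = c.shape
n_topics = 2

p_z = np.full(n_topics, 1.0 / n_topics)                 # p(z), initialised uniformly
p_w_z = rng.dirichlet(np.ones(n_words), size=n_topics)  # p(w|z), random initial values

for _ in range(50):
    # E-step: p(z|d) proportional to p(z) * prod_n p(w_n|z); work in log space.
    log_post = np.log(p_z) + c @ np.log(p_w_z).T        # shape (n_docs, n_topics)
    log_post -= log_post.max(axis=1, keepdims=True)
    p_z_d = np.exp(log_post)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    # M-step: re-estimate p(w|z) and p(z) from the counts and p(z|d).
    p_w_z = p_z_d.T @ c + 1e-10                         # tiny smoothing avoids log(0)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z = (p_z_d * c.sum(axis=1, keepdims=True)).sum(axis=0)
    p_z /= p_z.sum()

print(np.round(p_w_z, 2))   # each row is a topic's word distribution
```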

20


Latent Dirichlet Allocation (LDA)

Dirichlet distribution:

$$f(x_1, \dots, x_{K-1}; \alpha_1, \dots, \alpha_K) = \frac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$$

on the open (K−1)-dimensional simplex defined by

$$x_1, \dots, x_{K-1} > 0, \qquad x_1 + \cdots + x_{K-1} < 1, \qquad x_K = 1 - x_1 - \cdots - x_{K-1},$$

and zero elsewhere.

The normalizing constant is the multinomial Beta function, which can be expressed in terms of the gamma function:

$$B(\alpha) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)}, \qquad \alpha = (\alpha_1, \dots, \alpha_K).$$

Multinomial distribution: the probability mass function can be expressed using the gamma function as

$$f(x_1, \dots, x_k; p_1, \dots, p_k) = \frac{\Gamma\!\left(\sum_i x_i + 1\right)}{\prod_i \Gamma(x_i + 1)} \prod_{i=1}^{k} p_i^{x_i}.$$

21


Latent Dirichlet Allocation (LDA)

LDA consists of the following three steps.

Step 1: The term distribution p(w|z) is determined for each topic by p(w|z) ∼ Dirichlet(β). The dimension of the parameters corresponds to the number of topics k.

Step 2: The proportions p(z) of the topic distribution for the document d are determined by p(z) ∼ Dirichlet(α).

Step 3: For each of the N words w_i:

1 Choose a topic $z_i \sim \mathrm{Multinomial}(\theta)$.

2 Choose a word $w_i$ from a multinomial probability distribution conditioned on the topic $z_i$: $p(w_i \mid z_i, \beta)$.

β is the term distribution of topics and contains the probability of a word occurring in a

given topic.
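A small sketch of this generative process with hypothetical hyperparameters and a toy vocabulary:

```python
import numpy as np

# Sketch of the LDA generative process described above, with hypothetical
# hyperparameters alpha, beta and a toy vocabulary.
rng = np.random.default_rng(1)
vocab = ["day", "city", "tour", "hotel", "price", "market"]
k, n_words = 3, 8                      # number of topics, words per document

phi = rng.dirichlet(np.full(len(vocab), 0.5), size=k)   # Step 1: p(w|z) ~ Dirichlet(beta) per topic
theta = rng.dirichlet(np.full(k, 0.1))                  # Step 2: topic proportions ~ Dirichlet(alpha)

document = []
for _ in range(n_words):                                # Step 3: for each word position
    z = rng.choice(k, p=theta)                          #   choose a topic z_i ~ Multinomial(theta)
    w = rng.choice(len(vocab), p=phi[z])                #   choose a word from p(w|z_i)
    document.append(vocab[w])

print(document)
```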

22


Information Extraction

Information Extraction (IE) systems analyse unrestricted text in order to extract

information about pre-specified types of events, entities or relationships.

Part-of-Speech (POS) tagging

Special POS selections: select only JJ (Adjective) or only NN (Noun, singular or

mass) or assemble consecutive nouns

Noun Phrase (NP) chunking and chinking: write a small grammar with regular expressions that catches a part of a sentence, for example an individual noun phrase (a chunk):

Named Entity Recognition (NER): identify named entities and classify them into a set of predefined types such as: ORGANIZATION, PERSON, LOCATION, DATE, TIME, MONEY, PERCENT, FACILITY, GPE

Entity Feature

Relation Extraction

Figure: POS parse tree for the example sentence It/PRP was/VBD an/DT amazing/NN experience/NN !/., with an NP chunk.
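A short sketch of POS tagging and NP chunking; the lecture does not name a library, so NLTK is used here as one possible choice (the required NLTK data packages must be downloaded first):

```python
import nltk

# Sketch of POS tagging and NP chunking with NLTK. Requires the relevant NLTK
# data packages, e.g. nltk.download("punkt") and
# nltk.download("averaged_perceptron_tagger").
sentence = "It was an amazing experience !"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
print(tagged)   # [('It', 'PRP'), ('was', 'VBD'), ('an', 'DT'), ...]

# A small chunk grammar that catches simple noun phrases (optional determiner,
# any number of adjectives, then one or more nouns).
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))
```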

23


Sentiment analysis

The sentiment of a text can be positive or negative; sentiment analysis can be applied, for example, to product search.

Polarity Estimation: In sentiment analysis we learn how the property, subjectivity, or sentiment of a sentence can be deduced from the words occurring in the sentence. The two sub-tasks are:

1 to identify the product property mentioned in the given sentence,

2 to identify the polarity of the opinion (i.e. the sentiment).

The identification of polarity can again be divided into

1 identifying subjective and objective sentences and

2 identifying the polarity as neutral, positive or negative

24


Semantic Orientation (SO)

For sentiment analysis we can calculate the Semantic Orientation (SO) of documents from the corpus.

The SO measure for each individual phrase is calculated based on its affiliation with the positive reference word "excellent" and the negative reference word "poor", and is given by

$$SO(\mathrm{phrase}) = PMI(\mathrm{phrase}, \text{"excellent"}) - PMI(\mathrm{phrase}, \text{"poor"})$$

As an approximate sentiment analysis, we can treat the whole text as one long sentence and calculate the SO measure over the dictionary.
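A minimal sketch of the SO computation with hypothetical co-occurrence counts (e.g. obtained from hit counts or a co-occurrence window over the corpus); the phrase and all counts are made up for illustration:

```python
from math import log2

# Sketch of the SO measure with hypothetical co-occurrence counts.
hits = {"excellent": 1500, "poor": 900, "great view": 200,
        ("great view", "excellent"): 60, ("great view", "poor"): 5}
N = 1_000_000   # total number of counted units

def pmi(phrase: str, word: str) -> float:
    p_joint = hits[(phrase, word)] / N
    return log2(p_joint / ((hits[phrase] / N) * (hits[word] / N)))

so = pmi("great view", "excellent") - pmi("great view", "poor")
print(so)   # > 0 suggests a positive semantic orientation
```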

25