Rainbow Tool Kit

Rainbow Tool Kit Matt Perry Global Information Systems Spring 2003


Page 1: Rainbow Tool Kit

Rainbow Tool Kit

Matt Perry
Global Information Systems, Spring 2003

Page 2: Rainbow Tool Kit

Outline
1. Introduction to Rainbow
2. Description of Bow Library
3. Description of Rainbow methods
   1. Naïve Bayes
   2. TFIDF/Rocchio
   3. K Nearest Neighbor
   4. Probabilistic Indexing
4. Demonstration of Rainbow
   1. 20 newsgroups example

Page 3: Rainbow Tool Kit

What is Rainbow?

Publicly available executable program that performs document classification

Part of the Bow (or libbow) library
  A library of C code useful for writing statistical text analysis, language modeling and information retrieval programs

Developed by Andrew McCallum of Carnegie Mellon University

Page 4: Rainbow Tool Kit

About Bow Library

Provides facilities for:
  Recursively descending directories, finding text files.
  Finding `document' boundaries when there are multiple documents per file.
  Tokenizing a text file, according to several different methods.
  Including N-grams among the tokens.
  Mapping strings to integers and back again, very efficiently.
  Building a sparse matrix of document/token counts.
  Pruning vocabulary by word counts or by information gain.
  Building and manipulating word vectors.

Page 5: Rainbow Tool Kit

About Bow Library

Provides facilities for:
  Setting word vector weights according to Naive Bayes, TFIDF, and several other methods.
  Smoothing word probabilities according to Laplace (Dirichlet uniform), M-estimates, Witten-Bell, and Good-Turing.
  Scoring queries for retrieval or classification.
  Writing all data structures to disk in a compact format.
  Reading the document/token matrix from disk in an efficient, sparse fashion.
  Performing test/train splits, and automatic classification tests.
  Operating in server mode, receiving and answering queries over a socket.

Page 6: Rainbow Tool Kit

About Bow Library

Does not:
  Have English parsing or part-of-speech tagging facilities.
  Do smoothing across N-gram models.
  Claim to be finished.
  Have good documentation.
  Claim to be bug-free.
  Run on a Windows machine.

Page 7: Rainbow Tool Kit

About Bow Library

In addition to Rainbow, Bow contains 3 other executable programs:
  Crossbow - does document clustering
  Arrow - does document retrieval (TFIDF)
  Archer - does document retrieval
    Supports AltaVista-type queries: +, -, "", etc.

Page 8: Rainbow Tool Kit

Back to Rainbow

Classification methods used by Rainbow:

Naïve Bayes (mostly designed for this)

TFIDF/Rocchio

K-Nearest Neighbor

Probabilistic Indexing

Page 9: Rainbow Tool Kit

Description of Naïve Bayes

Bayesian reasoning provides a probabilistic approach to learning.

Idea of Naïve Bayes Classification is to assign a new instance the most probable target value, given the attribute values of the new instance.

How?

Page 10: Rainbow Tool Kit

Description of Naïve Bayes

Based on Bayes Theorem

Notation
  P(h) = probability that a hypothesis h holds
    Ex. Pr(document1 fits the sports category)
  P(D) = probability that training data D will be observed
    Ex. Pr(we will encounter document1)

Page 11: Rainbow Tool Kit

Description of Naïve Bayes

Notation, continued
  P(D|h) = probability of observing data D given that hypothesis h holds
    Ex. probability that we will observe document 1 given that document 1 is about sports
  P(h|D) = probability that h holds given training data D. This is what we want.
    Ex. probability that document 1 is a sports document given the training data D

Page 12: Rainbow Tool Kit

Description of Naïve Bayes

Bayes Theorem

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

Page 13: Rainbow Tool Kit

Description of Naïve Bayes

Bayes Theorem
  Provides a way to calculate P(h|D) from P(h), together with P(D) and P(D|h).
  P(h|D) increases with P(D|h) and P(h).
  P(h|D) decreases as P(D) increases: the more probable it is to observe D independent of h, the less evidence D provides in support of h.
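A small worked example may make these relationships concrete. The sketch below applies Bayes theorem to the sports-document example; all probability values, and the choice of the word "goal" as the observed data, are made up for illustration and do not come from the slides.

```python
def posterior(prior_h, likelihood_d_given_h, evidence_d):
    """Bayes theorem: P(h|D) = P(D|h) * P(h) / P(D)."""
    return likelihood_d_given_h * prior_h / evidence_d


# Hypothetical numbers, chosen only for illustration.
p_sports = 0.2              # P(h): a document is about sports
p_goal_given_sports = 0.5   # P(D|h): a sports document contains "goal"
p_goal_given_other = 0.05   # P(D|not h): a non-sports document contains "goal"

# P(D) by the law of total probability.
p_goal = p_goal_given_sports * p_sports + p_goal_given_other * (1 - p_sports)

p_sports_given_goal = posterior(p_sports, p_goal_given_sports, p_goal)
print(round(p_sports_given_goal, 3))
```

Raising `p_goal_given_other` makes "goal" more likely regardless of the hypothesis, so P(D) grows and the posterior shrinks, which is the "decreases with P(D)" behaviour described above.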

Page 14: Rainbow Tool Kit

Description of Naïve Bayes

Approach: Assign the most probable target value given the attributes

$$v = \arg\max_{v_j} P(v_j \mid a_1, \ldots, a_n)$$

Page 15: Rainbow Tool Kit

Description of Naïve Bayes

Simplification based on Bayes Theorem

$$v = \arg\max_{v_j} \frac{P(a_1, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, \ldots, a_n)} = \arg\max_{v_j} P(a_1, \ldots, a_n \mid v_j)\, P(v_j)$$

Page 16: Rainbow Tool Kit

Description of Naïve Bayes

Naïve Bayes assumes (incorrectly) that the attribute values are conditionally independent given the target value

$$v = \arg\max_{v_j} P(v_j) \prod_i P(a_i \mid v_j)$$

Page 17: Rainbow Tool Kit

Rainbow Algorithm

Let $P(v_j)$ = probability that a document belongs to class $v_j$

Let $P(w_k \mid v_j)$ = probability that a randomly drawn word from class $v_j$ will be the word $w_k$

Page 18: Rainbow Tool Kit

Rainbow Algorithm

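Estimate

$$P(w_k \mid v_j) = \frac{n_k + 1}{n + |Vocabulary|}$$

where $n$ is the total number of word positions in the training documents of class $v_j$ and $n_k$ is the number of times the word $w_k$ occurs among them.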

Page 19: Rainbow Tool Kit

Rainbow Algorithm

1. Collect all words, punctuation, and other tokens that occur in examples
2. Calculate the required $P(v_j)$ and $P(w_k \mid v_j)$ probability terms
3. Return the estimated target value for the document Doc

$$v = \arg\max_{v_j} P(v_j) \prod_i P(a_i \mid v_j)$$
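The three steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Rainbow's actual C implementation. The function names and the toy training set are hypothetical. It uses the smoothed estimate (n_k + 1) / (n + |Vocabulary|) for P(w_k | v_j), sums logarithms instead of multiplying probabilities (which leaves the maximizing class unchanged but avoids numeric underflow), and skips words never seen in training.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Estimate P(v_j) and P(w_k | v_j) from (tokens, label) pairs."""
    # Step 1: collect every token that occurs in the examples.
    vocabulary = {tok for tokens, _ in labeled_docs for tok in tokens}
    docs_per_class = Counter(label for _, label in labeled_docs)
    words_per_class = defaultdict(Counter)
    for tokens, label in labeled_docs:
        words_per_class[label].update(tokens)

    # Step 2: compute the probability terms.
    priors = {v: docs_per_class[v] / len(labeled_docs) for v in docs_per_class}
    cond = {}
    for v in docs_per_class:
        n = sum(words_per_class[v].values())  # word positions in class v
        cond[v] = {
            w: (words_per_class[v][w] + 1) / (n + len(vocabulary))
            for w in vocabulary
        }
    return vocabulary, priors, cond


def classify_nb(tokens, vocabulary, priors, cond):
    """Step 3: return the class maximizing log P(v_j) + sum_i log P(a_i | v_j)."""
    best_class, best_score = None, -math.inf
    for v in priors:
        score = math.log(priors[v])
        for tok in tokens:
            if tok in vocabulary:  # assumption: unseen words are ignored
                score += math.log(cond[v][tok])
        if score > best_score:
            best_class, best_score = v, score
    return best_class


# Toy training set, made up for illustration.
training = [
    ("the team won the game".split(), "sports"),
    ("the player scored a goal".split(), "sports"),
    ("the senate passed the bill".split(), "politics"),
    ("the president signed the bill".split(), "politics"),
]
model = train_naive_bayes(training)
print(classify_nb("the team scored".split(), *model))       # sports
print(classify_nb("the president signed".split(), *model))  # politics
```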

Page 20: Rainbow Tool Kit

TFIDF/Rocchio

The main component of the Rocchio algorithm is the TFIDF (term frequency / inverse document frequency) word weighting scheme.

TF(w,d) (Term Frequency) is the number of times word w occurs in a document d.

DF(w) (Document Frequency) is the number of documents in which the word w occurs at least once.

Page 21: Rainbow Tool Kit

TFIDF/Rocchio

The inverse document frequency is calculated as

$$IDF(w) = \log\left(\frac{|D|}{DF(w)}\right)$$

Page 22: Rainbow Tool Kit

TFIDF/Rocchio

Based on word weight heuristics, the word wi is an important indexing term for a document d if it occurs frequently in that document

However, words that occur frequently in many documents spanning many categories are rated as less important

Page 23: Rainbow Tool Kit

TFIDF/Rocchio

Each document d is represented as a vector within a given vector space V:

$$\vec{d} = (d^{(1)}, \ldots, d^{(|F|)})$$

Page 24: Rainbow Tool Kit

TFIDF/Rocchio

The value $d^{(i)}$ of feature $w_i$ for a document d is calculated as the product

$$d^{(i)} = TF(w_i, d) \cdot IDF(w_i)$$

$d^{(i)}$ is called the weight of the word $w_i$ in the document d.
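The weighting scheme can be sketched as follows. This is an illustrative sketch only: the function name and the toy documents are hypothetical, |D| is taken to be the number of documents in the collection, and the natural logarithm is assumed since the slides do not state a base.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (features, one weight vector per doc)."""
    features = sorted({tok for doc in docs for tok in doc})
    # DF(w): number of documents in which w occurs at least once.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {w: math.log(len(docs) / df[w]) for w in features}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # TF(w, d): occurrences of w in this document
        vectors.append([tf[w] * idf[w] for w in features])
    return features, vectors


# Toy collection, made up for illustration.
docs = [
    "the ball hit the goal".split(),
    "the vote passed".split(),
    "the ball game".split(),
]
features, vectors = tfidf_vectors(docs)
for vec in vectors:
    print([round(x, 2) for x in vec])
```

In this toy collection "the" occurs in every document, so its IDF is log(1) = 0 and its weight vanishes everywhere, while "goal" occurs in only one document and receives the largest IDF.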

Page 25: Rainbow Tool Kit

TFIDF/Rocchio

Documents that are “close together” in vector space talk about the same things.

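[Figure: document vectors d1 through d5 in a vector space with term axes t1, t2, t3, with angles θ and φ marked between vectors.]

http://www.stanford.edu/class/cs276a/handouts/lecture4.ppt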

Page 26: Rainbow Tool Kit

TFIDF/Rocchio

The closeness of vectors d1 and d2 is captured by the cosine of the angle θ between them.
Note: this is a similarity, not a distance.

[Figure: vectors d1 and d2 in a vector space with term axes t1, t2, t3, separated by the angle θ.]

http://www.stanford.edu/class/cs276a/handouts/lecture4.ppt

Page 27: Rainbow Tool Kit

TFIDF/Rocchio

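Cosine of the angle between two vectors
The denominator involves the lengths of the vectors
So the cosine measure is also known as the normalized inner product

$$sim(d_j, d_k) = \frac{\vec{d_j} \cdot \vec{d_k}}{|\vec{d_j}|\,|\vec{d_k}|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2}\, \sqrt{\sum_{i=1}^{n} w_{i,k}^2}}$$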

http://www.stanford.edu/class/cs276a/handouts/lecture4.ppt

Page 28: Rainbow Tool Kit

TFIDF/Rocchio

A vector can be normalized (given a length of 1) by dividing each of its components by the vector's length.
This maps vectors onto the unit sphere.
Then longer documents don't get more weight.
For normalized vectors, the cosine is simply the dot product:

$$\cos(\vec{d_j}, \vec{d_k}) = \vec{d_j} \cdot \vec{d_k}$$

http://www.stanford.edu/class/cs276a/handouts/lecture4.ppt
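The cosine measure and the effect of normalization can be checked with a short sketch. The helper names and the example vectors are hypothetical, and the code assumes non-zero vectors.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))


def cosine_similarity(u, v):
    """Normalized inner product; assumes neither vector is all zeros."""
    return dot(u, v) / (norm(u) * norm(v))


def normalize(u):
    """Scale a non-zero vector to length 1."""
    length = norm(u)
    return [a / length for a in u]


# Made-up weight vectors: d2 points the same way as d1 but is twice as long.
d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]
d3 = [0.0, 0.0, 3.0]

print(cosine_similarity(d1, d2))  # close to 1: same direction
print(cosine_similarity(d1, d3))  # 0.0: no terms in common
print(dot(normalize(d1), normalize(d2)))  # matches the first value
```

Because d2 is just a longer copy of d1, the two are treated as maximally similar, which is the sense in which longer documents do not get more weight.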

Page 29: Rainbow Tool Kit

Rainbow Algorithm

Construct a set of prototype vectors
  One vector for each class
  This serves as the learned model
The model is used to classify a new document D
  D is assigned to the class with the most similar vector
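A simplified sketch of this prototype approach is given below. It is an assumption-laden illustration rather than Rainbow's implementation: each prototype is simply the average of its class's document vectors, whereas the full Rocchio formula also subtracts a weighted average of the vectors from the other classes. The function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_prototypes(vectors, labels):
    """Average the document vectors of each class (simplified prototype)."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def classify_prototype(vec, prototypes):
    """Assign the class whose prototype is most similar to the document."""
    return max(prototypes, key=lambda label: cosine(vec, prototypes[label]))


# Made-up term-weight vectors over four features.
train_vectors = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 1, 2, 1]]
train_labels = ["sports", "sports", "politics", "politics"]
prototypes = build_prototypes(train_vectors, train_labels)

print(classify_prototype([1, 1, 0, 0], prototypes))  # sports
print(classify_prototype([0, 0, 1, 1], prototypes))  # politics
```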

Page 30: Rainbow Tool Kit

K Nearest Neighbor

Features
  All instances correspond to points in an n-dimensional Euclidean space
  Classification is delayed until a new instance arrives
  Classification done by comparing feature vectors of the different points
  Target function may be discrete or real-valued

Page 31: Rainbow Tool Kit

K Nearest Neighbor

1 Nearest Neighbor

Page 32: Rainbow Tool Kit

K Nearest Neighbor

An arbitrary instance x is represented by (a1(x), a2(x), a3(x), ..., an(x)), where ai(x) denotes the features

Euclidean distance between two instances:

$$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left(a_r(x_i) - a_r(x_j)\right)^2}$$

Find the k-nearest neighbors whose distance from your test cases falls within a threshold p.

If x of those k-nearest neighbors are in category ci, then assign the test case to ci, else it is unmatched.
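The procedure described on this slide can be sketched as below. The function names, parameter names and training points are hypothetical, and two interpretations are assumed: "x of those neighbors" is read as "at least x", and the category considered is the most common one among the neighbors that fall within the threshold p.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((ar - br) ** 2 for ar, br in zip(a, b)))


def knn_classify(query, training, k, p, x):
    """training: list of (feature_vector, category) pairs.

    Returns a category, or None when the test case is unmatched.
    """
    # The k training points closest to the query.
    neighbors = sorted(training, key=lambda item: euclidean(query, item[0]))[:k]
    # Keep only the neighbors whose distance falls within the threshold p.
    within = [cat for vec, cat in neighbors if euclidean(query, vec) <= p]
    if not within:
        return None
    category, votes = Counter(within).most_common(1)[0]
    return category if votes >= x else None


# Made-up two-dimensional training points.
training = [
    ([0.0, 0.0], "a"),
    ([0.0, 1.0], "a"),
    ([1.0, 0.0], "a"),
    ([5.0, 5.0], "b"),
    ([5.0, 6.0], "b"),
]

print(knn_classify([0.2, 0.2], training, k=3, p=2.0, x=2))    # a
print(knn_classify([10.0, 10.0], training, k=3, p=2.0, x=2))  # None
```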

Page 33: Rainbow Tool Kit

Rainbow Algorithm

Construct a model of points in n-dimensional space for each category

Classify a document D based on the k nearest points

Page 34: Rainbow Tool Kit

Probabilistic Indexing

Idea
  Quantitative model for automatic indexing based on some statistical assumptions about word distribution.
  2 types of words: function words and specialty words
    Function words = words with no importance for defining classes (the, it, etc.)
    Specialty words = words that are important in defining classes (war, terrorist, etc.)

Page 35: Rainbow Tool Kit

Probabilistic Indexing

Idea
  Function words follow a Poisson distribution over the set of all documents
  Specialty words do not follow a Poisson distribution over the set of all documents
  Specialty word distribution can be described by a Poisson process within its class
  Specialty words distinguish more than one class of documents
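To illustrate what "follows a Poisson distribution" means for a function word, the sketch below computes the Poisson probability of seeing a word exactly k times in a document and the expected number of documents with each count. The rate of 2 occurrences per document and the collection size of 1000 are made-up values, and the function names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """P(exactly k occurrences) when the mean count per document is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def expected_doc_counts(num_docs, lam, max_k):
    """Expected number of documents containing the word k times, k = 0..max_k."""
    return [num_docs * poisson_pmf(k, lam) for k in range(max_k + 1)]


# Hypothetical function word: 2 occurrences per document on average, 1000 documents.
expected = expected_doc_counts(1000, 2.0, 6)
for k, count in enumerate(expected):
    print(k, round(count, 1))
```

Under these assumptions, comparing observed per-document counts of a word against such expected counts is one way to judge whether it behaves like a function word; a specialty word would be expected to deviate when counted over all documents.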

Page 36: Rainbow Tool Kit

Rainbow Method

Goal is to estimate P(C|si, dm)
  Probability that assignment of term si to the document dm is correct
Once terms have been identified, assign Form Of Occurrence (FOC)
  Certainty that term is correctly identified
  Significance of term

Page 37: Rainbow Tool Kit

Rainbow Method

If term t appears in document d and a term descriptor from t to s exists, where s is an indexing term, then generate a descriptor indicator

The set of generated term descriptors can be evaluated and a probability calculated that document d lies in class c

Page 38: Rainbow Tool Kit

Rainbow Demonstration

20 newsgroups example

References
  http://www.stanford.edu/class/cs276a/handouts/lecture4.ppt
  http://www-2.cs.cmu.edu/~mccallum/bow/
  http://webster.cs.uga.edu/~miller/SemWeb/Project/ApMlPresent.ppt
  http://citeseer.nj.nec.com/vanrijsbergen79information.html
  http://citeseer.nj.nec.com/54920.html
  Mitchell, Tom M. Machine Learning. 1997. http://www-2.cs.cmu.edu/~tom/book.html

Page 39: Rainbow Tool Kit

Rainbow Commands

Create a model for the classes:
    rainbow -d ~/model --index training directory

Classifying documents:
  Pick a method (naivebayes, knn, tfidf, prind):
    rainbow -d ~/model --method=tfidf --test=1
  Automatic test:
    rainbow -d ~/model --test-set=0.4 --test=3
  Test 1 at a time:
    rainbow -d ~/model --query test file

Page 40: Rainbow Tool Kit

Rainbow Demonstration

Can also run as a server:
    rainbow -d ~/model --query-server=port
  Use telnet to classify new documents

Diagnostics:
  List the words with the highest mutual info:
    rainbow -d ~/model -I 10
  Perl script for printing stats:
    rainbow -d ~/model --test-set=0.4 --test=2 | rainbow-stats.pl