Text Categorization
Quang Nguyen, Saltlux Vietnam Development Center
Sept 17, 2010
Contents

Definition
Text Categorization Division
Dimensionality Reduction
Machine Learning Approaches
Text Categorization Evaluation
Definition
Text categorization (TC, also known as text classification or topic spotting): the activity of labelling natural language texts with thematic categories from a predefined set.

[Figure: example of assigning thematic category labels to texts]
Formal Definition
TC assigns a boolean value to each pair (dj, ci) ∈ D × C, where D is a domain of documents and C = {c1, …, c|C|} is a predefined set of categories.
Value T (true): file dj under ci
Value F (false): do not file dj under ci
Applications of Text Categorization
Document organization
Text filtering
Word sense disambiguation (WSD): consider word occurrences as documents and word senses as categories
Text Categorization Division
Single-label vs. multi-label TC
Single-label: exactly one category must be assigned to each dj ∈ D
Multi-label: any number of categories from 0 to |C| may be assigned to the same dj ∈ D

Category-pivoted vs. document-pivoted TC
Document-pivoted: given dj ∈ D, find all ci ∈ C under which it should be filed
Category-pivoted: given ci ∈ C, find all dj ∈ D that should be filed under it
Text Categorization Division
Hard categorization vs. ranking categorization
Hard: requires T or F for each pair (dj, ci)
Ranking: given dj ∈ D, rank the categories in C = {c1, …, c|C|} according to their estimated appropriateness to dj

Rule-based categorization vs. machine learning categorization
Rule-based: hand-crafted rules can give very good results, but suffer from the knowledge acquisition bottleneck (rules must be elicited from domain experts)
Machine Learning Approach to Text Categorization
To construct a ranking classifier that, given dj ∈ D, finds all ci ∈ C, we need:
A function CSVi: D → [0, 1], whose value in [0, 1] represents the evidence for the fact that dj ∈ ci
A category threshold τi such that CSVi(dj) ≥ τi is interpreted as T and CSVi(dj) < τi is interpreted as F
Training set: the classifier CSVi is inductively built by observing the characteristics of these documents
Test set: used for testing the effectiveness of the classifiers
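The scoring-plus-threshold scheme above can be sketched in a few lines. This is a minimal illustration, not a learned classifier: the scores and the threshold value are hypothetical stand-ins for outputs of a trained model.

```python
# Sketch: a scoring function returning values in [0, 1], plus a
# per-category threshold turning scores into hard T/F decisions.
# The scores below are illustrative, not produced by a real learner.

def hard_decision(csv_score: float, threshold: float) -> bool:
    """A score >= the category threshold is interpreted as T, otherwise F."""
    return csv_score >= threshold

def rank_categories(scores: dict) -> list:
    """Ranking categorization: order categories by estimated appropriateness."""
    return sorted(scores, key=scores.get, reverse=True)

scores = {"sport": 0.82, "politics": 0.35, "finance": 0.61}  # hypothetical scores
print(rank_categories(scores))               # ['sport', 'finance', 'politics']
print(hard_decision(scores["sport"], 0.6))   # True
```

The same score vector thus supports both ranking categorization (sort) and hard categorization (threshold).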
Document Representation
Document dj is usually represented as a vector dj = (w1j, …, w|T|j), where T is the set of terms (features) and 0 ≤ wkj ≤ 1.

Most systems use TF-IDF for wkj:

wkj = TF·IDF(tk, dj) = TF(tk, dj) · log(|D| / DF(tk))
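The weighting can be sketched as follows. This uses raw term counts for TF; real systems typically add a normalization step (e.g. cosine normalization) so that the weights satisfy 0 ≤ wkj ≤ 1.

```python
import math
from collections import Counter

# Sketch of TF-IDF weighting: wkj = TF(tk, dj) * log(|D| / DF(tk)),
# where DF(tk) is the number of documents containing term tk.

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    df = Counter()                       # DF(tk): documents containing tk
    for doc in docs:
        df.update(set(doc))
    n = len(docs)                        # |D|
    weights = []
    for doc in docs:
        tf = Counter(doc)                # TF(tk, dj): raw count of tk in dj
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["cat", "sat", "mat"], ["cat", "ran"], ["dog", "ran", "fast"]]
w = tfidf_weights(docs)
# "cat" appears in 2 of 3 docs, so in doc 0 its weight is 1 * log(3/2)
```

Note that a term occurring in every document gets weight log(1) = 0, reflecting that it carries no discriminating information.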
Dimensionality Reduction
Definition: DR reduces the size of the vector space from |T| to |T′| << |T|; the set T′ is called the reduced term set.

Benefits:
Reduces index size
Reduces overfitting

Distinction:
DR by term selection: T′ is a subset of T
DR by term extraction: the terms in T′ are not of the same type as the terms in T (e.g. if the terms in T are words, the terms in T′ may not be words at all)
DR by Term Selection
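As one concrete instance of DR by term selection, here is a minimal sketch that keeps only the terms with the highest document frequency, dropping rare terms. The cutoff parameter and all names are illustrative assumptions, not from the slides; other selection functions (e.g. chi-square, information gain) follow the same keep-the-top-scoring-terms pattern.

```python
from collections import Counter

# Sketch: term selection by document frequency. Terms occurring in the
# most documents are kept; rare terms are dropped from the vocabulary.

def select_terms_by_df(docs, keep: int):
    """docs: list of token lists. Returns the `keep` terms with highest DF."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))              # count each term once per document
    return [t for t, _ in df.most_common(keep)]

docs = [["cat", "sat"], ["cat", "ran"], ["dog", "ran"]]
print(select_terms_by_df(docs, 2))       # the two highest-DF terms: cat, ran
```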
Latent Semantic Indexing
Uses concepts instead of words.
A mathematical model relates documents and concepts:
Looks for concepts in the documents
Stores them in a concept space
Related documents are connected to form the concept space
No exact match with the query is needed.
Probabilistic Classifiers
View CSVi(dj) as P(ci | dj), the probability that document dj belongs to ci, computed using Bayes’ theorem:

P(ci | dj) = P(ci) · P(dj | ci) / P(dj)

P(dj): probability that a randomly picked document is dj
P(ci): probability that a randomly picked document belongs to ci
P(dj | ci): estimating this value is problematic, because the number of possible vectors dj is too high

To alleviate this problem, assume the terms are independent (giving the Naïve Bayes classifier):

P(dj | ci) = ∏ k=1..|T| P(wkj | ci)
Naïve Bayes Classifiers
Use binary-valued vector representations for documents.

Write pki as shorthand for P(wkj = 1 | ci); then

P(wkj | ci) = pki^wkj · (1 − pki)^(1 − wkj)

Plugging everything together:

P(ci | dj) ∝ P(ci) · ∏ k=1..|T| pki^wkj · (1 − pki)^(1 − wkj)

Defining the classifier for ci requires estimating the parameters {p1i, …, p|T|i} from the training data.
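A minimal sketch of this binary (Bernoulli-style) Naïve Bayes model follows. The add-one (Laplace) smoothing of the pki estimates is a standard assumption added here to avoid zero probabilities; it is not on the slides, and all data and names are illustrative.

```python
import math
from collections import defaultdict

# Sketch: binary Naive Bayes. Each pki = P(wk = 1 | ci) is estimated from
# training counts (with add-one smoothing), and P(dj | ci) multiplies
# pki for terms present in dj and (1 - pki) for terms absent from it.

def train_nb(docs, labels, vocab):
    """docs: list of token sets; labels: parallel list of categories."""
    prior, pki = {}, defaultdict(dict)
    for c in set(labels):
        idx = [i for i, l in enumerate(labels) if l == c]
        prior[c] = len(idx) / len(labels)
        for t in vocab:
            present = sum(1 for i in idx if t in docs[i])
            pki[c][t] = (present + 1) / (len(idx) + 2)   # Laplace smoothing
    return prior, pki

def log_posterior(doc, c, prior, pki, vocab):
    """log P(c) + sum over the vocabulary of log P(wk | c)."""
    score = math.log(prior[c])
    for t in vocab:
        p = pki[c][t]
        score += math.log(p if t in doc else 1 - p)
    return score

docs = [{"ball", "goal"}, {"ball", "team"}, {"vote", "law"}, {"law", "court"}]
labels = ["sport", "sport", "politics", "politics"]
vocab = {"ball", "goal", "team", "vote", "law", "court"}
prior, pki = train_nb(docs, labels, vocab)
test_doc = {"ball", "goal"}
best = max(prior, key=lambda c: log_posterior(test_doc, c, prior, pki, vocab))
print(best)  # sport
```

Working in log space avoids numerical underflow when |T| is large, since the product over the vocabulary becomes a sum.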
Decision Tree Classifiers
A decision tree classifier is a tree in which internal nodes are labelled by terms, branches departing from them are labelled by tests on the weight that the term has in the test document, and leaves are labelled by categories.
Decision Rule Classifiers
The classifier is built by an inductive rule learning method and consists of a DNF (disjunctive normal form) rule.
DNF rules are similar to decision trees, but rule learners tend to generate more compact classifiers than DT learners.
Rocchio Classifiers
Rocchio’s method computes a profile (centroid) vector for each category:

ci = (w1i, …, w|T|i), where wki = β · Σ dj∈POSi wkj / |POSi| − γ · Σ dj∈NEGi wkj / |NEGi|

β, γ: parameters adjusting the importance of positive and negative examples

Easy to implement, quite efficient.
Drawback: misses most documents if the documents in a category tend to occur in disjoint clusters.
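The profile computation can be sketched as below. The settings β = 16, γ = 4 are common illustrative choices, not values from the slides, and the toy two-dimensional vectors are assumptions.

```python
# Sketch: Rocchio category profile. Each component of ci averages the
# positive examples' weights (scaled by beta) and subtracts the average
# of the negative examples' weights (scaled by gamma).

def rocchio_profile(pos, neg, beta=16.0, gamma=4.0):
    """pos/neg: lists of equal-length document vectors for category ci."""
    dims = len(pos[0])
    profile = []
    for k in range(dims):
        p = sum(d[k] for d in pos) / len(pos)
        n = sum(d[k] for d in neg) / len(neg) if neg else 0.0
        profile.append(beta * p - gamma * n)
    return profile

pos = [[1.0, 0.0], [0.8, 0.2]]   # positive examples for ci
neg = [[0.0, 1.0]]               # negative example
print(rocchio_profile(pos, neg))  # term 1 weighted up, term 2 pushed negative
```

A test document is then scored by its similarity to this single profile vector, which is exactly why disjoint clusters of positives hurt: one centroid cannot sit inside several separated clusters at once.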
Neural Networks Classifiers
A Neural Network (NN) text classifier is a network of units.
The typical way of training NNs is backpropagation: the term weights of a training document are loaded into the input units, and if a misclassification occurs the error is “backpropagated” to adjust the parameters.

[Figure: feed-forward network with input units t1, …, t|T|, hidden layers, and output units c1, …, c|C|]
K-Nearest Neighbour Classifiers
To decide whether dj ∈ ci:
Find the k documents most similar to dj
Pick the most popular category among those k documents

A simple method that works well when the document similarity measure is accurate.
Training data is not used to build a model.
[Figure: a training set over categories c1, c2, c3, and kNN (k = 3) classification of a new document]
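The procedure can be sketched with cosine similarity over sparse vectors. All vectors and names here are illustrative; any accurate document similarity measure can take cosine’s place.

```python
import math
from collections import Counter

# Sketch: kNN categorization. Score a new document by cosine similarity
# against every training document, then vote among the k nearest.
# Documents are sparse {term: weight} dicts.

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(doc, training, k=3):
    """training: list of (vector, category) pairs."""
    ranked = sorted(training, key=lambda tc: cosine(doc, tc[0]), reverse=True)
    votes = Counter(c for _, c in ranked[:k])
    return votes.most_common(1)[0][0]

training = [
    ({"ball": 1.0, "goal": 1.0}, "sport"),
    ({"ball": 1.0, "team": 1.0}, "sport"),
    ({"vote": 1.0, "law": 1.0}, "politics"),
]
print(knn_classify({"ball": 1.0, "goal": 0.5}, training, k=3))  # sport
```

Because no model is built, all cost is paid at classification time: each new document is compared against the entire training set.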
Rocchio vs. kNN
Rocchio (a): misses most of the documents, since the centroid falls outside the clusters
kNN (b): overcomes this problem of Rocchio
Support Vector Machine
A Support Vector Machine (SVM) looks for the hyperplane with the maximum margin between the positive and negative training examples.

[Figure: optimal hyperplane with maximum margin; the support vectors lie on the margin]
Support Vector Machine
Does not use all training documents, only the support vectors (the documents nearest the decision border).
Also applicable when the positives and negatives are not linearly separable.
Text Categorization Evaluation
Precision: pre(ci) = TPi / (TPi + FPi)

Recall: rec(ci) = TPi / (TPi + FNi)

F1, the harmonic mean of precision and recall: F1(ci) = 2 · pre · rec / (pre + rec)

The counts TPi, FPi, FNi come from the contingency table for ci.
Breakeven: the value at which precision equals recall.
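The measures can be computed directly from per-category contingency counts. The counts below are hypothetical, purely to exercise the formulas.

```python
# Sketch: per-category precision, recall, and F1 from contingency counts
# TPi (true positives), FPi (false positives), FNi (false negatives).

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical counts for one category ci
tp, fp, fn = 8, 2, 4
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f1(p, r))  # 0.8, ~0.667, ~0.727
```

The harmonic mean punishes imbalance: a classifier with high precision but near-zero recall still gets a near-zero F1.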
[Figure: Pi and Ri plotted against the category threshold τi; the two curves cross at the breakeven point]
References
Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, pp. 1–47, 2002.
Manu Konchady, Text Mining Application Programming, Charles River Media, 2006.
Thank you!