Text Categorization
Quang Nguyen, Saltlux Vietnam Development Center
Sept 17, 2010
Contents

Definition
Text Categorization Division
Dimensionality Reduction
Machine Learning Approaches
Text Categorization Evaluation
Definition
Text categorization (TC, also known as text classification or topic spotting): the activity of labelling natural language texts with thematic categories from a predefined set.

[Figure: example of assigning thematic category labels to texts]
Formal Definition
TC assigns a boolean value to each pair (dj, ci) ∈ D × C, where D is a domain of documents and C = {c1, …, c|C|} is a predefined set of categories.
Value T (true): file dj under ci
Value F (false): do not file dj under ci
Applications of Text Categorization
Document organization
Text filtering
Word sense disambiguation (WSD): consider word occurrences as documents and word senses as categories
Text Categorization Division
Single-label vs. multi-label TC
Single-label: exactly one category must be assigned to each dj ∈ D
Multi-label: any number of categories from 0 to |C| may be assigned to the same dj ∈ D

Category-pivoted vs. document-pivoted TC
Document-pivoted: given dj ∈ D, find all ci ∈ C under which it should be filed
Category-pivoted: given ci ∈ C, find all dj ∈ D that should be filed under it
Text Categorization Division
Hard categorization vs. ranking categorization
Hard: requires T or F for each pair (dj, ci)
Ranking: given dj ∈ D, rank the categories in C = {c1, …, c|C|} according to their estimated appropriateness to dj

Rule-based categorization vs. machine learning categorization
Rule-based: hand-crafted rules can give very good results, but suffer from the knowledge acquisition bottleneck (rules must be elicited from domain experts)
Machine Learning Approach to Text Categorization
To construct a ranking classifier that, given dj ∈ D, finds all ci ∈ C, we need:
A function CSVi: D → [0, 1], whose value in [0, 1] represents the evidence for the fact that dj ∈ ci
A category threshold τi such that CSVi(dj) ≥ τi is interpreted as T and CSVi(dj) < τi is interpreted as F
Training set: the classifier CSVi is inductively built by observing the characteristics of these documents
Test set: used for testing the effectiveness of the classifiers
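The scoring-plus-threshold scheme above can be sketched in a few lines. This is a minimal illustration, not a learned classifier: the scores and the threshold value are hypothetical stand-ins for outputs of a trained model.

```python
# Sketch: a scoring function returning values in [0, 1], plus a
# per-category threshold turning scores into hard T/F decisions.
# The scores below are illustrative, not produced by a real learner.

def hard_decision(csv_score: float, threshold: float) -> bool:
    """A score >= the category threshold is interpreted as T, otherwise F."""
    return csv_score >= threshold

def rank_categories(scores: dict) -> list:
    """Ranking categorization: order categories by estimated appropriateness."""
    return sorted(scores, key=scores.get, reverse=True)

scores = {"sport": 0.82, "politics": 0.35, "finance": 0.61}  # hypothetical scores
print(rank_categories(scores))               # ['sport', 'finance', 'politics']
print(hard_decision(scores["sport"], 0.6))   # True
```

The same score vector thus supports both ranking categorization (sort) and hard categorization (threshold).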
Document Representation
Document dj is usually represented as a vector dj = (w1j, …, w|T|j), where T is the set of terms (features) and 0 ≤ wkj ≤ 1.

Most systems use TF-IDF for wkj:

wkj = TF·IDF(tk, dj) = TF(tk, dj) · log(|D| / DF(tk))
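The weighting can be sketched as follows. This uses raw term counts for TF; real systems typically add a normalization step (e.g. cosine normalization) so that the weights satisfy 0 ≤ wkj ≤ 1.

```python
import math
from collections import Counter

# Sketch of TF-IDF weighting: wkj = TF(tk, dj) * log(|D| / DF(tk)),
# where DF(tk) is the number of documents containing term tk.

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    df = Counter()                       # DF(tk): documents containing tk
    for doc in docs:
        df.update(set(doc))
    n = len(docs)                        # |D|
    weights = []
    for doc in docs:
        tf = Counter(doc)                # TF(tk, dj): raw count of tk in dj
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["cat", "sat", "mat"], ["cat", "ran"], ["dog", "ran", "fast"]]
w = tfidf_weights(docs)
# "cat" appears in 2 of 3 docs, so in doc 0 its weight is 1 * log(3/2)
```

Note that a term occurring in every document gets weight log(1) = 0, reflecting that it carries no discriminating information.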
Dimensionality Reduction
Definition: DR reduces the size of the vector space from |T| to |T′| << |T|; the set T′ is called the reduced term set.

Benefits:
Reduces index size
Reduces overfitting

Distinction:
DR by term selection: T′ is a subset of T
DR by term extraction: the terms in T′ are not of the same type as the terms in T (e.g. if the terms in T are words, the terms in T′ may not be words at all)
DR by Term Selection
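As one concrete instance of DR by term selection, here is a minimal sketch that keeps only the terms with the highest document frequency, dropping rare terms. The cutoff parameter and all names are illustrative assumptions, not from the slides; other selection functions (e.g. chi-square, information gain) follow the same keep-the-top-scoring-terms pattern.

```python
from collections import Counter

# Sketch: term selection by document frequency. Terms occurring in the
# most documents are kept; rare terms are dropped from the vocabulary.

def select_terms_by_df(docs, keep: int):
    """docs: list of token lists. Returns the `keep` terms with highest DF."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))              # count each term once per document
    return [t for t, _ in df.most_common(keep)]

docs = [["cat", "sat"], ["cat", "ran"], ["dog", "ran"]]
print(select_terms_by_df(docs, 2))       # the two highest-DF terms: cat, ran
```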
Latent Semantic Indexing
Uses concepts instead of words.
A mathematical model relates documents and concepts:
Looks for concepts in the documents
Stores them in a concept space
Related documents are connected to form the concept space
No exact match with the query is needed.
Probabilistic Classifiers
View CSVi(dj) as P(ci | dj), the probability that document dj belongs to ci, computed using Bayes’ theorem:

P(ci | dj) = P(ci) · P(dj | ci) / P(dj)

P(dj): probability that a randomly picked document is dj
P(ci): probability that a randomly picked document belongs to ci
P(dj | ci): estimating this value is problematic, because the number of possible vectors dj is too high

To alleviate this problem, assume the terms are independent (giving the Naïve Bayes classifier):

P(dj | ci) = ∏ k=1..|T| P(wkj | ci)
Naïve Bayes Classifiers
Use binary-valued vector representations for documents.

Write pki as shorthand for P(wkj = 1 | ci); then

P(wkj | ci) = pki^wkj · (1 − pki)^(1 − wkj)

Plugging everything together:

P(ci | dj) ∝ P(ci) · ∏ k=1..|T| pki^wkj · (1 − pki)^(1 − wkj)

Defining the classifier for ci requires estimating the parameters {p1i, …, p|T|i} from the training data.
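A minimal sketch of this binary (Bernoulli-style) Naïve Bayes model follows. The add-one (Laplace) smoothing of the pki estimates is a standard assumption added here to avoid zero probabilities; it is not on the slides, and all data and names are illustrative.

```python
import math
from collections import defaultdict

# Sketch: binary Naive Bayes. Each pki = P(wk = 1 | ci) is estimated from
# training counts (with add-one smoothing), and P(dj | ci) multiplies
# pki for terms present in dj and (1 - pki) for terms absent from it.

def train_nb(docs, labels, vocab):
    """docs: list of token sets; labels: parallel list of categories."""
    prior, pki = {}, defaultdict(dict)
    for c in set(labels):
        idx = [i for i, l in enumerate(labels) if l == c]
        prior[c] = len(idx) / len(labels)
        for t in vocab:
            present = sum(1 for i in idx if t in docs[i])
            pki[c][t] = (present + 1) / (len(idx) + 2)   # Laplace smoothing
    return prior, pki

def log_posterior(doc, c, prior, pki, vocab):
    """log P(c) + sum over the vocabulary of log P(wk | c)."""
    score = math.log(prior[c])
    for t in vocab:
        p = pki[c][t]
        score += math.log(p if t in doc else 1 - p)
    return score

docs = [{"ball", "goal"}, {"ball", "team"}, {"vote", "law"}, {"law", "court"}]
labels = ["sport", "sport", "politics", "politics"]
vocab = {"ball", "goal", "team", "vote", "law", "court"}
prior, pki = train_nb(docs, labels, vocab)
test_doc = {"ball", "goal"}
best = max(prior, key=lambda c: log_posterior(test_doc, c, prior, pki, vocab))
print(best)  # sport
```

Working in log space avoids numerical underflow when |T| is large, since the product over the vocabulary becomes a sum.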
Decision Tree Classifiers
A decision tree classifier is a tree in which internal nodes are labelled by terms, branches departing from them are labelled by tests on the weight that the term has in the test document, and leaves are labelled by categories.
Decision Rule Classifiers
The classifier is built by an inductive rule learning method and consists of a DNF (disjunctive normal form) rule.
DNF rules are similar to decision trees, but rule learners tend to generate more compact classifiers than DT learners.
Rocchio Classifiers
Rocchio’s method computes a profile (centroid) vector for each category:

ci = (w1i, …, w|T|i), where wki = β · Σ dj∈POSi wkj / |POSi| − γ · Σ dj∈NEGi wkj / |NEGi|

β, γ: parameters adjusting the importance of positive and negative examples

Easy to implement, quite efficient.
Drawback: misses most documents if the documents in a category tend to occur in disjoint clusters.
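The profile computation can be sketched as below. The settings β = 16, γ = 4 are common illustrative choices, not values from the slides, and the toy two-dimensional vectors are assumptions.

```python
# Sketch: Rocchio category profile. Each component of ci averages the
# positive examples' weights (scaled by beta) and subtracts the average
# of the negative examples' weights (scaled by gamma).

def rocchio_profile(pos, neg, beta=16.0, gamma=4.0):
    """pos/neg: lists of equal-length document vectors for category ci."""
    dims = len(pos[0])
    profile = []
    for k in range(dims):
        p = sum(d[k] for d in pos) / len(pos)
        n = sum(d[k] for d in neg) / len(neg) if neg else 0.0
        profile.append(beta * p - gamma * n)
    return profile

pos = [[1.0, 0.0], [0.8, 0.2]]   # positive examples for ci
neg = [[0.0, 1.0]]               # negative example
print(rocchio_profile(pos, neg))  # term 1 weighted up, term 2 pushed negative
```

A test document is then scored by its similarity to this single profile vector, which is exactly why disjoint clusters of positives hurt: one centroid cannot sit inside several separated clusters at once.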
Neural Networks Classifiers
A Neural Network (NN) text classifier is a network of units.
The typical way of training NNs is backpropagation: the term weights of a training document are loaded into the input units, and if a misclassification occurs the error is “backpropagated” to adjust the parameters.

[Figure: feed-forward network with input units t1, …, t|T|, hidden layers, and output units c1, …, c|C|]
K-Nearest Neighbour Classifiers
To decide whether dj ∈ ci:
Find the k documents most similar to dj
Pick the most popular category among those k documents

A simple method that works well when the document similarity measure is accurate.
Training data is not used to build a model.
[Figure: a training set over categories c1, c2, c3, and kNN (k = 3) classification of a new document]
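The procedure can be sketched with cosine similarity over sparse vectors. All vectors and names here are illustrative; any accurate document similarity measure can take cosine’s place.

```python
import math
from collections import Counter

# Sketch: kNN categorization. Score a new document by cosine similarity
# against every training document, then vote among the k nearest.
# Documents are sparse {term: weight} dicts.

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(doc, training, k=3):
    """training: list of (vector, category) pairs."""
    ranked = sorted(training, key=lambda tc: cosine(doc, tc[0]), reverse=True)
    votes = Counter(c for _, c in ranked[:k])
    return votes.most_common(1)[0][0]

training = [
    ({"ball": 1.0, "goal": 1.0}, "sport"),
    ({"ball": 1.0, "team": 1.0}, "sport"),
    ({"vote": 1.0, "law": 1.0}, "politics"),
]
print(knn_classify({"ball": 1.0, "goal": 0.5}, training, k=3))  # sport
```

Because no model is built, all cost is paid at classification time: each new document is compared against the entire training set.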
Rocchio vs. kNN
Rocchio (a): misses most of the documents, since the centroid falls outside the clusters
kNN (b): overcomes this problem of Rocchio
Support Vector Machine
A Support Vector Machine (SVM) looks for the hyperplane with the maximum margin between the positive and negative training examples.

[Figure: optimal hyperplane with maximum margin; the support vectors lie on the margin]
Support Vector Machine
Does not use all training documents, only the support vectors (the documents nearest the decision border).
Also applicable when the positives and negatives are not linearly separable.
Text Categorization Evaluation
Precision: pre(ci) = TPi / (TPi + FPi)

Recall: rec(ci) = TPi / (TPi + FNi)

F1, the harmonic mean of precision and recall: F1(ci) = 2 · pre · rec / (pre + rec)

The counts TPi, FPi, FNi come from the contingency table for ci.
Breakeven: the value at which precision equals recall.
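The measures can be computed directly from per-category contingency counts. The counts below are hypothetical, purely to exercise the formulas.

```python
# Sketch: per-category precision, recall, and F1 from contingency counts
# TPi (true positives), FPi (false positives), FNi (false negatives).

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical counts for one category ci
tp, fp, fn = 8, 2, 4
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f1(p, r))  # 0.8, ~0.667, ~0.727
```

The harmonic mean punishes imbalance: a classifier with high precision but near-zero recall still gets a near-zero F1.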
[Figure: Pi and Ri plotted against the category threshold τi; the two curves cross at the breakeven point]
References
Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, pp. 1–47, 2002.
Manu Konchady, Text Mining Application Programming, Charles River Media, 2006.
Thank you!