Concept Ontology for Text Classification Bhumika Thakker.

Concept Ontology for Text Classification

Bhumika Thakker

Outline

Introduction

Techniques Used

Shrinkage in a Hierarchy of Classes

Classification using very few words

Enhanced Word Clustering for Hierarchical Text Classification

Classification of Web Content

References

Introduction

There has been an explosion in the availability of online information, with millions of documents on every topic easily accessible via the internet

As the amount of available information grows, people are increasingly unable to assimilate and profitably use it

It therefore becomes important to organize this mass of information

Techniques for Classification

Shrinkage in a Hierarchy of Classes

Classification using very few words

Word Clustering for Hierarchical Text Classification

Shrinkage in a Hierarchy of Classes

The technique leverages commonly available topic hierarchies to significantly improve classification accuracy

It works well even when the hierarchy is large and the training data is sparse

It also includes a method for exponentially reducing the amount of computation necessary for classification, sacrificing only a small amount of accuracy

Shrinkage in a Hierarchy of Classes

The approach applies a technique from statistics called shrinkage

It provides improved estimates of parameters

The technique exploits a hierarchy by “shrinking” parameter estimates in data-sparse children toward the estimates of data-rich ancestors

It employs a simple form of shrinkage that creates new parameter estimates in a child by a linear interpolation of all hierarchy nodes from the child to the root

Shrinkage in a Hierarchy of Classes

Shrinkage for Text Classification

Shrinkage is used to better estimate the probability of a word given a class, $\theta_{jt}$

For each node in the tree, a maximum likelihood (ML) estimate is constructed from the data associated with that node

An improved estimate for each leaf node is then derived by “shrinking” its ML estimate towards the ML estimates of all its ancestors

A unigram model is built for each node in the tree, and each leaf model is smoothed by linearly interpolating it with all the models found along the path to the root

Cont…

The estimates along the path from the leaf to the root represent a tradeoff between specificity and reliability

The estimate at the leaf is the most specific as it is based on the data from the topic alone

It is also the least reliable, as it is based on the least amount of data

The estimator at the root is the most reliable but the least specific

Cont…

Each child’s data is subtracted from its parent’s before calculating the parent’s ML estimate, to ensure that the ML estimates along a given path are independent

The parent’s estimate is thus based on the data that belongs to all the siblings of said child but not to the child itself

Thus, for any path from leaf to root, every datum in the tree is used in exactly one of the ML estimates, providing both independence among the estimates and efficient use of the training data

Determining the weights

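Let $\{\hat{\theta}^1_{jt}, \hat{\theta}^2_{jt}, \ldots, \hat{\theta}^k_{jt}\}$ be $k$ estimates, where $\hat{\theta}^1_{jt}$ is the estimate at the leaf, and $k-1$ is the depth of class $c_j$ in the tree

The interpolation weights among the ancestors of the class $c_j$ are written $\{\lambda^1_j, \lambda^2_j, \ldots, \lambda^k_j\}$, where $\sum_{i=1}^{k} \lambda^i_j = 1$

Writing $\check{\theta}_{jt}$ for the new estimate of the class-conditioned word probabilities based on shrinkage, the new estimate is:

$$\check{\theta}_{jt} = \lambda^1_j \hat{\theta}^1_{jt} + \lambda^2_j \hat{\theta}^2_{jt} + \cdots + \lambda^k_j \hat{\theta}^k_{jt}$$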
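The interpolation can be illustrated with a minimal Python sketch. The function name `shrink`, the dictionary representation of the estimates, and the example numbers are assumptions made for illustration; they are not taken from the original work.

```python
def shrink(path_estimates, weights):
    """Linearly interpolate word-probability estimates along a leaf-to-root path.

    path_estimates: list of dicts (word -> probability), leaf first, root last.
    weights: one non-negative mixture weight per estimate, summing to 1.
    """
    if len(path_estimates) != len(weights):
        raise ValueError("need one weight per estimate")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    vocab = set().union(*path_estimates)
    return {
        w: sum(lam * est.get(w, 0.0) for lam, est in zip(weights, path_estimates))
        for w in vocab
    }


# Hypothetical three-node path: a sparse leaf, its parent, and the root.
leaf = {"oil": 0.5, "gas": 0.5, "bank": 0.0}
parent = {"oil": 0.2, "gas": 0.2, "bank": 0.6}
root = {"oil": 0.1, "gas": 0.1, "bank": 0.8}
shrunk = shrink([leaf, parent, root], [0.5, 0.3, 0.2])
```

In this example the leaf never saw the word "bank", yet the shrunk estimate gives it non-zero probability because the ancestors did.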

The likelihood of the data according to the mixture model is a convex function of the weights and attains a single global maximum. This maximum is calculated for each leaf class cj by an iterative procedure

The algorithm can be viewed as a particularly simple form of EM (Expectation maximization)

Each datum is assumed to have been generated by first choosing one of the tree nodes on the path to the root, and then using that node’s estimate to generate the datum

EM then maximizes the total likelihood when the choices of estimates made for the various data are unknown

The first step in the iterative part is thus the E step and the second one is the M step
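A minimal sketch of such an EM loop is given below, assuming uniform initial weights, a fixed number of iterations, and a list of held-out word occurrences for the class. The function name `em_shrinkage_weights` and these choices are illustrative assumptions, not the authors' exact procedure.

```python
def em_shrinkage_weights(path_estimates, heldout_words, iterations=50):
    """Estimate mixture weights for one leaf class by EM.

    path_estimates: list of dicts (word -> probability), leaf first, root last.
    heldout_words: word occurrences from held-out documents of this class.
    """
    k = len(path_estimates)
    lam = [1.0 / k] * k  # assumed initialization: uniform weights
    for _ in range(iterations):
        # E step: expected number of held-out words generated by each node.
        beta = [0.0] * k
        for w in heldout_words:
            contrib = [lam[i] * path_estimates[i].get(w, 0.0) for i in range(k)]
            total = sum(contrib)
            if total == 0.0:
                continue  # no node on the path explains this word
            for i in range(k):
                beta[i] += contrib[i] / total
        # M step: renormalize the expected counts into new weights.
        norm = sum(beta)
        if norm == 0.0:
            break
        lam = [b / norm for b in beta]
    return lam


# Hypothetical two-node path: the leaf matches the held-out data better.
leaf_est = {"a": 0.9, "b": 0.1}
root_est = {"a": 0.5, "b": 0.5}
weights = em_shrinkage_weights([leaf_est, root_est], ["a"] * 9 + ["b"])
```

Because the held-out words here follow the leaf distribution closely, the weight on the leaf estimate grows with each iteration.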

The method makes inefficient use of the available training data by carving off some of it to be used as a held-out set

To overcome this problem the algorithm is modified as follows:

All the available data is used both to construct the ML estimates and to optimize the weights

As each document is used in the above algorithm, the ML estimates are modified to exclude its data so as to make them independent of it

This method is known as “leave-one-out” or “jackknifing”

Experimental Results

The following figure shows classification accuracy on the industry sector data set with 50-50 train-test splits while varying vocabulary size

Larger vocabulary sizes generally perform better

Hierarchical feature selection somewhat improves the performance of flat naive Bayes in the mid range of feature selection, at about 5000 words

Hierarchical feature selection reaches 64% accuracy

Shrinkage improves classification accuracy to 74%

Shrinkage helps more when training data is sparse

The following figure shows accuracy on the Newsgroups data set with the full vocabulary and varying amounts of training data

Accuracy in this domain is highest with no feature selection for both classifiers with small amounts of training data

Shrinkage provides more improvement when the amount of training data is small, and shrinkage reduces variance in the classifications

Estimates are improved by using shrinkage to smooth a class’s parameters with its ancestors

The following figure shows classification accuracy on the Science hierarchy as a function of vocabulary size

The best performance of the two classifiers is equal

Among the 51 classes with fewer than 50 training documents, shrinkage provides a 6% improvement in accuracy, from 39% for flat to 45% for hierarchical

Pruning the Tree for Increased Computational Efficiency

The class hierarchy can be leveraged to improve computational efficiency

The classifier can avoid calculating the class probability for a majority of the classes by pruning the tree dynamically during the classification of each document
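One simple way to prune, assumed here purely for illustration, is a greedy top-down descent: at each level only the children of the best-scoring node are scored, so whole subtrees are never evaluated. The `Node` class, `log_score`, the probability floor, and the toy tree are hypothetical; the original method may prune differently.

```python
import math


class Node:
    """A class-hierarchy node with a word distribution (hypothetical structure)."""

    def __init__(self, name, word_probs, children=()):
        self.name = name
        self.word_probs = word_probs
        self.children = list(children)


def log_score(node, words, floor=1e-6):
    # Log-likelihood of the words under the node's distribution, with a small
    # floor so that unseen words do not produce log(0).
    return sum(math.log(max(node.word_probs.get(w, 0.0), floor)) for w in words)


def classify_pruned(root, words):
    """Descend greedily and return (leaf name, number of nodes scored)."""
    node, scored = root, 0
    while node.children:
        scored += len(node.children)
        node = max(node.children, key=lambda c: log_score(c, words))
    return node.name, scored


physics = Node("physics", {"quantum": 0.8, "cell": 0.2})
biology = Node("biology", {"quantum": 0.2, "cell": 0.8})
soccer = Node("soccer", {"goal": 0.9, "serve": 0.1})
tennis = Node("tennis", {"goal": 0.1, "serve": 0.9})
science = Node("science", {"quantum": 0.4, "cell": 0.4, "goal": 0.1, "serve": 0.1},
               [physics, biology])
sports = Node("sports", {"quantum": 0.05, "cell": 0.05, "goal": 0.45, "serve": 0.45},
              [soccer, tennis])
tree = Node("root", {}, [science, sports])

label, nodes_scored = classify_pruned(tree, ["quantum", "quantum", "cell"])
```

With this strategy the number of nodes scored grows with depth times branching factor rather than with the number of leaves, at the risk of an early wrong turn that cannot be undone.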

Classification using very few words

The task is to categorize documents according to their topic, where topics are organized in a hierarchy of increasing specificity

The bottleneck in this classification task is the need for a person to read each document and decide on its appropriate place in the hierarchy

This is avoided by automatically classifying new documents using machine learning techniques

The approach is to divide the classification task into a set of smaller classification tasks

Each corresponds to some split in classification hierarchy

The key insight is that each subtask is significantly simpler than original task

The classifier at each node in the hierarchy needs to distinguish between only a small number of categories

This is possible using a small set of features

The ability to restrict to a very small feature set avoids many of the difficulties that arise with large feature sets

Such models are more robust and less subject to overfitting

Thus they achieve better accuracy even for a very simple classifier such as Naive Bayes

The key point is not merely the use of feature selection but its integration with the hierarchical structure

Choosing a small set of features would not give good performance if used for classification over a flattened class space

In the hierarchical approach, any document only sees a small fraction of the features throughout the process

The features which it does see are divided so as to focus the attention of the classifier on the features relevant to the classification subtask at hand

Probabilistic Framework

The approach constructs a hierarchical set of classifiers, each based on its own set of relevant features

It uses two main subroutines:

A feature selection algorithm for deciding on the appropriate feature set at each decision point

A supervised learning algorithm for constructing a classifier for that decision
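How the two subroutines fit together can be sketched as follows. This is an illustrative sketch only: the `NodeClassifier` helper, the hand-picked feature sets, the Laplace smoothing, and the toy training data are assumptions, and a real system would choose each node's features with the feature selection algorithm.

```python
import math
from collections import Counter


class NodeClassifier:
    """Naive Bayes restricted to a node-specific feature set (hypothetical helper)."""

    def __init__(self, features):
        self.features = set(features)
        self.word_counts = {}        # child label -> Counter of feature counts
        self.doc_counts = Counter()  # child label -> number of training docs

    def fit(self, docs, labels):
        for words, label in zip(docs, labels):
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(
                w for w in words if w in self.features)
        return self

    def predict(self, words):
        kept = [w for w in words if w in self.features]  # ignore all other words
        total_docs = sum(self.doc_counts.values())

        def score(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.features)  # Laplace smoothing
            prior = math.log(self.doc_counts[label] / total_docs)
            return prior + sum(math.log((counts[w] + 1) / denom) for w in kept)

        return max(self.doc_counts, key=score)


def classify(hierarchy, root, words):
    """Walk down from the root; labels without a classifier are leaves."""
    node = root
    while node in hierarchy:
        node = hierarchy[node].predict(words)
    return node


top = NodeClassifier({"lab", "cell", "team", "goal"}).fit(
    [["atom", "cell", "lab"], ["goal", "match", "team"], ["cell", "lab"], ["team", "goal"]],
    ["science", "sports", "science", "sports"])
sci = NodeClassifier({"atom", "quark", "cell", "gene"}).fit(
    [["atom", "quark", "lab"], ["cell", "gene", "lab"]],
    ["physics", "biology"])
hierarchy = {"root": top, "science": sci}

result = classify(hierarchy, "root", ["lab", "quark", "atom"])
```

Each classifier only ever looks at its own four features, so the top-level decision is driven by "lab" while the second-level decision is driven by "quark" and "atom".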

Bayesian Classifiers

A Bayesian network allows us to provide compact descriptions of complex distributions over a large number of random variables

It uses a directed acyclic graph to encode conditional independence assumptions about the domain

Independence assumptions allow the distribution to be described as a product of small local interaction models

A Bayesian classifier is simply a Bayesian network applied to a classification domain

It contains a node C for the class variable and a node Xi for each of the features

Given a specific instance x, the Bayesian network allows us to compute the probability of each possible class ck

Bayes optimal classification can be achieved by simply selecting the class ck for which this probability is maximized
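For the naive Bayes special case, where the features are assumed to be conditionally independent given the class, the selection rule can be written out explicitly. This is a worked form under that assumption, with $x_i$ denoting the observed value of feature $X_i$ in the instance $x$:

```latex
c^{*} = \arg\max_{c_k} P(c_k \mid x)
      = \arg\max_{c_k} P(c_k) \prod_{i} P(x_i \mid c_k)
```

The denominator $P(x)$ is dropped because it is the same for every class and does not affect the maximization.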

Feature Selection

Feature selection employs information-theoretic measures to determine a subset of the original domain features that seems to best capture the class distribution in the data

For each feature Xi, the algorithm determines the expected cross-entropy between the class distribution conditioned on all domain features and the class distribution conditioned on all domain features except Xi

It eliminates the feature Xi for which this quantity is minimized

Feature Selection

The eliminated feature is the one that least disrupts the original conditional class distribution

This process can be iterated to eliminate as many features as desired

In this respect the algorithm is very applicable to text domains with many features
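The elimination loop can be sketched on a toy dataset as below. This sketch uses exact empirical conditional distributions and the KL divergence as the disruption measure, which is only feasible for a handful of features; a practical algorithm for text would need an approximation. All function names and the data are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def class_dist(rows, labels, keep):
    """Empirical P(C | values of the features in `keep`) per observed projection."""
    counts = defaultdict(Counter)
    for x, c in zip(rows, labels):
        counts[tuple(x[i] for i in keep)][c] += 1
    return {key: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for key, cnt in counts.items()}


def expected_divergence(rows, labels, keep, drop):
    """Average KL divergence between P(C | keep) and P(C | keep without drop)."""
    reduced = [i for i in keep if i != drop]
    full_d = class_dist(rows, labels, keep)
    red_d = class_dist(rows, labels, reduced)
    total = 0.0
    for x in rows:
        p = full_d[tuple(x[i] for i in keep)]
        q = red_d[tuple(x[i] for i in reduced)]
        total += sum(pc * math.log(pc / q[c]) for c, pc in p.items())
    return total / len(rows)


def eliminate(rows, labels, n_keep):
    """Repeatedly drop the feature whose removal disturbs P(C | features) least."""
    keep = list(range(len(rows[0])))
    while len(keep) > n_keep:
        least = min(keep, key=lambda i: expected_divergence(rows, labels, keep, i))
        keep.remove(least)
    return keep


# Toy data: feature 0 determines the class, features 1 and 2 are noise.
rows = [(1, 0, 1), (1, 1, 0), (0, 0, 0), (0, 1, 1), (1, 0, 0), (0, 1, 0)]
labels = [x[0] for x in rows]
kept = eliminate(rows, labels, 1)
```

On this data the two noise features are removed first, since dropping them leaves the class distribution unchanged.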

Experimental Results

Both hierarchical and flat classification schemes were run on the datasets without employing any probabilistic feature selection

The original number of features in each data set and the results are given in the following table

Two important phenomena are observed:

In the Hier1 dataset, the very large number of features used precludes the hierarchical scheme from performing better than the simple flat method

In the Hier2 dataset, the large number of features and small dataset size allow the more expressive KDB algorithm to overfit the training data

These results provide an empirical motivation for integration of feature selection

The table shows that the hierarchical method clearly outperforms the flat classification method when considering a direct comparison of the 10 and 20 feature runs

The hierarchical method produces 8-41% fewer errors than the flat method for Hier1, and more modest relative gains for Hier2

The results show that the feature selection stage does serve to focus the algorithm on the features relevant to the local classification task

The table shows the set of 10 features found to be most discriminating at each level of hierarchy learned for the Hier1 dataset

At the top level of the hierarchy, high-level terms are selected from various major topics

At lower levels, more specific words distinguishing the subtopics are seen

Enhanced Word Clustering for Hierarchical Text Classification

Distributional clustering of words/features is one of the ways to reduce dimensionality

Each word cluster can be treated as a single feature and thus dimensionality can be drastically reduced

Feature clustering is more effective than feature selection especially at lower number of features.

Feature clustering appears to preserve classification accuracy as compared to a full-feature classifier

In case of small training sets and noisy features, word clustering can actually increase classification accuracy.

The algorithms previously used for word clustering are agglomerative in nature, yielding sub-optimal word clusters at a high computational cost

DISTRIBUTIONAL WORD CLUSTERING

Let C be a discrete random variable that takes on values from the set of classes

Let W be the random variable that ranges over the set of words

The joint distribution p(C,W) can be estimated from the training set

Now suppose we cluster the words into k clusters W1, …, Wk

Since we are interested in reducing the number of features and the model size, we only look at “hard” clustering where each word belongs to exactly one word cluster, i.e. the clusters are disjoint and together cover the whole vocabulary

Let the random variable $W^C$ range over the word clusters

The information about C captured by W can be measured by the mutual information I(C;W)

Ideally, forming word clusters would preserve the mutual information exactly

However, clustering usually lowers mutual information

Thus it is essential to find a clustering that minimizes the decrease in mutual information, $I(C;W) - I(C;W^C)$
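The quantity being minimized can be computed directly from a joint distribution table. The sketch below is a minimal illustration; the function names and the toy joint distribution are assumptions, and it only evaluates a given hard clustering rather than searching for the best one.

```python
import math


def mutual_information(joint):
    """I(C;W) in nats, where joint[c][w] holds the joint probability p(c, w)."""
    pc = [sum(row) for row in joint]
    pw = [sum(col) for col in zip(*joint)]
    return sum(p * math.log(p / (pc[i] * pw[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)


def cluster_joint(joint, assignment, k):
    """Joint distribution over (class, cluster) for a hard word-to-cluster map."""
    merged = [[0.0] * k for _ in joint]
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            merged[i][assignment[j]] += p
    return merged


# Toy joint over 2 classes and 4 words: words 0 and 1 indicate class 0,
# words 2 and 3 indicate class 1.
joint = [[0.20, 0.20, 0.05, 0.05],
         [0.05, 0.05, 0.20, 0.20]]
full_mi = mutual_information(joint)

# Grouping words with the same class behaviour loses (almost) nothing.
loss_good = full_mi - mutual_information(cluster_joint(joint, [0, 0, 1, 1], 2))
# Mixing them destroys all the class information.
loss_bad = full_mi - mutual_information(cluster_joint(joint, [0, 1, 0, 1], 2))
```

Words with identical class-conditional behaviour can be merged without losing information about the class, which is why a good clustering keeps the decrease small.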

Classifying using Word Clusters

The Naive Bayes method can be translated directly into using word clusters instead of words

This is done by estimating new parameters p(Ws|ci) for word clusters, analogous to the word parameters p(wt|ci)

When estimates of p(wt|ci) for individual words are relatively poor, the corresponding word cluster parameters p(Ws|ci) provide more robust estimates, resulting in higher classification scores

The Naive Bayes rule for classifying a test document d can then be rewritten in terms of the word cluster parameters
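A minimal sketch of Naive Bayes over word clusters follows. It assumes that a cluster's count is the sum of its member words' counts and uses Laplace smoothing; both are illustrative choices, and the function names and toy data are hypothetical rather than the exact rule from the original paper.

```python
import math
from collections import Counter


def train_cluster_nb(docs, labels, word_to_cluster, k):
    """Estimate log priors and log p(cluster | class) from labeled documents."""
    class_docs = Counter(labels)
    cluster_counts = {c: [0] * k for c in class_docs}
    for words, c in zip(docs, labels):
        for w in words:
            if w in word_to_cluster:
                cluster_counts[c][word_to_cluster[w]] += 1
    n = len(labels)
    log_prior = {c: math.log(class_docs[c] / n) for c in class_docs}
    log_lik = {}
    for c, counts in cluster_counts.items():
        total = sum(counts) + k  # Laplace smoothing over the k clusters
        log_lik[c] = [math.log((m + 1) / total) for m in counts]
    return log_prior, log_lik


def classify_cluster_nb(words, word_to_cluster, log_prior, log_lik):
    def score(c):
        return log_prior[c] + sum(log_lik[c][word_to_cluster[w]]
                                  for w in words if w in word_to_cluster)
    return max(log_prior, key=score)


word_to_cluster = {"goal": 0, "match": 0, "atom": 1, "quark": 1}
log_prior, log_lik = train_cluster_nb(
    [["goal", "goal"], ["atom"]], ["sports", "science"], word_to_cluster, 2)

# "match" never occurs in the training documents, but it shares a cluster with "goal".
predicted = classify_cluster_nb(["match", "match"], word_to_cluster, log_prior, log_lik)
```

The test document contains only a word unseen in training, yet it is classified sensibly because the cluster parameter pools evidence from its cluster mates.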

Figures 2 and 3 plot the fraction of mutual information lost against the number of clusters for both the divisive and agglomerative algorithms on the 20Ng and Dmoz data sets.

Less mutual information is lost with Divisive Clustering than with Agglomerative Clustering for every number of clusters, though the difference is more pronounced at lower numbers of clusters

Figure 5 shows classification accuracies on the 20 Newsgroups data set for the algorithms considered.

The horizontal axis indicates the number of features/clusters used in the classification model while the vertical axis indicates the percentage of test documents that were classified correctly

Divisive Clustering (SVM as well as NB) achieves significantly better results at lower numbers of features than feature selection using Information Gain and Mutual Information

With 50 clusters, Divisive Clustering (NB) achieves 78.05% accuracy

The largest gain occurs when the number of clusters equals the number of classes

In Figure 6, the classification accuracy is plotted on 20Ng data using Naive Bayes when the training data is sparse.

2% of the available data is taken, that is 20 documents per class, for training and tested on the remaining 98% of the documents

The results are averages of 5-10 trials

Divisive Clustering obtains better results than Information Gain at all numbers of features

It also achieves a significant 12% increase over the maximum accuracy achieved by Information Gain

This is in contrast to Figure 5, where Information Gain eventually catches up as the number of features increases

When the training data is small, the word-by-class frequency matrix contains many zero entries

By clustering words, more robust estimates of word-class probabilities are obtained, which lead to higher classification accuracies

Classification of Web Content

Utilizing the known hierarchical structure, the classification problem is decomposed into a set of smaller problems corresponding to hierarchical splits in the tree

The aim is to first learn to distinguish among classes at the top level; lower level distinctions are then learned only within the appropriate top level of the tree.

Each of these sub-problems can be solved much more efficiently, and more accurately as well

Classification of Web Content

The use of a hierarchical decomposition of a classification problem allows for efficiencies in both learning and representation.

Each sub-problem is smaller than the original problem

It is sometimes possible to use a much smaller set of features for each

The hierarchical structure can also be used to set the negative set for discriminative training, and at classification time to combine information from different levels

SVM for Web Content

SVMs have been found to be efficient and effective for text classification, but had not previously been explored in the context of hierarchical classification.

The efficiency of SVMs for both initial learning and real-time classification makes them applicable to large dynamic collections like web content

Hierarchical structure used in Web Content

Hierarchical structure is used for two purposes:

Train second-level category models using different contrast sets (either within the same top-level category in the hierarchical case, or across all categories in the flat non hierarchical case).

Combine scores from the top- and second-level models using different combination rules, some requiring a threshold to be exceeded at the top level before second level comparisons are made.
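Two possible combination rules are sketched below: a sequential rule that requires the top-level score to pass its own threshold first, and a multiplicative rule that thresholds the product of the two scores. The function names and the threshold values in the example are assumptions for illustration, with scores taken to be probabilities in [0, 1].

```python
def boolean_rule(p_top, p_second, top_threshold, second_threshold):
    """Accept only if the top level passes its threshold and then the second does."""
    return p_top >= top_threshold and p_second >= second_threshold


def multiplicative_rule(p_top, p_second, threshold):
    """Accept if the product of the top- and second-level scores is high enough."""
    return p_top * p_second >= threshold


# A confident top-level match with a weak second-level match.
accept_boolean = boolean_rule(0.9, 0.4, top_threshold=0.5, second_threshold=0.5)
accept_product = multiplicative_rule(0.9, 0.4, threshold=0.3)
```

The sequential rule also saves work: when the top-level threshold is not met, the second-level models under that category need not be evaluated at all.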

Classifying web search results

Classification techniques are used to automatically organize search results into existing hierarchical structures

Classification models are learned offline using a training set of human-labeled documents and web categories

Classification offers two advantages compared to clustering:

Run-time classification is very efficient

Manually generated category names are easily understood

Constraints

To support the goal of automatically classifying web search results, two constraints are imposed:

Use just the short summaries returned from web search engines, since it takes too long to retrieve the full text of pages in a networked environment. These automatically generated summaries are much shorter than the texts used in most classification experiments, and they are much noisier than other document surrogates, like abstracts, that some have worked with.

Focus on the top levels of the hierarchy, since many search results can be usefully disambiguated at this level.

An interface that tightly couples search results and category structure was developed, and large preference and performance advantages were found for automatically classified search results

TEXT CLASSIFICATION USING SVMs

Text classification involves a training phase and a testing phase.

In the training phase, a large set of web pages with known category labels is used to train a classifier.

An initial model is built using a subset of the labeled data, and a holdout set is used to identify optimal model parameters.

During the testing or operational phase, the learned classifier is used to classify new web pages.

A support vector machine (SVM) algorithm was used as the classifier

A linear SVM is a hyperplane that separates a set of positive examples from a set of negative examples with maximum margin.

The margin is the distance from the hyperplane to the nearest of the positive and negative examples.

In the linearly separable case maximizing the margin can be expressed as an optimization problem

In cases where points are not linearly separable, slack variables are introduced that permit, but penalize, points that fall on the wrong side of the decision boundary
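In the standard textbook formulation, stated here as a reference form rather than as the exact notation of the original work, the two optimization problems are as follows. The training examples are $x_i$ with labels $y_i \in \{-1, +1\}$, the hyperplane is given by a weight vector $w$ and offset $b$, the $\xi_i$ are the slack variables, and $C$ is the penalty on margin violations:

```latex
% Linearly separable case (hard margin)
\min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \ \ \forall i

% Non-separable case (soft margin with slack variables)
\min_{w,\,b,\,\xi} \ \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i} \xi_i
\quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0 \ \ \forall i
```

Minimizing $\lVert w \rVert$ is equivalent to maximizing the margin, which equals $1 / \lVert w \rVert$ under this scaling.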

For problems that are not linearly separable, kernel methods can be used to transform the input space so that some non-linear problems can be learned.

The simplest linear form of the SVM was used because it provided good classification accuracy and is fast to learn and apply

Feature Selection in Web Content

The feature space is reduced by eliminating words that appear in only a single document, then selecting the 1000 words with the highest mutual information with each category.

The mutual information MI(F, C) between a feature, F, and a category, C, is defined as:

$$MI(F, C) = \sum_{F \in \{f, \bar{f}\}} \sum_{C \in \{c, \bar{c}\}} P(F, C) \log \frac{P(F, C)}{P(F)\,P(C)}$$

The mutual information is computed between every pair of features and categories
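For a binary feature (word present or absent) and a binary category (document in or out), the mutual information can be computed from four document counts. The sketch below assumes this binary setup and natural logarithms; the function name and argument names are illustrative.

```python
import math


def feature_category_mi(n11, n10, n01, n00):
    """Mutual information between feature presence and category membership.

    n11: docs with the feature and in the category
    n10: docs with the feature, not in the category
    n01: docs without the feature, in the category
    n00: docs with neither
    """
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for joint, f_marginal, c_marginal in (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ):
        if joint:  # a zero cell contributes nothing to the sum
            mi += (joint / n) * math.log(joint * n / (f_marginal * c_marginal))
    return mi


independent = feature_category_mi(25, 25, 25, 25)  # feature says nothing
perfect = feature_category_mi(50, 0, 0, 50)        # feature decides the category
```

Ranking words by this score for a category and keeping the top-scoring ones gives the reduced feature set for that category.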

SVM Parameters

In addition to varying the number of features, SVM performance is governed by two parameters:

C (the penalty imposed on training examples that fall on the wrong side of the decision boundary)

p (the decision threshold)

The default C parameter value (0.01) is used

The decision threshold, p, can be set to control precision and recall for different tasks

Increasing p results in fewer test items meeting the criterion, and this usually increases precision but decreases recall

Conversely, decreasing p typically decreases precision but increases recall

p is chosen so as to optimize performance on the F1 measure on a training validation set
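Choosing the threshold can be sketched as a simple search over candidate values on the validation set. The candidate set (the observed scores), the use of "greater than or equal" as the acceptance test, and the function names are assumptions made for this illustration.

```python
def f1_at_threshold(scores, labels, p):
    """F1 when every item with score >= p is assigned to the category."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= p and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= p and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < p and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Try each observed score as the threshold and keep the one with highest F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda p: f1_at_threshold(scores, labels, p))


# Hypothetical validation scores for one category, with true membership flags.
val_scores = [0.9, 0.8, 0.6, 0.4, 0.2]
val_labels = [True, True, False, True, False]
chosen_p = best_threshold(val_scores, val_labels)
```

In this example a high threshold gives perfect precision but misses a positive item, while the lowest threshold accepts everything; the F1-optimal choice lies in between.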

Results

Decision thresholds were established on a training validation set

For each category, if a test item exceeds the decision threshold, it is judged to be in the category

A test item can be in zero, one, or more than one categories

From this, precision (P) and recall (R) are computed

These are micro-averaged to weight the contribution of each category by the number of test examples in it

For each test example, the probability of it being in each of the 13 top-level categories and each of the 150 second-level categories is computed
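Micro-averaging pools the per-category decision counts before computing precision and recall, so large categories carry more weight than small ones. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def micro_precision_recall(per_category):
    """Micro-averaged precision and recall.

    per_category: list of (true_positives, false_positives, false_negatives)
    tuples, one per category.
    """
    tp = sum(c[0] for c in per_category)
    fp = sum(c[1] for c in per_category)
    fn = sum(c[2] for c in per_category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# One large, well-classified category and one small, poorly classified one.
counts = [(90, 10, 10), (1, 4, 9)]
micro_p, micro_r = micro_precision_recall(counts)
```

Here the small category has precision 0.2 and recall 0.1 on its own, but the micro-averaged figures stay close to those of the large category because it contributes most of the decisions.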

References

Improving Text Classification by Shrinkage in a Hierarchy of Classes by Andrew McCallum, Ronald Rosenfeld, Tom Mitchell, and Andrew Y. Ng

http://www.cs.cmu.edu/~mccallum/papers/hier-icml98.ps.gz

Hierarchically classifying documents using very few words by Daphne Koller and Mehran Sahami

http://dbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&doc=1997-75&format=pdf&compression=&name=1997-75.pdf

Hierarchical Classification of Web Content by Susan Dumais and Hao Chen

http://research.microsoft.com/~sdumais/sigir00.pdf

Enhanced Word Clustering for Hierarchical Text Classification by Inderjit S. Dhillon, Subramanyam Mallela, and Rahul Kumar

http://www.cs.utexas.edu/users/inderjit/public_papers/hierdist.pdf

THANK YOU
