Concept Ontology for Text Classification
Bhumika Thakker
Outline
Introduction
Techniques Used
Shrinkage in a Hierarchy of Classes
Classification using very few words
Enhanced Word Clustering for Hierarchical Text Classification
Classification of Web Content
References
Introduction
Online information has exploded, with millions of documents on every topic easily accessible via the internet
As the volume grows, it becomes apparent that people cannot assimilate and profitably use so much information
Organizing this mass of information therefore becomes important
Techniques for Classification
Shrinkage in a Hierarchy of Classes
Classification using very few words
Word Clustering for Hierarchical Text Classification
Shrinkage in a Hierarchy of Classes
The technique leverages commonly available topic hierarchies to significantly improve classification accuracy
It works well even when the hierarchy is large and the training data is sparse
It also provides a method for exponentially reducing the amount of computation necessary for classification, sacrificing only a small amount of accuracy
Shrinkage in a Hierarchy of Classes
The approach applies a technique from statistics called shrinkage
It provides improved estimates of parameters
The technique exploits a hierarchy by “shrinking” parameter estimates in data-sparse children toward the estimates of data-rich ancestors
It employs a simple form of shrinkage that creates new parameter estimates in a child by linear interpolation of all hierarchy nodes from the child to the root
Shrinkage in a Hierarchy of Classes: Shrinkage for Text Classification
Shrinkage is used to better estimate the probability of a word given a class, $\theta_{jt}$
For each node in the tree, a maximum likelihood (ML) estimate is constructed from the data associated with that node
An improved estimate for each leaf node is then derived by “shrinking” its ML estimate towards the ML estimates of all its ancestors
Equivalently, a unigram model is built for each node in the tree, and each leaf model is smoothed by linearly interpolating it with all the models found along the path to the root
Cont…
The estimates along the path from the leaf to the root represent a tradeoff between specificity and reliability
The estimate at the leaf is the most specific, as it is based on data from that topic alone
It is also the least reliable, as it is based on the least amount of data
The estimate at the root is the most reliable but the least specific
Cont…
To ensure that the ML estimates along a given path are independent, each child’s data is subtracted from its parent’s before the parent’s ML estimate is calculated
The parent’s estimate is thus based on the data that belongs to all the siblings of that child but not to the child itself
Thus, for any path from leaf to root, every datum in the tree is used in exactly one of the ML estimates, providing both independence among the estimates and efficient use of the training data
Determining the weights
Let $\{\hat\theta^1_{jt}, \hat\theta^2_{jt}, \ldots, \hat\theta^k_{jt}\}$ be k estimates, where $\hat\theta^1_{jt}$ is the estimate at the leaf, and k-1 is the depth of class $c_j$ in the tree
The interpolation weights among the ancestors of the class $c_j$ are written $\{\lambda^1_j, \lambda^2_j, \ldots, \lambda^k_j\}$, where $\sum_{i=1}^{k} \lambda^i_j = 1$
Writing $\check\theta_{jt}$ for the new estimate of the class-conditioned word probabilities, the estimate based on shrinkage is:
$$\check\theta_{jt} = \lambda^1_j \hat\theta^1_{jt} + \lambda^2_j \hat\theta^2_{jt} + \cdots + \lambda^k_j \hat\theta^k_{jt}$$
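The linear interpolation along a leaf-to-root path can be illustrated with a small sketch. The function name `shrinkage_estimate`, the dictionary representation of the per-node estimates, and the toy numbers are assumptions made for illustration; they are not taken from the original slides.

```python
def shrinkage_estimate(path_estimates, weights):
    """Interpolate word probabilities along a leaf-to-root path.

    path_estimates: list of dicts (word -> probability), leaf first, root last.
    weights: one non-negative weight per node on the path, summing to 1.
    """
    if len(path_estimates) != len(weights):
        raise ValueError("need exactly one weight per estimate")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    vocabulary = set().union(*path_estimates)
    # Each smoothed probability is a weighted sum of the node estimates.
    return {
        word: sum(lam * est.get(word, 0.0)
                  for lam, est in zip(weights, path_estimates))
        for word in vocabulary
    }


# Toy example: the sparse leaf has never seen "bank"; the root has.
leaf = {"oil": 0.5, "gas": 0.5, "bank": 0.0}
root = {"oil": 0.2, "gas": 0.2, "bank": 0.6}
smoothed = shrinkage_estimate([leaf, root], [0.75, 0.25])
```

In this toy case the leaf's zero probability for "bank" is pulled up to 0.15 by the root, while the result remains a proper distribution because the weights sum to 1.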
The log-likelihood of the data under the mixture model is a concave function of the weights, so it attains a single global maximum
For each leaf class $c_j$, this maximum is found by an iterative procedure
The algorithm can be viewed as a particularly simple form of EM (Expectation Maximization)
Each datum is assumed to have been generated by first choosing one of the tree nodes on the path to the root, say node $i$, and then using that node's estimate to generate the datum
EM then maximizes the total likelihood when the choices of estimates made for the various data are unknown
The first step in the iterative part is thus the E step, and the second is the M step
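The E and M steps for the interpolation weights can be sketched as follows. This is a minimal illustration that assumes the weights are fitted on a list of held-out word occurrences and start out uniform; the function name `em_shrinkage_weights`, the fixed iteration count, and the toy data are hypothetical choices, not details confirmed by the slides.

```python
def em_shrinkage_weights(path_estimates, heldout_words, iterations=50):
    """Fit mixture weights for the nodes on one leaf-to-root path.

    path_estimates: list of dicts (word -> probability), leaf first.
    heldout_words: word occurrences not used to build the estimates.
    """
    k = len(path_estimates)
    weights = [1.0 / k] * k  # assumed initialization: uniform weights
    for _ in range(iterations):
        # E step: expected number of held-out words generated by each node.
        beta = [0.0] * k
        for word in heldout_words:
            contrib = [lam * est.get(word, 0.0)
                       for lam, est in zip(weights, path_estimates)]
            total = sum(contrib)
            if total == 0.0:
                continue  # word has zero probability under every node
            for i in range(k):
                beta[i] += contrib[i] / total
        # M step: renormalize the expected counts into new weights.
        norm = sum(beta)
        if norm == 0.0:
            break
        weights = [b / norm for b in beta]
    return weights


leaf = {"a": 0.9, "b": 0.1}
root = {"a": 0.5, "b": 0.5}
heldout = ["a"] * 9 + ["b"]
weights = em_shrinkage_weights([leaf, root], heldout)
```

Because the held-out words here match the leaf distribution, the fitted weights favor the leaf; with a leaf estimated from very little data the root would receive more weight.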
Estimating the weights on a held-out set makes inefficient use of the available training data, because some of it must be carved off for that purpose
To overcome this problem the algorithm is modified as follows:
All the available data is used both to construct the ML estimates and to optimize the weights
As each document is used in the above algorithm, the ML estimates are modified to exclude its data, so that they are independent of it
This method is known as “leave-one-out” or “jackknifing”
Experimental Results
The following figure shows classification accuracy on the Industry Sector data set with 50-50 train-test splits while varying vocabulary size
Larger vocabulary sizes generally perform better
Hierarchical feature selection somewhat improves on the performance of flat naive Bayes in the mid range of feature selection, at about 5000 words
Hierarchical feature selection reaches 64% accuracy
Shrinkage improves classification accuracy to 74%
Shrinkage helps more when training data is sparse
The following figure shows accuracy on the Newsgroups data set with the full vocabulary and varying amounts of training data
With small amounts of training data, accuracy in this domain is highest with no feature selection for both classifiers
Shrinkage provides more improvement when the amount of training data is small, and shrinkage reduces variance in the classifications
Estimates are improved by using shrinkage to smooth a class’s parameters with those of its ancestors
The following figure shows classification accuracy on the Science hierarchy as a function of vocabulary size
The best performance of the two classifiers is equal
Among the 51 classes with fewer than 50 training documents, shrinkage provides a 6-point improvement in accuracy, from 39% for flat to 45% for hierarchical
Pruning the Tree for Increased Computational Efficiency
The class hierarchy can be leveraged to improve computational efficiency
The classifier can avoid calculating the posterior probability $P(c_j \mid d)$ for a majority of the classes by pruning the tree dynamically during the classification of each document
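One way to picture dynamic pruning is a greedy top-down descent that scores only the children of the branches kept so far. The sketch below is an assumed beam-style variant, not necessarily the pruning rule used in the original work; `classify_with_pruning`, the `beam` parameter, and the toy scores are all hypothetical.

```python
def classify_with_pruning(tree, root, score, beam=1):
    """Descend the class tree, keeping only the best `beam` nodes per level.

    tree: dict mapping a node to its list of children (leaves are absent).
    score: callable node -> float, higher is better (e.g. a log posterior).
    Returns (best leaf, number of nodes whose score was evaluated).
    """
    frontier, leaves, evaluated = [root], [], 0
    while frontier:
        children = []
        for node in frontier:
            kids = tree.get(node, [])
            if kids:
                children.extend(kids)
            else:
                leaves.append(node)  # reached a leaf class
        evaluated += len(children)
        # Keep the best-scoring children; everything else is pruned.
        frontier = sorted(children, key=score, reverse=True)[:beam]
    return max(leaves, key=score), evaluated


tree = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1", "b2"]}
scores = {"A": 0.7, "B": 0.3, "a1": 0.2, "a2": 0.5, "b1": 0.1, "b2": 0.2}
best_leaf, evaluated = classify_with_pruning(tree, "root", scores.get)
```

With `beam=1` only four of the six classes are scored in this toy tree; the saving grows with the depth and branching factor of the hierarchy, at the cost of possibly pruning the branch that holds the true class.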
Classification using very few words
The task is to categorize documents according to their topic, where topics are organized in a hierarchy of increasing specificity
The bottleneck in this classification task is the need for a person to read each document and decide on its appropriate place in the hierarchy
This is avoided by automatically classifying new documents using machine learning techniques
The approach is to divide the classification task into a set of smaller classification tasks
Each corresponds to some split in the classification hierarchy
The key insight is that each subtask is significantly simpler than the original task
The classifier at each node in the hierarchy needs to distinguish among only a small number of categories
This is possible using a small set of features
The ability to restrict each classifier to a very small feature set avoids many of the difficulties of learning in high-dimensional text domains
Such models are more robust and less subject to overfitting
Thus they achieve better accuracy, even with a very simple classifier such as Naive Bayes
The key point is not merely the use of feature selection but its integration with the hierarchical structure
Choosing a small set of features would not give good performance if it were used for classification in the flattened class space
In the hierarchical approach, any document sees only a small fraction of the features throughout the process
The features it does see are divided so as to focus the attention of the classifier on the features relevant to the classification subtask at hand
Probabilistic Framework
The framework constructs a hierarchical set of classifiers, each based on its own set of relevant features
It uses two main subroutines:
A feature selection algorithm for deciding on the appropriate feature set at each decision point
A supervised learning algorithm for constructing a classifier for that decision
Bayesian Classifiers
A Bayesian network allows us to provide compact descriptions of complex distributions over a large number of random variables
It uses a directed acyclic graph to encode conditional independence assumptions about the domain
These independence assumptions allow the distribution to be described as a product of small local interaction models
A Bayesian classifier is simply a Bayesian network applied to a classification domain
It contains a node C for the class variable and a node $X_i$ for each of the features
Given a specific instance x, the Bayesian network allows us to compute the probability $P(C = c_k \mid X = x)$ for each possible class $c_k$
Bayes optimal classification is achieved by simply selecting the class $c_k$ for which this probability is maximized
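For the simplest such network, Naive Bayes, selecting the most probable class amounts to an argmax over log probabilities. The sketch below assumes the conditional probabilities have already been estimated and smoothed so that none is zero; the function name, the table layout, and the numbers are illustrative assumptions.

```python
import math


def naive_bayes_classify(instance, priors, likelihoods):
    """Return the class c_k maximizing P(c_k) * prod_i P(x_i | c_k).

    instance: dict feature -> observed value.
    priors: dict class -> P(c_k).
    likelihoods: dict class -> dict (feature, value) -> P(x_i | c_k),
        assumed smoothed so that no entry is zero.
    """
    best_class, best_log_prob = None, -math.inf
    for c, prior in priors.items():
        # Work in log space to avoid underflow with many features.
        log_prob = math.log(prior)
        for item in instance.items():
            log_prob += math.log(likelihoods[c][item])
        if log_prob > best_log_prob:
            best_class, best_log_prob = c, log_prob
    return best_class


priors = {"sports": 0.5, "finance": 0.5}
likelihoods = {
    "sports": {("goal", 1): 0.8, ("stock", 0): 0.9},
    "finance": {("goal", 1): 0.1, ("stock", 0): 0.2},
}
predicted = naive_bayes_classify({"goal": 1, "stock": 0}, priors, likelihoods)
```

The normalizing constant P(x) is the same for every class, so it can be dropped without changing which class wins.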
Feature Selection
Feature selection employs information-theoretic measures to determine a subset of the original domain features that seems to best capture the class distribution in the data
For each feature $X_i$ the algorithm determines the expected cross entropy:
$$\delta_i = \sum_{\mathbf{x}} P(\mathbf{x})\, D\big(P(C \mid \mathbf{x}) \,\|\, P(C \mid \mathbf{x}_{G_i})\big)$$
where $G_i$ is the set of all domain features except $X_i$, $\mathbf{x}_{G_i}$ is the instance $\mathbf{x}$ restricted to those features, and $D$ is the cross entropy between the two class distributions
It eliminates the feature $X_i$ for which $\delta_i$ is minimized
Feature Selection
The feature eliminated is the one that least disrupts the original conditional class distribution
This process can be iterated to eliminate as many features as desired
In this respect the algorithm is well suited to text domains with many features
Experimental Results
Both hierarchical and flat classification schemes were run on the datasets without employing any probabilistic feature selection
The original number of features in each data set and the results are given in the following table
Two important phenomena are observed
In the Hier1 dataset, the very large number of features used precludes the hierarchical scheme from performing better than the simple flat method
In the Hier2 dataset, the large number of features and small dataset size allow the more expressive KDB algorithm to overfit the training data
These results provide an empirical motivation for the integration of feature selection
The table shows that the hierarchical method clearly outperforms the flat classification method in a direct comparison of the 10- and 20-feature runs
The hierarchical method produces 8-41% fewer errors than the flat method for Hier1 and more modest relative gains for Hier2
The results show that the feature selection stage does serve to focus the algorithm on the features relevant to the local classification task
The table shows the set of 10 features found to be most discriminating at each level of the hierarchy learned for the Hier1 dataset
At the top level of the hierarchy, high-level terms are selected from the various major topics
At lower levels, more specific words distinguishing the subtopics are seen
Enhanced Word Clustering for Hierarchical Text Classification
Distributional clustering of words/features is one way to reduce dimensionality
Each word cluster can be treated as a single feature, and thus dimensionality can be drastically reduced
Feature clustering is more effective than feature selection, especially at lower numbers of features
Feature clustering appears to preserve classification accuracy as compared to a full-feature classifier
With small training sets and noisy features, word clustering can actually increase classification accuracy
Existing clustering algorithms are agglomerative in nature, yielding sub-optimal word clusters at a high computational cost
DISTRIBUTIONAL WORD CLUSTERING
Let C be a discrete random variable that takes on values from the set of classes
Let W be the random variable that ranges over the set of words
The joint distribution p(C,W) can be estimated from the training set
Now suppose we cluster the words into k clusters $W_1, \ldots, W_k$
Since the goal is to reduce the number of features and the model size, we only look at “hard” clustering, where each word belongs to exactly one word cluster, i.e.
$$W = \bigcup_{i=1}^{k} W_i, \qquad W_i \cap W_j = \emptyset \ \text{ for } i \neq j$$
Let the random variable $W^C$ range over the word clusters
The information about C captured by W can be measured by the mutual information I(C;W)
Ideally, forming word clusters would preserve the mutual information exactly
However, clustering usually lowers mutual information
Thus it is essential to find a clustering that minimizes the decrease in mutual information, $I(C;W) - I(C;W^C)$
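The loss in mutual information caused by a hard clustering can be computed directly from a joint distribution. The sketch below uses a made-up four-word, two-class joint distribution; the helper names `mutual_information` and `cluster_joint` and the cluster assignments are assumptions for illustration.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """I(C;W) in bits for a joint distribution {(c, w): probability}."""
    p_c, p_w = defaultdict(float), defaultdict(float)
    for (c, w), p in joint.items():
        p_c[c] += p
        p_w[w] += p
    return sum(p * math.log2(p / (p_c[c] * p_w[w]))
               for (c, w), p in joint.items() if p > 0)


def cluster_joint(joint, assignment):
    """Merge words into clusters: p(c, W_s) is the sum of p(c, w) over w in W_s."""
    clustered = defaultdict(float)
    for (c, w), p in joint.items():
        clustered[(c, assignment[w])] += p
    return dict(clustered)


joint = {("sports", "goal"): 0.25, ("sports", "match"): 0.25,
         ("finance", "stock"): 0.25, ("finance", "bond"): 0.25}
good = {"goal": 0, "match": 0, "stock": 1, "bond": 1}
bad = {"goal": 0, "stock": 0, "match": 1, "bond": 1}

full_mi = mutual_information(joint)
loss_good = full_mi - mutual_information(cluster_joint(joint, good))
loss_bad = full_mi - mutual_information(cluster_joint(joint, bad))
```

Grouping words with the same class distribution loses nothing here, while mixing words from different classes destroys the full bit of information, which is the quantity a good clustering keeps small.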
Classifying using Word Clusters
The Naive Bayes method can be translated directly into using word clusters instead of words
This is done by estimating new parameters $p(W_s \mid c_i)$ for word clusters, analogous to the word parameters $p(w_t \mid c_i)$:
$$p(W_s \mid c_i) = \frac{\sum_{d_j \in c_i} n(W_s, d_j)}{\sum_{s=1}^{k} \sum_{d_j \in c_i} n(W_s, d_j)}, \qquad n(W_s, d_j) = \sum_{w_t \in W_s} n(w_t, d_j)$$
where $n(w_t, d_j)$ is the number of occurrences of word $w_t$ in document $d_j$
When estimates of $p(w_t \mid c_i)$ for individual words are relatively poor, the corresponding word cluster parameters $p(W_s \mid c_i)$ provide more robust estimates, resulting in higher classification scores
The Naive Bayes rule for classifying a test document d can be rewritten as
$$c^*(d) = \arg\max_{c_i} \left( \frac{\log p(c_i)}{|d|} + \sum_{s=1}^{k} p(W_s \mid d) \log p(W_s \mid c_i) \right)$$
where $p(W_s \mid d)$ is the fraction of the words in d that belong to cluster $W_s$ and $|d|$ is the length of d
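A cluster-based Naive Bayes classifier can be sketched in a few lines. The version below counts cluster occurrences per class and scores a document with unnormalized log probabilities, which selects the same class as a length-normalized rule. The Laplace smoothing constant `alpha`, the function names, and the toy corpus are assumptions, not details from the slides.

```python
import math
from collections import Counter


def cluster_params(docs_by_class, assignment, k, alpha=1.0):
    """Estimate p(W_s | c_i) from cluster counts (alpha: assumed smoothing)."""
    params = {}
    for c, docs in docs_by_class.items():
        counts = Counter()
        for doc in docs:
            for word in doc:
                if word in assignment:
                    counts[assignment[word]] += 1
        total = sum(counts.values()) + alpha * k
        params[c] = [(counts[s] + alpha) / total for s in range(k)]
    return params


def classify(doc, params, priors, assignment):
    """Pick the class maximizing log p(c) + sum over words of log p(W_s | c)."""
    clusters = [assignment[w] for w in doc if w in assignment]
    best, best_score = None, -math.inf
    for c, p in params.items():
        score = math.log(priors[c]) + sum(math.log(p[s]) for s in clusters)
        if score > best_score:
            best, best_score = c, score
    return best


assignment = {"goal": 0, "match": 0, "stock": 1, "bond": 1}
docs_by_class = {"sports": [["goal", "match", "goal"]],
                 "finance": [["stock", "bond", "stock"]]}
params = cluster_params(docs_by_class, assignment, k=2)
priors = {"sports": 0.5, "finance": 0.5}
label = classify(["match", "goal"], params, priors, assignment)
```

Note that "bond" would still be classified sensibly even if it had appeared only once in training, because it shares its cluster's pooled counts with "stock".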
Figures 2 and 3 plot the fraction of mutual information lost against the number of clusters for both the divisive and agglomerative algorithms on the 20Ng and Dmoz data sets
Less mutual information is lost with Divisive Clustering than with Agglomerative Clustering at every number of clusters, though the difference is more pronounced at lower numbers of clusters
Figure 5 shows classification accuracies on the 20 Newsgroups data set for the algorithms considered
The horizontal axis indicates the number of features/clusters used in the classification model, while the vertical axis indicates the percentage of test documents that were classified correctly
Divisive Clustering (SVM as well as NB) achieves significantly better results at lower numbers of features than feature selection using Information Gain and Mutual Information
With 50 clusters, Divisive Clustering (NB) achieves 78.05% accuracy
The largest gain occurs when the number of clusters equals the number of classes
Figure 6 plots classification accuracy on the 20Ng data using Naive Bayes when the training data is sparse
2% of the available data, that is 20 documents per class, is used for training, and testing is done on the remaining 98% of the documents
The results are averages of 5-10 trials
Divisive Clustering obtains better results than Information Gain at every number of features
It also achieves a significant 12% increase over the maximum accuracy achieved by Information Gain
This is in contrast to Figure 5, where Information Gain eventually catches up as the number of features increases
When the training data is small, the word-by-class frequency matrix contains many zero entries
By clustering words, more robust estimates of word-class probabilities are obtained, which lead to higher classification accuracies
Classification of Web Content
Using the known hierarchical structure, the classification problem is decomposed into a set of smaller problems corresponding to hierarchical splits in the tree
The classifier first learns to distinguish among classes at the top level; lower-level distinctions are then learned only within the appropriate top-level branch of the tree
Each of these sub-problems can be solved much more efficiently, and more accurately as well
Classification of Web Content
The use of a hierarchical decomposition of a classification problem allows for efficiencies in both learning and representation
Each sub-problem is smaller than the original problem
It is sometimes possible to use a much smaller set of features for each
The hierarchical structure can also be used to set the negative set for discriminative training and, at classification time, to combine information from different levels
SVM for Web Content
SVMs have been found to be efficient and effective for text classification, but had not previously been explored in the context of hierarchical classification
The efficiency of SVMs for both initial learning and real-time classification makes them applicable to large dynamic collections like web content
Hierarchical structure used in Web Content
The hierarchical structure is used for two purposes:
To train second-level category models using different contrast sets (either within the same top-level category in the hierarchical case, or across all categories in the flat, non-hierarchical case)
To combine scores from the top- and second-level models using different combination rules, some requiring a threshold to be exceeded at the top level before second-level comparisons are made
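Two possible combination rules can be sketched to make the idea concrete: one multiplies the two scores, the other only considers a second-level category once its parent passes a threshold. Both rules, their names, the threshold value, and the category scores are assumptions chosen for illustration rather than the exact rules evaluated in the original work.

```python
def combine_multiplicative(p_top, p_second):
    """Score a second-level category by the product of the two model scores."""
    return p_top * p_second


def combine_sequential(p_top, p_second, top_threshold=0.2):
    """Use the second-level score only if the parent passes the threshold."""
    return p_second if p_top >= top_threshold else 0.0


def rank_second_level(top_scores, second_scores, parent_of, combine):
    """Rank second-level categories by their combined score, best first."""
    combined = {child: combine(top_scores[parent_of[child]], score)
                for child, score in second_scores.items()}
    return sorted(combined, key=combined.get, reverse=True)


top_scores = {"Computers": 0.9, "Sports": 0.1}
second_scores = {"Computers/Hardware": 0.4, "Sports/Soccer": 0.6}
parent_of = {"Computers/Hardware": "Computers", "Sports/Soccer": "Sports"}

by_product = rank_second_level(top_scores, second_scores, parent_of,
                               combine_multiplicative)
by_threshold = rank_second_level(top_scores, second_scores, parent_of,
                                 combine_sequential)
```

A flat ranking on the second-level scores alone would put "Sports/Soccer" first; both hierarchical rules instead prefer "Computers/Hardware" because its parent category is far more likely.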
Classifying web search results
Classification techniques are used to automatically organize search results into existing hierarchical structures
Classification models are learned offline using a training set of human-labeled documents and web categories
Classification offers two advantages compared to clustering:
Run-time classification is very efficient
Manually generated category names are easily understood
Constraints
Two constraints are imposed to support the goal of automatically classifying web search results:
Use just the short summaries returned from web search engines, since it takes too long to retrieve the full text of pages in a networked environment. These automatically generated summaries are much shorter than the texts used in most classification experiments, and they are much noisier than other document surrogates, such as abstracts, that some have worked with
Focus on the top levels of the hierarchy, since many search results can be usefully disambiguated at this level
An interface that tightly couples search results and category structure was found to give large preference and performance advantages for automatically classified search results
TEXT CLASSIFICATION USING SVMs
Text classification involves a training phase and a testing phase.
In the training phase, a large set of web pages with known category labels is used to train a classifier.
An initial model is built using a subset of the labeled data, and a holdout set is used to identify optimal model parameters.
During the testing or operational phase, the learned classifier is used to classify new web pages.
A support vector machine (SVM) algorithm was used as the classifier
A linear SVM is a hyperplane that separates a set of positive examples from a set of negative examples with maximum margin.
The margin is the distance from the hyperplane to the nearest of the positive and negative examples.
In the linearly separable case, maximizing the margin can be expressed as an optimization problem
In cases where points are not linearly separable, slack variables are introduced that permit, but penalize, points that fall on the wrong side of the decision boundary
For problems that are not linearly separable, kernel methods can be used to transform the input space so that some non-linear problems can be learned
The simplest linear form of the SVM is used because it provides good classification accuracy and is fast to learn and apply
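The role of the slack variables can be seen in the standard textbook soft-margin objective for a linear SVM, which adds a penalty for every point that violates the margin. The sketch below only evaluates that objective for a given hyperplane; it does not train anything, and the weight vector, data points, and helper names are made up for illustration.

```python
def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted side."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of slacks, with slack_i = max(0, 1 - y_i (w.x_i + b))."""
    slack = [max(0.0, 1.0 - yi * decision_value(w, b, xi))
             for xi, yi in zip(X, y)]
    return 0.5 * sum(wi * wi for wi in w) + C * sum(slack)


w, b = [1.0, 0.0], 0.0
X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, -1, 1]
objective = soft_margin_objective(w, b, X, y, C=0.01)
```

The first two points lie outside the margin and contribute no slack; the third is correctly classified but inside the margin, so it adds a slack of 0.5, weighted by the penalty C.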
Feature Selection in Web Content
The feature space is reduced by eliminating words that appear in only a single document and then selecting the 1000 words with the highest mutual information with each category
The mutual information MI(F, C) between a feature, F, and a category, C, is defined as:
$$MI(F, C) = \sum_{F \in \{f, \bar{f}\}} \sum_{C \in \{c, \bar{c}\}} P(F, C) \log \frac{P(F, C)}{P(F)\,P(C)}$$
The mutual information is computed between every pair of features and categories
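For a binary feature (word present or absent) and a binary category label, mutual information can be computed from a 2x2 table of document counts. The function below is a minimal sketch under that assumption; the name `feature_category_mi`, the argument order, and the use of base-2 logarithms are illustrative choices.

```python
import math


def feature_category_mi(n11, n10, n01, n00):
    """Mutual information (bits) between a binary feature and a binary category.

    n11: documents in the category that contain the feature
    n10: documents outside the category that contain the feature
    n01: documents in the category that lack the feature
    n00: documents outside the category that lack the feature
    """
    n = n11 + n10 + n01 + n00
    cells = (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    )
    mi = 0.0
    for joint, feature_total, category_total in cells:
        if joint > 0:
            # P(F,C) * log2( P(F,C) / (P(F) P(C)) ), written with raw counts.
            mi += (joint / n) * math.log2(joint * n / (feature_total * category_total))
    return mi


independent = feature_category_mi(25, 25, 25, 25)
perfect = feature_category_mi(50, 0, 0, 50)
```

A word that occurs equally often inside and outside the category scores zero, while a word that occurs in exactly the category's documents scores the maximum of one bit, so ranking by this value favors discriminating words.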
SVM Parameters
In addition to varying the number of features, SVM performance is governed by two parameters:
C (the penalty imposed on training examples that fall on the wrong side of the decision boundary)
p (the decision threshold)
The default C parameter value (0.01) is used
The decision threshold, p, can be set to control precision and recall for different tasks
Increasing p results in fewer test items meeting the criterion, and this usually increases precision but decreases recall
Conversely, decreasing p typically decreases precision but increases recall
p is chosen so as to optimize performance on the F1 measure on a training validation set
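Choosing the threshold by F1 on a validation set can be sketched as a simple search over candidate values. The scores, labels, candidate thresholds, and function names below are made up for illustration, and items scoring at or above the threshold are treated as positive.

```python
def precision_recall_f1(scores, labels, p):
    """Precision, recall and F1 when items scoring >= p are judged positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= p and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= p and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def best_threshold(scores, labels, candidates):
    """Return the candidate threshold with the highest validation F1."""
    return max(candidates,
               key=lambda p: precision_recall_f1(scores, labels, p)[2])


scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
chosen = best_threshold(scores, labels, [0.1, 0.3, 0.5, 0.7])
strict = precision_recall_f1(scores, labels, 0.7)
```

On this toy data, raising p from 0.3 to 0.7 lifts precision from 0.75 to 1.0 but drops recall from 1.0 to about 0.67, so F1 peaks at the lower threshold.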
Results
Decision thresholds were established on a training validation set
For each category, if a test item exceeds the decision threshold, it is judged to be in the category
A test item can be in zero, one, or more than one category
From this, precision (P) and recall (R) are computed
These are micro-averaged to weight the contribution of each category by the number of test examples in it
For each test example, the probability of it being in each of the 13 top-level categories and each of the 150 second-level categories is computed
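Micro-averaging pools the per-category decision counts before computing precision and recall, so large categories dominate the result. The sketch below assumes each category is summarized by its true positive, false positive, and false negative counts; the function name and the counts are hypothetical.

```python
def micro_average(counts):
    """Micro-averaged precision and recall.

    counts: dict category -> (true positives, false positives, false negatives).
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


counts = {"Computers": (90, 10, 10), "Sports": (1, 4, 4)}
micro_p, micro_r = micro_average(counts)
```

Here the micro-averaged precision is 91/105, close to the large category's 0.9, whereas a simple mean of the two per-category precisions (0.9 and 0.2) would be 0.55.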
References
Improving Text Classification by Shrinkage in a Hierarchy of Classes by Andrew McCallum, Ronald Rosenfeld, Tom Mitchell, Andrew Y. Ng http://www.cs.cmu.edu/~mccallum/papers/hier-icml98.ps.gz
Hierarchically classifying documents using very few words by Daphne Koller and Mehran Sahami http://dbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&doc=1997-75&format=pdf&compression=&name=1997-75.pdf
Hierarchical Classification of Web Content by Susan Dumais and Hao Chen http://research.microsoft.com/~sdumais/sigir00.pdf
Enhanced Word Clustering for Hierarchical Text Classification by Inderjit S. Dhillon, Subramanyam Mallela, and Rahul Kumar http://www.cs.utexas.edu/users/inderjit/public_papers/hierdist.pdf
THANK YOU