Text Classification Based on Multi-word With Support Vector Machine
![Page 1: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/1.jpg)
PRESENTED BY: NADIA RASHID ALOKKA
200653782
Text classification based on multi-word with support vector machine
![Page 2: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/2.jpg)
Outline
Introduction
Background
Text Representation and Multi-word Extraction
Text Representation Strategies
Experiments
Main Results
Conclusion
![Page 3: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/3.jpg)
Introduction
With the rapid growth of online information, text classification has become one of the key techniques for handling and organizing text data.
Automated text classification utilizes a supervised learning method to assign predefined category labels to new documents based on the likelihood suggested by a trained set of labels and documents.
![Page 4: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/4.jpg)
Background
What is text classification?
It is a text mining technique that categorizes texts according to some of their features.
![Page 5: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/5.jpg)
Background
What is text mining?
Text mining refers to the process of deriving high-quality information from text.
High-quality information is typically derived by devising patterns and trends through means such as statistical pattern learning.
![Page 6: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/6.jpg)
Background
What is multi-word?
In simple words, it is a set of words that are related to each other.
![Page 7: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/7.jpg)
Background
What is a support vector machine?
SVM is a learning approach introduced in 1995 for solving two-class pattern recognition problems.
SVMs are a set of related supervised learning methods that analyze data and recognize patterns, used for classification and regression analysis.
![Page 8: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/8.jpg)
More About SVM
An SVM constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.
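The largest-margin idea above can be written down as an optimization problem. The following is the standard textbook hard-margin formulation for linearly separable two-class data, given here as an illustration rather than as a formula taken from the slides; the symbols $w$, $b$, $x_i$ and $y_i$ are notation introduced for this sketch.

```latex
% Training set: (x_i, y_i), i = 1..n, with labels y_i in {-1, +1}.
% Separating hyperplane: w \cdot x + b = 0.
\begin{aligned}
\min_{w,\,b}\quad & \tfrac{1}{2}\,\lVert w \rVert^{2} \\
\text{subject to}\quad & y_i \,(w \cdot x_i + b) \ge 1, \qquad i = 1,\dots,n .
\end{aligned}
% The distance between the two supporting hyperplanes
% w \cdot x + b = +1 and w \cdot x + b = -1 is 2 / \lVert w \rVert,
% so minimizing \lVert w \rVert maximizes the margin.
```

A new point $x$ is then assigned to the class given by the sign of $w \cdot x + b$.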
![Page 9: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/9.jpg)
What do we mean by text representation?
Text representation is the process of transforming the unstructured texts into structured data as numerical vectors which can be handled by data mining techniques.
It has a strong impact on the generalization accuracy of a learning system.
![Page 10: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/10.jpg)
What is usually used for text representation?
Usually, bag of words (BOW) in vector space model is used to represent the text using individual words obtained from the given text data set.
As a simple and intuitive method, BOW makes representation and learning easy and highly efficient because it ignores the order and meaning of individual words.
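A minimal sketch of the BOW idea, assuming whitespace tokenization and raw term counts as weights (the slides do not specify either choice). The function name `bag_of_words` and the sample documents are made up for illustration.

```python
from collections import Counter


def bag_of_words(documents):
    """Turn raw texts into count vectors over a shared vocabulary."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One dimension per vocabulary word; absent words count as 0.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


docs = ["sugar import quota", "quota import sugar", "import sugar sugar"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['import', 'quota', 'sugar']
print(vectors)     # [[1, 1, 1], [1, 1, 1], [1, 0, 2]]
```

The first two documents contain the same words in a different order and receive the same vector, which is exactly the word-order information that BOW discards.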
![Page 11: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/11.jpg)
What is the problem in BOW?
Information patterns discovered by BOW are not interpretable and comprehensible because the linguistic meaning and semantics are not integrated into representation of documents.
![Page 12: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/12.jpg)
How to solve this problem?
Ontology enhanced representation
•This method uses ontology to capture the concepts in the documents and integrate the domain knowledge of individual words into the terms for representation. Some work has already been done along these lines.
Linguistic unit enhanced representation
•This method makes use of lexical and syntactic rules of phrases to extract the terminologies, noun phrases and entities from documents and enrich the representation using these linguistic units.
Word sequence enhanced representation
•This method ignores the semantics in documents and treats the words as string sequences. Text representation with this method is based either on word groups formed by co-occurrence or on word sequences extracted from documents by traditional string matching.
![Page 13: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/13.jpg)
Summary of what they have done?
First, a multi-word extraction method is developed based on the syntactic rules of multi-words.
Documents are represented with these multi-words using different strategies.
A series of experiments is designed to examine the performance of text classification methods in order to evaluate the effectiveness of multi-word representation.
![Page 14: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/14.jpg)
Multi-word Extraction
The linguistic method, which utilizes the structural properties of phrases in sentences to extract multi-words from documents.
The statistical method, which is based on corpus learning with mutual information (MI) for word occurrence pattern discovery.
Some other methods combine both linguistic knowledge and statistical computation for multi-word extraction.
![Page 15: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/15.jpg)
Multi-word extraction method used in this paper
It is the regular expression for multi-word noun phrases
Suppose that A is an adjective, N is a noun, and P is a preposition.
![Page 16: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/16.jpg)
Example
The U.S. agriculture department last December slashed its 12 month of 1987 sugar import quota from the Philippines to 143,780 short tons from 231,660 short tons in 1986.
The following multi-words are extracted from this sentence: “U.S. agriculture department” (NNN), “U.S. agriculture” (NN), “agriculture department” (NN), “last December” (AN), “sugar import quota” (NNN), and “short tons” (AN).
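The extraction step can be sketched in code under two loudly stated assumptions. First, the regular expression itself is not reproduced in this transcript, so the sketch uses the well-known Justeson and Katz noun-phrase pattern `((A|N)+ | (A|N)*(NP)?(A|N)*) N`, which may differ from the one in the paper. Second, the part-of-speech tags are assigned by hand here; a real system would obtain them from a tagger. The names `NOUN_PHRASE` and `extract_multiwords` are hypothetical.

```python
import re

# Assumed pattern (Justeson and Katz style), over one-letter tags:
# A = adjective, N = noun, P = preposition, O = anything else.
NOUN_PHRASE = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")


def extract_multiwords(tagged, max_len=5):
    """Return every contiguous span of 2..max_len words whose tags match."""
    tags = "".join(tag for _, tag in tagged)
    found = []
    for start in range(len(tagged)):
        last_end = min(len(tagged), start + max_len)
        for end in range(start + 2, last_end + 1):
            if NOUN_PHRASE.fullmatch(tags[start:end]):
                words = " ".join(word for word, _ in tagged[start:end])
                found.append((words, tags[start:end]))
    return found


# Hand-tagged fragment of the example sentence (tags are an assumption).
fragment = [
    ("the", "O"), ("U.S.", "N"), ("agriculture", "N"), ("department", "N"),
    ("slashed", "O"), ("short", "A"), ("tons", "N"),
]
multiwords = extract_multiwords(fragment)
for words, pattern in multiwords:
    print(f"{words} ({pattern})")
```

On this fragment the sketch yields “U.S. agriculture” (NN), “U.S. agriculture department” (NNN), “agriculture department” (NN) and “short tons” (AN). Because every matching span is returned, the number of candidates grows quickly on full sentences.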
![Page 17: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/17.jpg)
The problem... and the solution
There are too many word sequences satisfying the criteria of the above regular expression.
The solution is to add a further criterion alongside this pattern to tell genuine multi-words apart from other matching word sequences.
The basic idea of finding the repetition pattern from two sentences is string matching.
![Page 18: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/18.jpg)
Example
Assume we have two sentences, s1 = {A B C D E F G H} and s2 = {F G H E D A B C}, where each capital letter represents an individual word in a sentence, as shown in Fig. 1.
![Page 19: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/19.jpg)
![Page 20: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/20.jpg)
The computational complexity of this algorithm is O(mn), where m and n are the lengths of s1 and s2, respectively.
The algorithm can also be improved to run in O(m + n) time.
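Since the figure with the algorithm is not part of this transcript, the following is only a sketch of one O(mn) way to find repeated word sequences shared by two sentences: a dynamic-programming table of common run lengths. It is not necessarily the procedure shown in the paper; the function name `repeated_patterns` and the `min_len` parameter are assumptions for illustration.

```python
def repeated_patterns(s1, s2, min_len=2):
    """Find maximal word sequences of at least min_len words shared by s1 and s2."""
    m, n = len(s1), len(s2)
    # run[i][j] = length of the common run ending at s1[i-1] and s2[j-1].
    run = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1

    patterns = []
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            length = run[i][j]
            # Skip runs that continue one word further to the right.
            extended = i < m and j < n and run[i + 1][j + 1] > 0
            if length >= min_len and not extended:
                pattern = tuple(s1[i - length:i])
                if pattern not in patterns:
                    patterns.append(pattern)
    return patterns


s1 = "A B C D E F G H".split()
s2 = "F G H E D A B C".split()
patterns = repeated_patterns(s1, s2)
print(patterns)  # [('A', 'B', 'C'), ('F', 'G', 'H')]
```

The two nested loops over the table give the O(mn) cost. The single words D and E also occur in both sentences, but they fall below `min_len` and are not reported as multi-words.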
![Page 21: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/21.jpg)
Text Representation
Decomposition strategy
Combination strategy
![Page 22: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/22.jpg)
Decomposition strategy
The short multi-words are used for representation.
A long multi-word will be eliminated from the feature set if it can be produced by merging the short multi-words extracted from the corpus
![Page 23: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/23.jpg)
Example: Decomposition strategy
“U.S. agriculture department”
will be eliminated from the feature set because it can be replaced by
“U.S. agriculture” and “agriculture department”.
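A minimal sketch of the decomposition strategy, assuming that "merging" means a shorter multi-word forming the start of the long one and another forming its end, with the two together covering every word. The slides do not define merging precisely, so this reading and the function name `decompose` are assumptions.

```python
def decompose(multiwords):
    """Drop long multi-words that can be rebuilt from shorter ones in the set."""
    phrases = {tuple(mw.split()) for mw in multiwords}
    kept = []
    for phrase in phrases:
        n = len(phrase)
        mergeable = any(
            phrase[:i] in phrases and phrase[j:] in phrases
            for i in range(2, n)      # proper prefix of at least two words
            for j in range(1, i + 1)  # suffix starts at or before the prefix end
            if n - j >= 2             # suffix also has at least two words
        )
        if not mergeable:
            kept.append(" ".join(phrase))
    return sorted(kept)


extracted = [
    "U.S. agriculture department",
    "U.S. agriculture",
    "agriculture department",
    "short tons",
]
short_features = decompose(extracted)
print(short_features)
# ['U.S. agriculture', 'agriculture department', 'short tons']
```

Two-word multi-words can never be decomposed further under this reading, so they are always kept.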
![Page 24: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/24.jpg)
After the multi-words are normalized into features with this strategy, the documents are represented with term weights based on the document frequencies (DF) of the multi-words, because frequency is an important clue to the degree of relevance of a multi-word to the topic of a document, i.e., the category of a document.
![Page 25: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/25.jpg)
Combination strategy
The long multi-words will be used for representation.
The short multi-words will be eliminated from the feature set if they are included in long multi-words.
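A minimal sketch of the combination strategy, assuming "included in" means appearing as a contiguous word sequence inside a longer extracted multi-word. The function name `combine` and the sample list are made up for illustration.

```python
def combine(multiwords):
    """Keep only multi-words that are not contained in a longer multi-word."""
    phrases = {tuple(mw.split()) for mw in multiwords}

    def contained_in(shorter, longer):
        k = len(shorter)
        if k >= len(longer):
            return False
        # Slide a window of k words over the longer phrase.
        return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))

    kept = [p for p in phrases if not any(contained_in(p, q) for q in phrases)]
    return sorted(" ".join(p) for p in kept)


extracted = [
    "U.S. agriculture department",
    "U.S. agriculture",
    "agriculture department",
    "short tons",
]
long_features = combine(extracted)
print(long_features)  # ['U.S. agriculture department', 'short tons']
```

Compared with decomposition, this leaves fewer but more specific features, which is why matching a long feature against shorter occurrences in a document becomes a question in its own right.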
![Page 26: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/26.jpg)
Cont. Combination strategy
After the originally extracted multi-words have been normalized into features with this method, a crucial problem is how to use the long multi-words for representation.
![Page 27: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/27.jpg)
Example: Combination strategy
If “U.S. agriculture department” is used as the feature and “U.S. agriculture” and “agriculture department” are eliminated, should we regard “U.S. agriculture department” as approximately the same as “U.S. agriculture” or “agriculture department” for representation?
To overcome this representation problem, dynamic k-mismatch is proposed.
![Page 28: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/28.jpg)
Experiment
Text collection and preprocessing
Experiment design and setting
Results and evaluation
![Page 29: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/29.jpg)
Text collection and preprocessing
Reuters-21578 text collection was applied as experimental data.
The preprocessing we carried out for the assigned data includes stop word elimination, stemming and sentence boundary determination.
![Page 30: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/30.jpg)
Experiment design and setting
Training data
Number of classifiers
Benchmarks and baseline
Repetition
![Page 31: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/31.jpg)
Main Results
The linear kernel outperforms the non-linear kernel for every representation method.
In the multi-word representation, the combination strategy is superior to the decomposition strategy.
The effect of different representation strategies on text classification is greater than the effect of different kernel functions.
![Page 32: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/32.jpg)
Main Results
This outcome indicates that representation using subtopics of general concepts can obtain better performance than representation using general concepts in text classification.
![Page 33: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/33.jpg)
Conclusion
Multi-word is a newly exploited feature for text representation in the field of information retrieval and text mining
![Page 34: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/34.jpg)
Conclusion
The benefits of multi-word representation include at least three aspects.
Firstly, it has lower dimension than individual words, yet its performance is acceptable.
Secondly, multi-word is easy to acquire from documents by corpus learning without any support of dictionary or ontology.
Thirdly, a multi-word includes more semantics and is a larger meaningful unit than an individual word.
![Page 35: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/35.jpg)
Conclusion
Needs mathematical proofs
![Page 36: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/36.jpg)
About the paper
![Page 37: Text Classification Based on Multi-word With Support Vector Machine](https://reader033.fdocuments.us/reader033/viewer/2022061103/541157647bef0a31688b4580/html5/thumbnails/37.jpg)