DiVA portal: 1105415/FULLTEXT01.pdf


Contents

1 Introduction
  1.1 Delimitations
  1.2 Related Work
  1.3 Important Concepts

2 Method
  2.1 Experimental study
    2.1.1 Dataset for the Experiments
    2.1.2 Quality measures
    2.1.3 Classification
    2.1.4 Machine Architecture

3 Results
  3.1 Results from the Experimental Study
    3.1.1 Comparing Weighting Methods of Terms in BOW features
    3.1.2 The Effect of Case Conversion and Removal of Stop Words
    3.1.3 Number of Words to use in Terms for BOW features
    3.1.4 Features for Counting Characters and Quotation Marks

4 Discussion
  4.1 Conclusions
  4.2 Future Work

5 References

Appendices
A Classes for the Classifier Designs
B Python Build Dependencies
C Confusion Matrix


1 Introduction

Skynet in "Terminator", the machines of "The Matrix" and HAL 9000 in "2001: A Space Odyssey" are all frightening depictions of Artificial Intelligence (AI) from the movies, and probably what comes to mind when most people think of AI. This thesis, however, is not about superintelligence; it covers AI for understanding human-written text. This kind of AI belongs to the research field of Natural Language Processing (NLP), where "natural language" refers to any language used by humans to communicate [BKL09e]. The language processed in this project was English.

Creating intelligent systems that can understand text and natural language has been a goal since the beginning of computers. Alan Turing wrote the article "Computing Machinery and Intelligence" in 1950, and it has been a researched and experimented-on scientific topic ever since [Tur50]. Probably one of the most well-known implementations that uses NLP is IBM's Watson, which is described as a "question answering" system. A computer running Watson competed in the American quiz show "Jeopardy!" and managed to beat two previous winners of the show [Mar11].

The background for this project was the NLP task of matching natural language queries with user profiles. The project was divided into two studies: an experimental study and a case study conducted together with Thingmap, where the results from the experimental study were applied to Thingmap's solution for mapping queries to users, to see whether any improvement was gained.

The approach was to increase the context of natural language queries through text classification, an attempt to categorize short texts into multiple classes. However, complete solutions for text classification did not seem suitable for training short-text classifiers, since short texts carry less information than large texts and documents. A lack of theoretical support was discovered regarding how to design a classification system for short texts and how the text should be represented to achieve optimal results. The experimental study's purpose was to bridge this perceived research gap: to find out how to represent text and how to design a multi-class classification system for short texts.

This thesis presents experiments on how to represent text as features for short text classification, and comparisons of how a flat classification design stood against a hierarchically designed classification system.


1.1 Delimitations

The experiments were set up with a dataset of short texts and a large number of samples. Since the number of samples was large, adding detailed features was costly in training time. A consequence of this was a limited level of detail in the text features of the experiments.

Another consequence was that the set of suitable algorithms for statistical learning was reduced by using a larger dataset. The approximation method Stochastic Gradient Descent (SGD) was chosen, and no other method was compared. SGD is considered an efficient method for large datasets [Bot10]. SGD together with logistic regression is explained in the method in section 2.1.3.

The motivation for the choice of text features included in the study was a combination of advice from supervisors and concepts from the most successful solutions in a short text classification competition hosted by Kaggle in 2013 [KAG13].

1.2 Related Work

In an article published in 2016, the authors describe a Java library they developed, called "Edison", which gives the user the flexibility to specify and implement different feature extractors for text classification or clustering purposes [SCK+16]. Edison supports a variety of NLP tools such as Named Entity Recognition, Part-of-speech tagging and detection of numerical expressions in text. Edison is a ready-to-use solution with some powerful options for configuring feature extractors. The paper about Edison focuses on how to simplify feature extraction, while this thesis is more about comparing different text features for short text classification.

Another project implemented a technique for classification of hierarchically structured data, where the data were organized in a hierarchy of increasing specificity. The authors tested the hierarchical classification system and tried to implement a feature for class similarity. They claimed that the accuracy of their classifier was better than that of a traditionally designed classifier [WZH].

1.3 Important Concepts

The classification in this thesis consisted of supervised statistical learning. In supervised learning, a model receives an input X and an output Y, and attempts to map from input to output based on previous attempts [Alp14]. By comparing the guessed output with the true output for each sample of X and Y, the model measures


the loss of each prediction. The loss regulates how the model parameters are adjusted for future predictions [Sl16a]. The learning phase is referred to as training in this thesis. A model that has finished training is called a classifier; its task is, for a given input X, to predict a class among the classes known from the training phase.

For it to be possible to classify an object, the object has to be represented in a way that the model can understand. Whether the classifier should recognize a face in an image or detect whether an e-mail is spam, the object is represented with so-called features. Before training, the features to extract from the object are specified [BKL09f]. In other words, feature extraction is how a classifier understands the object, and thus an important part of classifying it.

The most common type of features for text purposes is called Bag of Words (BOW). The concept is for the features to represent the terms occurring in a text: the text is traversed to create a vocabulary of the terms encountered, and for each text it is then noted which terms were found [BMG10]. BOW features can be varied in different ways: with weighting methods, by extending the number of words a term is made up of, or by excluding some words from the vocabulary. These are some examples that are examined in the experimental study.

A contemporary survey of machine learning methods is given in [HTF09]. The book [MS99] reviews methods for dealing with natural language, and [MRS08] discusses methods of information retrieval.


2 Method

The method consisted of an experimental study on two different designs of text classifiers and on the choice of text features for short text classification. The measurements used in the experimental study were: precision, recall, the Receiver Operating Characteristic (ROC) curve, and the Area Under the Curve (AUC) of the ROC.

In this thesis, all equations and formulas are numbered on the right for referencing purposes.

2.1 Experimental study

For the experiments, a dataset of English titles from Stack Exchange¹ (SE) posts was used, labeled among 17 classes. The dataset was separated into one part for supervised training and one part for testing and evaluating the classifiers' performance.

Two different designs of classification were compared in the experimental study: a flat design, which predicted among 17 subclasses², in contrast to a 2-level hierarchical design. The hierarchical design first predicted among 4 main classes, and that prediction led to the second-level classifier, which classified the subclass. Each main class had a varying number of subclasses, and the subclasses together made up the same set of 17 classes that the flat-designed classifier predicted. Both classifier designs were trained and tested with the same datasets, which made direct comparison of the results possible.

The experimental study consisted of four different experiments. The order of the experiments was important, because each result led to the settings for the next experiment. For example: in experiment 1, two different weighting methods, A and B, were compared. If weighting method A gave better results, method A was then used as the weighting method for experiment 2, where something else was experimented on. This was an attempt to arrive at a classifier performing as well as possible in the last experiment.

Besides comparing the classifier designs in each experiment, the study consisted of four parts: 1) comparing weighting methods of the terms in BOW features; 2) the effect of letter case conversion and removal of stop words; 3) the number of words to use in each term for BOW features; 4) the effects of adding features for the number of characters in a text and the number of quotation marks occurring in the text.

¹ An online community that hosts ca. 150 question-and-answer sites [SE16a].
² See all the classes in Appendix A.


2.1.1 Dataset for the Experiments

The dataset was based on the openly distributed data dump from SE, which publishes the contents of their sites through the "Stack Exchange Data Dump". The data dump holds all the user-contributed content from the sites on the SE network [SE16b]. The data dump used for this project was uploaded September 12th, 2016. The SE data dump came in the form of XML files, which had to be processed before they could be used as training and testing data for the system. For the experimental study, 3.2 million samples of SE titles were used.

Supervised training means that the training samples are labeled with a class. For the dataset used in this project, the samples were labeled among 4 main classes and 17 subclasses. The flat-designed classifier focused on the subclasses only. The dataset was divided with a training set of 80% of the total, meaning that all the classifiers of the experiments were trained with the same 2 563 571 samples. The remaining 20% was set aside for testing the prediction performance of the classifiers. During the training phase, the classifiers were never exposed to the 641 206 test set samples.
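An 80/20 split of this kind can be sketched with Scikit-Learn; the titles and labels below are invented stand-ins for the SE samples:

```python
from sklearn.model_selection import train_test_split

# Invented stand-ins for the labeled SE titles.
titles = ["Statistical Dimension of a Cone", "How to merge two branches in git",
          "Watering schedule for tomato plants", "Proof of the spectral theorem"] * 5
labels = ["Mathematics and Statistics", "Technology",
          "Gardening", "Mathematics and Statistics"] * 5

# 80% of the samples for training, the remaining 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, test_size=0.2, random_state=0)

print(len(X_train), len(X_test))
```

Fixing `random_state` makes the split reproducible, so every classifier in a study sees the same training and test samples.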

The samples consisted of short English texts of between 30 and 150 characters. An example of a sample: "Statistical Dimension of a Cone", labeled with the class Mathematics and Statistics.

2.1.2 Quality measures

When choosing measures for performance evaluation of classifiers, precision and recall are the most common. However, these have been criticized for not accounting for error costs and over-represented classes [Faw06]. Therefore, the Receiver Operating Characteristic (ROC) curve from signal detection was added to balance the quality measures. The measures are explained thoroughly in the following subsections.

Confusion Matrix

Most quality measures for predictive modeling and classification are derived from the confusion matrix (also called coincidence matrix or classification matrix). The matrix is n×n, where n is the number of classes. The rows hold the true classes while the columns hold the predicted classes. The optimal result of a predictive model would show as a diagonal confusion matrix, that is, all zeros except for the diagonal between (1,1) and (n,n) [Tin10].

From the confusion matrix, the predictions of a classifier can be analyzed. It can show which classes the prediction model had problems identifying, which classes


were mistaken for one another and which classes were easy to predict.

Precision, Recall, and F1-score

With the confusion matrix as a foundation, it was possible to calculate additional measures. Precision and recall are frequently used in information retrieval and pattern recognition [BKL09a]. The measures apply to binary classes, where the outcome is either positive or negative. The formula for precision:

Precision = TP / (TP + FP)    (1)

where TP (true positives) is the number of positive predictions that were correct, and FP (false positives) is the number of predicted positives that were incorrect. Recall was calculated as:

Recall = TP / (TP + FN)    (2)

where TP is the true positives and FN (false negatives) is the number of predicted negatives that were incorrect. The F1-score was also used; it is the harmonic mean of precision (P) and recall (R):

F1 = 2 · P · R / (P + R)    (3)

In words, precision is the rate at which a positive prediction of a certain class is correct, and recall is the ratio of all samples belonging to a class that are actually predicted as that class. Table 1 gives an example of how to derive precision and recall from a confusion matrix.

Table 1: Example of a confusion matrix for a classifier with classes A, B and C. The classifier predicted 26 samples; the columns represent the predicted classes and the rows show the true classes.

n = 26    Predicted A    Predicted B    Predicted C
True A         4              1              0
True B         6              6              1
True C         2              3              3

In table 1, column 2 shows the samples that were predicted to belong to class B. The true positives (TP) are on the diagonal, TP_B = 6; the other


predicted B's are false positives (FP), hence FP_B = 1 + 3 = 4. Row 2 shows the samples that truly belong to class B; the true B's that are not predicted positives are false negatives (FN), FN_B = 6 + 1 = 7. With the values needed for the calculation, P, R and F1-score for class B can be computed:

P_B = 6 / (6 + 4) = 0.6000    (4)

R_B = 6 / (6 + 7) = 0.4615    (5)

F1_B = (2 · 0.6 · 0.4615) / (0.6 + 0.4615) = 0.5217    (6)

These values are calculated for all classes, and shown in table 2.

Table 2: Precision, recall and F1-score calculated for each class A, B and C from the values of table 1. The number of samples n for each class is also presented.

n = 26    Precision    Recall    F1-score    n
A         0.3333       0.8000    0.4706      5
B         0.6000       0.4615    0.5217      13
C         0.7500       0.3750    0.5000      8

The resulting precision, recall and F1-score for the classifiers in the experiments are averages over all classes, where the precision, recall and F1-score of class k are weighted by the number of samples n_k for that class, summed, and finally divided by the total number of test samples N. The formula for the weighted average of precision:

P = (1/N) · Σ_{k=1}^{17} P_k · n_k    (7)

For the example results in table 2, the weighted average precision follows from equation 7:

P = (0.3333 · 5 + 0.6 · 13 + 0.75 · 8) / 26 = 0.5949    (8)


Receiver Operating Characteristics

With origins in signal detection, the receiver operating characteristic curve (ROC curve) is a plot for binary prediction that also reflects the costs of false positive predictions. It plots recall (the true positive rate, TPR) against the false positive rate (FPR). Unlike precision and recall, the ROC curve gives a fair measure for test sets with unbalanced classes [Faw06]. TPR is the same as recall, and FPR is calculated as:

FPR = FP / (FP + TN)    (9)

where FP is the number of false positives and TN (true negatives) is the number of predicted negatives that were correct. The ROC curve is intended for a binary classifier that outputs probabilities for its predictions. When plotting an ROC curve for a test, the predictions should come as a vector of probabilities, where a value close to 1 means that a positive prediction is likely and a value close to 0 means that the classifier is confident that the sample is negative. The second parameter for an ROC curve is a binary vector of the true classes [Sl16c]. The threshold value decides at which probability a positive prediction is given; if the threshold is 0.4, the classifier predicts positive for all probabilities above 0.4. The threshold is swept from 0 to 1, and for each step the TPR and FPR are plotted.

While the ROC curve visually displays the performance of a classifier, a numerical value can be easier to compare. That is why the Area Under the Curve (AUC) is often calculated for the ROC curve to indicate the performance of a classifier for a certain test run. The AUC is a value between 0 and 1. A random classifier would get an ROC curve as a line from (0,0) to (1,1) with AUC = 0.5, which means that an AUC above 0.5 is needed to perform better than a random classifier [Fla10].

However, the ROC is made for binary classification, and since this project deals with multiclass classification, a modification of the ROC had to be made. The modification was to let the classifier predict classes for the test dataset; the true and predicted classes are then compared, and a binary vector is created, holding 1 or 0 depending on whether each prediction was correct. The probability vector contains the probability the classifier assigned to its most likely label.
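A sketch of this modification (data and names invented): the correctness of each top prediction becomes the binary vector, and the top-class probability becomes the score passed to the ROC routine:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)

# Invented stand-ins: predicted probabilities over 17 classes, and true labels.
proba = rng.dirichlet(np.ones(17), size=1000)
y_true = rng.integers(0, 17, size=1000)

predicted = proba.argmax(axis=1)
correct = (predicted == y_true).astype(int)  # 1 if the top prediction was right
confidence = proba.max(axis=1)               # probability given to the top class

fpr, tpr, thresholds = roc_curve(correct, confidence)
print(auc(fpr, tpr))
```

Because the stand-in probabilities are unrelated to the stand-in labels, the AUC here lands near the 0.5 random baseline; a real classifier whose confidence tracks correctness would score higher.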

The random baseline follows the same procedure for plotting an ROC curve, but on a random classifier. The random classifier distributes 17 random probabilities with Σ_{i=1}^{17} p(i) = 1, and the highest probability gives the class index that is predicted. The idea is to obtain a random classifier whose ROC curve is the line between (0,0) and (1,1).


AUC values for a large number of samples (as in the case of the test dataset) are considered normally distributed [HM82]. For calculating the standard error (SE) of an AUC value, Hanley and McNeil's formula was applied:

SE(AUC) = sqrt( [AUC(1 − AUC) + (n_true − 1)(Q1 − AUC²) + (n_false − 1)(Q2 − AUC²)] / (n_true · n_false) )    (10)

where n_true is the number of correct predictions and n_false, respectively, is the number of false predictions. Q1 and Q2 are calculated in equations 11 and 12:

Q1 = AUC / (2 − AUC)    (11)

Q2 = 2 · AUC² / (1 + AUC)    (12)

Hypothesis testing was conducted in the ROC analysis for all experiments, by performing z-tests on the difference of two AUC values to see whether the difference was clearly separated from zero. The hypothesis test used the fact that the difference Z of two random variables X and Y from independent normal distributions, X ∼ N(µ_X, σ²_X) and Y ∼ N(µ_Y, σ²_Y), is also normally distributed [ES08]:

Z ∼ N(µ_X − µ_Y, σ²_X + σ²_Y)    (13)

This rule made it possible to determine whether two classifiers' AUC values differed with statistical significance. The null hypothesis was that the difference of two AUC values equals 0. A right-tailed z-test was used to get the p-value for the difference of two AUC values under the distribution X ∼ N(0, σ²_AUC1 + σ²_AUC2). A p-value smaller than 0.05 resulted in rejection of the null hypothesis, establishing that the difference between AUC1 and AUC2 was statistically significant; correspondingly, a p-value larger than 0.05 means the difference was not statistically significant.

2.1.3 Classification

For this project to be reproducible, some points of the learning procedure will be specified, together with a more detailed explanation of the algorithms used to train the prediction models. The focus of the experimental study was both on the classifier design and on the part called the feature extractor, which decides which features of the text represent it to the classifier. Apart from studying the feature


extractor, a flat classification design like the one in figure 1 was compared to a hierarchically designed classification system (see figure 2).

[Figure 1 shows the training and prediction pipeline: the input passes through a feature extractor producing a feature vector (x_1 ... x_n); during training, the feature vectors and labels (y_1 ... y_n) feed a learning algorithm that produces a classifier, which at prediction time outputs a class.]

Figure 1: General layout of the classification process with labeled data (supervised classification) [BKL09b].

The classifiers were trained using a model called One Versus Rest (OVR), also known as one-vs-all. This means that for each class, a binary classifier is trained to decide how likely it is that a given input belongs to that class. The probabilities are compared, and the highest scoring binary classifier gives its class as the result of the whole classifier. OVR can be thought of as a simple design, yet the complexity grows with the number of classes, since every class needs its own predictive model [Sl16d].

Each predictive model within the OVR performs its prediction using a linear prediction function:

f(x) = wᵀx + b    (14)

The prediction function is learned during the training phase. In the "learning algorithm" (figure 1), the model parameters w, with dimension equal to the length of the feature vector used, and the constant b are sought. In this project, this is done using an algorithm called Stochastic Gradient Descent (SGD).

Stochastic Gradient Descent

The type of problem that SGD solves belongs to the stochastic approximation algorithms of statistical learning. The goal is to make the expected loss E(f(x)) as small as possible. Consider the expected loss to be given by:

E(f(x)) = L(f(X), Y)    (15)

where X = {x_1, x_2, ..., x_n}, Y = {y_1, y_2, ..., y_n} and L() measures the size of the error according to the logistic loss. SGD's way of minimizing E(f(x)) is to visit all examples and update the model parameters w according to that step's logistic loss [Zha04].

For the specific SGD method used in this project, the update function for w looks like:

w_{i+1} = w_i − η (α ∂R(w_i)/∂w_i + ∂L(f(x_i), y_i)/∂w_i)    (16)

Here η defines the rate at which the w values are updated, and α is a scaling factor for the regularization term, which is:

R(w_i) = w_i² / 2    (17)

As mentioned, this is one implementation of SGD, with logistic regression as the loss function and the regularization term R(w_i) (eq. 17) [Sl16a].

Hierarchical model

The hierarchical classifier in the experimental study (see figure 2) consisted of five classifiers: one that classified the main class, and four that predicted the subclass (see all classes in Appendix A). The design of each of the five classifiers matched the flat classifier design in figure 1. For testing and comparing the 2-level hierarchical design with the flat design, the same input was given and the same output was expected.
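The two-step prediction path can be sketched generically; the stub classifiers below are invented purely to show the control flow:

```python
class StubClassifier:
    """Stand-in for a trained Sklearn-style classifier (invented for illustration)."""
    def __init__(self, label):
        self.label = label

    def predict(self, texts):
        return [self.label for _ in texts]

def predict_hierarchical(title, main_clf, sub_clfs):
    """Level 1 picks the main class; its result redirects the input
    to the matching level-2 classifier, which predicts the subclass."""
    main_class = main_clf.predict([title])[0]
    return sub_clfs[main_class].predict([title])[0]

main = StubClassifier("Tech and Physical Sciences")
subs = {"Tech and Physical Sciences": StubClassifier("Mathematics and Statistics")}
print(predict_hierarchical("Statistical Dimension of a Cone", main, subs))
```

A real system would hold one trained subclass classifier per main class in `sub_clfs`, each with its own feature extractor, as in figure 2.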


[Figure 2 shows the 2-level hierarchy: at level 1, a main class classifier takes the input's feature vector and its prediction redirects the input to one of four level-2 subclass classifiers (Business and Logistics; Health, Life and Earth; Human Sciences; Tech and Physical Sciences), each with its own feature extractor and feature vector, which outputs the final class.]

Figure 2: Design of the 2-level hierarchical classifying system. The main class classifier's prediction decides which subclass classifier performs the 2nd-level classification.

Software Implementation of the Classifiers

The implementation of the classifiers in the experimental study was done in the programming language Python³ with the machine learning toolkit Scikit-Learn (Sklearn). The OVR model used was Sklearn's OneVsRestClassifier⁴, with Sklearn's SGDClassifier⁵ as the estimator for each class in the OVR. For the SGD, the logistic regression loss function was chosen with the parameter SGDClassifier(loss="log"); other parameters (η, α) for the OVR and SGD modules were left at their defaults. The feature vectorizers used were a CountVectorizer and a TfidfVectorizer. For the last experiment, a DictVectorizer was used to transform a list of Python dictionaries into a feature vector for classification.

³ All dependencies are presented in Appendix B.

2.1.4 Machine Architecture

The machine used for the project had 8 gigabytes (GB) of RAM and an Intel Core i7-3537U processor with 2 cores and 2 hyperthreads at a 2.00 GHz clock frequency. All training and testing of the classifiers in the experiments was performed on the described computer.

⁴ [Sl16d]
⁵ [Sl16a]


3 Results

In this section, the results from the experimental study are presented. The results are divided into 4 experiments, and each experiment is explained and visualized in its own subsection.

3.1 Results from the Experimental Study

The experimental study compared different feature settings for short text classification. All experiments included the comparison of the flat against the 2-level hierarchical classifier design. The metrics examined in the comparison were precision, recall, the ROC curve and its AUC.

The text features were divided into four experiments: 1) Term Frequency times Inverse Document Frequency (TFIDF) normalized BOW features versus integer-represented term counts in BOW features; 2) removing versus keeping stop words, together with lowercase conversion versus no case conversion; 3) the order n of the n-grams in BOW features; 4) adding features for the number of characters and the number of quotation marks in the texts. The results are displayed in the following subsections with graphs of ROC curves, accompanied by AUC values with hypothesis testing, and tables with the measures precision, recall and F1-score.

3.1.1 Comparing Weighting Methods of Terms in BOW features

TFIDF is a weighting method for terms in BOW features; TFIDF stands for Term Frequency times Inverse Document Frequency. It is calculated for each term in a document, over all documents, during the feature extraction phase. For a term t in document d of dataset D:

TFIDF(t, d, D) = TF(t, d) · IDF(t, D)    (18)

The term frequency TF(t, d) is the number of times the term t occurs in document d. In Sklearn's TFIDF method, which was used in this experiment, the IDF part is calculated as:

IDF(t, D) = log( (1 + N_D) / (1 + DF(t, D)) ) + 1    (19)

where N_D is the number of documents in dataset D, and DF(t, D) is the number of documents in the dataset that contain the term t. After the TFIDF value is calculated


for each term, the vectors containing the TFIDF values are normalized with the Euclidean norm [Sl16b]:

v_normalized = v / sqrt(v_1² + v_2² + ... + v_n²)    (20)

Since this experiment was the first, no other settings for the BOW features were added. The experiment was done using the default parameters of the Sklearn modules TfidfVectorizer and CountVectorizer. The TfidfVectorizer performed the weighting method explained in formulas 18, 19 and 20, while the CountVectorizer used a term-count representation of BOW features.
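A small check (toy corpus invented) that the two vectorizers behave as described: the CountVectorizer yields integer counts, while the TfidfVectorizer's rows are L2-normalized per equation 20:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]  # invented toy corpus

counts = CountVectorizer().fit_transform(docs)  # integer term occurrences
tfidf = TfidfVectorizer().fit_transform(docs)   # smoothed TFIDF (eq. 18-19)

# Every TFIDF row has unit Euclidean norm after the normalization in eq. 20.
row_norms = np.asarray(np.sqrt(tfidf.multiply(tfidf).sum(axis=1))).ravel()
print(counts.toarray())
print(row_norms)  # all close to 1.0
```

Both vectorizers share the same tokenization and vocabulary defaults, so the comparison in this experiment isolates the effect of the weighting alone.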

Figure 3: ROC curves of the experiment on TFIDF weighted BOW features versus integerrepresented term occurrence in BOW features (term count). The comparison is made forboth flat and 2-level hierarchical designs.

Figure 3 shows that the flat-designed classifier with TFIDF weighting got the highest score, AUC1 = 0.8803. Second was the 2-level hierarchical design with integer


represented term occurrence, at AUC2 = 0.8797. The third best result came from the flat design with integer represented term occurrence, scoring AUC3 = 0.8777. The z-test of the difference AUC1 − AUC2 gave a p-value of p = 0.2212 (see figure 4), and the difference AUC2 − AUC3 gave a p-value of p = 0.0092 (see figure 5).

Figure 4: AUC of Flat design with TFIDF weighting minus AUC of 2-level Hierarchicaldesign using integer represented term occurrence. Looking at the right side tail of the normaldistribution, the difference: AUC1−AUC2 = 0.0006 gives a p-value of 0.2212


Figure 5: AUC of 2-level hierarchical designed minus flat designed classifier, both using integer represented term occurrence. Looking at the right side tail of the normal distribution, the difference AUC2−AUC3 = 0.0020 gives a p-value of 0.0092.

The z-test displayed in figure 4 shows that the difference between the flat design with TFIDF and the 2-level hierarchical design with integer represented term occurrence has a p-value of p = 0.2212, which is too high to reject the null hypothesis. The other z-test, seen in figure 5, shows that the difference between the 2-level hierarchical designed and flat designed classifiers, both using integer represented term occurrence, gave a p-value of p = 0.0092, low enough to reject the null hypothesis.
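The right-tailed z-test on an AUC difference can be sketched as follows, assuming two independent AUC estimates with known standard errors (the thesis derives the standard errors in the Hanley–McNeil style; the 0.0006 values below are assumptions of the same order as those reported in the figures).

```python
from math import erf, sqrt

def auc_diff_p_value(auc_a, auc_b, se_a, se_b):
    """Right-tail p-value for the null hypothesis that AUC_a is not
    greater than AUC_b, assuming independent normal AUC estimates."""
    z = (auc_a - auc_b) / sqrt(se_a**2 + se_b**2)
    # Right tail of the standard normal distribution.
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

# AUC values for 2-level hierarchical (0.8797) versus flat (0.8777),
# both with term count; standard errors are assumed to be 0.0006 each.
print(auc_diff_p_value(0.8797, 0.8777, 0.0006, 0.0006))  # ≈ 0.0092
```

With these assumed standard errors the sketch reproduces the reported p-value of 0.0092 for the 0.0020 difference.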

Table 3: Results in precision, recall and F1-score for integer represented term occurrence (term count) versus TFIDF weighted BOW features. Both feature types were tested with flat and 2-level hierarchical designs.

Classifier Design       BOW-feature representation   Precision   Recall   F1-score
Flat                    Term Count                   0.7899      0.7908   0.7576
Flat                    TFIDF weighting              0.7214      0.7067   0.6240
2-level hierarchical    Term Count                   0.7912      0.7915   0.7598
2-level hierarchical    TFIDF weighting              0.7239      0.7073   0.6305

Unlike the ROC analysis in figure 3, the measures in table 3 showed higher values for integer represented term occurrence. For the flat designed classifier, the F1-score was 0.7576 for integer represented term occurrence and 0.6240 for the TFIDF weighting method, an increase of 21.4% in F1-score. The 2-level hierarchical designed classifier's F1-scores were 0.7598 for integer represented term occurrence and 0.6305 for the TFIDF weighting method, an increase of 20.5% in F1-score.

3.1.2 The Effect of Case conversion and Removal of Stop Words

This subsection contains the results of the experiment on converting the texts to lower case and removing predefined stop words. Stop words are common words that rarely contribute to the meaning of texts, and by removing them the dimension of the BOW-feature vector is lowered [BKL09c]. In this experiment, NLTK's list of English stop words was used. Prior to creating the feature vector, words in the text that exist in the list of stop words are removed. Case conversion is also performed before the feature vector is produced, by going through the input and converting all letters to lowercase. The idea of lowercase conversion is to minimize the vocabulary; without case conversion the same word can exist multiple times in the vocabulary, spelled with differently cased letters. This experiment built on the result of the previous experiment, and therefore used integer represented term occurrence, with the CountVectorizer as feature vectorizer.
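The two preprocessing steps can be sketched with the CountVectorizer. The stop word list below is a small hand-picked excerpt standing in for NLTK's full English list (the experiment loaded the full list via nltk.corpus.stopwords), and the two example texts are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Excerpt of NLTK's English stop word list; the experiment used the full
# list obtained with nltk.corpus.stopwords.words("english").
stop_words = ["my", "them", "the", "was", "of", "a", "and", "i"]

docs = ["My dog scared them away", "The dog was scared of my cat"]

# lowercase=True unifies casing; stop_words drops the listed words before
# the vocabulary is built.
vectorizer = CountVectorizer(lowercase=True, stop_words=stop_words)
features = vectorizer.fit_transform(docs)

# "My" and "my" no longer produce two vocabulary entries, and the stop
# words are gone from the vocabulary entirely.
print(sorted(vectorizer.vocabulary_))  # ['away', 'cat', 'dog', 'scared']
```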


Figure 6: ROC curves of the experiment on case conversion and stop word removal. The different approaches were tested on both flat and 2-level hierarchical designs.

The results of the ROC analysis can be seen in figure 6. Three classifiers with the 2-level hierarchical design got the highest AUC scores: lowercase conversion at AUC1 = 0.8826, stop word removal at AUC2 = 0.8829, and both lowercase conversion and stop word removal at AUC3 = 0.8825. The highest difference between the three was AUC2−AUC3 = 0.0004, which was less than both results' standard error of 0.0006. A difference of statistical significance between the three can therefore be rejected.

The best AUC score for the flat designed classifier used lowercase conversion, with a score of AUC4 = 0.8797. The z-test performed on the difference AUC1−AUC4 resulted in a p-value of p = 0.0003, a difference of statistical significance.


Figure 7: AUC of 2-level hierarchical design minus AUC of flat design, both with lowercase conversion. Looking at the right side tail of the normal distribution, the difference AUC1−AUC4 = 0.0029 gives a p-value of 0.0003.

For the flat designed classifier, it is shown in figure 8 that conversion to lowercase letters was better than no case conversion with statistical significance; the p-value was p = 0.0169.

Figure 8: AUC of lowercase conversion minus no case conversion, both with flat classifier designs. Looking at the right side tail of the normal distribution, the difference 0.0018 gives a p-value of 0.0169.


In the case of the 2-level hierarchical design, for the comparison of lowercase conversion against no case conversion, the standard errors were the same as for the flat design (σ = 0.0006), but the difference between the AUC values was 0.0023, which is larger than the 0.0018 for the flat design. The hypothesis testing therefore gives a p-value less than in the flat case, p < 0.0169, which establishes statistical significance. Figure 9 shows that removing stop words was better than keeping them for the flat designed classifier, with statistical significance at a p-value of p = 0.0226.

Figure 9: AUC of removing stop words minus keeping stop words, both with flat classifier designs. Looking at the right side tail of the normal distribution, the difference 0.0017 gives a p-value of 0.0226.

The difference in AUC values between removing and keeping stop words was larger for the 2-level hierarchical classifier than for the flat design, which was of statistical significance. Since they share the standard error (σ = 0.0006), the p-value of the difference for the 2-level hierarchical design was p < 0.0226, and thus also of statistical significance.

Table 4 lists the values of precision, recall and F1-score from the experiment on case conversion and stop word removal.


Table 4: Results in precision, recall and F1-score for case conversion and stop word removal. All feature settings were tested on both flat and 2-level hierarchical designs.

Classifier design       Lowercase   Stop words removed   Precision   Recall   F1-score
Flat                    No          No                   0.7900      0.7902   0.7568
Flat                    Yes         No                   0.8013      0.8045   0.7765
Flat                    Yes         Yes                  0.8013      0.8044   0.7762
Flat                    No          Yes                  0.8013      0.8044   0.7762
2-level hierarchical    No          No                   0.7920      0.7920   0.7605
2-level hierarchical    Yes         No                   0.8035      0.8057   0.7790
2-level hierarchical    Yes         Yes                  0.8039      0.8060   0.7795
2-level hierarchical    No          Yes                  0.8032      0.8052   0.7783

It is shown in table 4 that the classifiers that used neither case conversion nor stop word removal got worse results than the others. For the flat design, the best score was received with lowercase conversion, F1-score = 0.7765, which was 2.6% higher than the F1-score of the classifier using no case conversion. The 2-level hierarchical designed classifier with lowercase conversion had a 2.4% higher F1-score compared to using no case conversion. When comparing the results of stop word removal, the flat designed classifier where stop words were removed had a 2.6% higher F1-score than when keeping stop words. The 2-level hierarchical case showed a similar increase of 2.3%. The best performing classifier of this experiment in terms of F1-score was the 2-level hierarchical design with lowercase conversion and stop word removal (F1-score = 0.7795).

3.1.3 Number of Words to use in Terms for BOW features

This experiment examined a BOW feature setting called n-grams. The number of words in the terms can be adjusted for texts in the BOW features. The maximum number n of words that a term can hold is referred to as the n-gram order.

For an example sentence, ”My dog scared them away.”, the unigram (n = 1) BOW features would contain the terms: ["my", "dog", "scared", "them", "away"]. For bigram (n = 2) the BOW features would contain: ["my", "dog", "scared", "them", "away", "my dog", "dog scared", "scared them", "them away"] [BKL09d]. By extending the order of the n-gram, more information from the text is collected. More information leads to a feature vector of larger dimension, which means a heavier load on memory and more time consumed for prediction. For the training set, the vocabulary size of the flat designed classifier increased from 246169 for unigram, to 4300121 for bigram and 14402201 for trigram.
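The vocabulary growth can be reproduced in miniature with the example sentence above, using CountVectorizer's ngram_range parameter so that the vocabulary holds every term of 1 up to n words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The example sentence from the text; ngram_range=(1, n) corresponds to
# the unigram, bigram and trigram settings in this experiment.
sentence = ["My dog scared them away"]

for n in (1, 2, 3):
    vec = CountVectorizer(ngram_range=(1, n), lowercase=True)
    vec.fit(sentence)
    # 5 unigrams, +4 bigrams, +3 trigrams for this 5-word sentence.
    print(n, len(vec.vocabulary_), sorted(vec.vocabulary_))
```

For this one sentence the vocabulary grows from 5 to 9 to 12 terms, the same kind of growth that takes the full training vocabulary from 246169 to 14402201 entries.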

For this experiment the CountVectorizer (introduced in section 3.1.1) was used with lowercase conversion and keeping stop words (see section 3.1.2). The experiment examined the effects of increasing the n-gram order from unigram to bigram and trigram. Due to RAM limitations6, the size of the feature vector was capped at 5000000. The trigram feature vector hit this max-features limit in this experiment.

Figure 10: ROC curves of the experiment on length of n-grams in BOW-features: uni-, bi- and trigrams. The three n-gram lengths were tested on both flat and 2-level hierarchical designs.

In figure 10 the results of the ROC analysis of this experiment are shown. The 2-level hierarchical design with bigram and with trigram got the highest AUC value, AUC1 = 0.8870; it follows that there is no reliable difference between those two results. The flat design with trigram was second at AUC2 = 0.8837, and the 2-level hierarchical design with unigram received a score of AUC3 = 0.8829. For each n-gram setting, the 2-level hierarchical design got higher AUC values. In the first z-test, displayed in figure 11, the difference between flat and 2-level hierarchical designs for trigram BOW features was examined, giving p = 0.0004, which is statistically significant.

6See architecture of the machine used in the project in section 2.1.4

Figure 11: AUC of 2-level hierarchical design minus AUC of flat design, both with trigram BOW features. Looking at the right side tail of the normal distribution, the difference AUC1−AUC2 = 0.0033 gives a p-value of 0.0002.

The AUC for the flat designed classifier with bigrams differed by 0.0035 from the unigram. The difference showed to be of statistical significance, with a p-value of p = 0.0001, as seen in figure 12. The difference for the same case with the 2-level hierarchical designed classifier was 0.0041, and its standard errors were lower, which means that this difference was also of statistical significance, with a p-value less than 0.0001.


Figure 12: AUC of bigram minus AUC of unigram, both with flat classifier design. Looking at the right side tail of the normal distribution, the difference AUC1−AUC3 = 0.0035 gives a p-value of p = 0.0001.

The measures in precision, recall and F1-score, displayed in table 5, show that the classifiers scored higher for each increase of n-gram order. For the flat design, extending from unigram to bigram gave a 1.6% higher F1-score, and extending from bigram to trigram gave a 0.07% higher F1-score. The same observation on the 2-level hierarchical design showed a 1.3% increase of F1-score from unigram to bigram, and a 0.02% increase from bigram to trigram.

Table 5: Results in precision, recall and F1-score of the experiment on length of n-grams in BOW-features. The three n-gram lengths were tested on both flat and 2-level hierarchical designs.

Classifier Design       n-gram    Precision   Recall   F1-score
Flat                    Unigram   0.8013      0.8041   0.7758
Flat                    Bigram    0.8113      0.8138   0.7883
Flat                    Trigram   0.8122      0.8143   0.7889
2-level hierarchical    Unigram   0.8032      0.8054   0.7784
2-level hierarchical    Bigram    0.8130      0.8135   0.7887
2-level hierarchical    Trigram   0.8146      0.8148   0.7904


3.1.4 Features for Counting Characters and Quotation Marks

This last experiment of the study tested adding features for the length of the texts (i.e. how many characters a text consisted of) and how many quotation marks occurred in the texts. Integer representations of the counted text stats were added to the feature vector in this experiment. The experiment was done using lowercase conversion and integer represented bigram occurrences as BOW features: a CountVectorizer with parameters lowercase = True and ngram_range = (1,2), with the rest of the parameters at their defaults.
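Appending the counted text stats to the BOW features can be sketched as below. The use of scipy.sparse.hstack to combine the columns, and the two example texts, are assumptions; the thesis does not state how the integer columns were joined to the term-count matrix.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical input texts.
docs = ['She said "hello" and left', "A short text"]

# Bigram term-count BOW features, as in this experiment.
bow = CountVectorizer(lowercase=True, ngram_range=(1, 2)).fit_transform(docs)

# Integer text stats per document: [character count, quotation mark count].
text_stats = np.array([[len(d), d.count('"')] for d in docs])

# Combine BOW columns and the two text-stat columns into one feature matrix.
features = hstack([bow, text_stats])
print(features.shape)  # (2, number_of_bow_terms + 2)
```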

Figure 13: ROC curves of the experiment on adding features for the length of the texts and the number of quotemarks in them. The different text stat features were tested on both flat and 2-level hierarchical designs.

In the ROC analysis in figure 13, the highest AUC value, AUC1 = 0.8872, was from the 2-level hierarchical design without the text stat features examined in this experiment. At AUC2 = 0.8867, the 2-level hierarchical classifier with the feature for counting quotemarks got the second highest AUC value. The third highest scores were received from the flat design, where the classifier without the text stat features and the classifier using the feature for counting quotemarks got the same AUC value of AUC3 = 0.8837.

In the z-test that examined if the difference AUC1−AUC2 was of statistical significance (see figure 14), the result showed a p-value of 0.2778, which indicates that the difference was not of statistical significance.

Figure 14: AUC of no text stats minus AUC of counting quotemarks, both with 2-level hierarchical design. Looking at the right side tail of the normal distribution, the difference AUC1−AUC2 = 0.0005 gave a p-value of 0.2778.

The last z-test examined the difference between the flat and 2-level hierarchical classifier designs, both without using the text stat features introduced in this experiment. As seen in figure 15, a p-value of 0.0001 was obtained, which was far below the threshold value, so the difference was of statistical significance.


Figure 15: AUC of 2-level hierarchical design minus AUC of flat design, both without using any text stat features for counting characters or quotemarks. Looking at the right side tail of the normal distribution, the difference AUC1−AUC3 = 0.0035 gave a p-value of p = 0.0001.

For the second set of measures in this experiment, the flat classifier design received the highest precision, recall and F1-score, with an F1-score of 0.8013. The 2-level hierarchical design with quotation marks received an F1-score of 0.7905, as seen in table 6. For the flat designed classifier, counting quotemarks increased the F1-score by 0.03%, and by 0.02% for the 2-level hierarchical design.

Table 6: Results in precision, recall and F1-score of the experiment on adding features for counting characters and quotation marks.

Classifier design       Text stats                        Precision   Recall   F1-score
Flat                    -                                 0.8109      0.8126   0.7866
Flat                    Text length                       0.8026      0.8016   0.7762
Flat                    Text length and quotation marks   0.8168      0.8199   0.8013
Flat                    Quotation marks                   0.8128      0.8147   0.7893
2-level hierarchical    -                                 0.8136      0.8136   0.7888
2-level hierarchical    Text length                       0.8024      0.7984   0.7698
2-level hierarchical    Text length and quotation marks   0.8051      0.8039   0.7779
2-level hierarchical    Quotation marks                   0.8152      0.8150   0.7905

Appendix C shows the confusion matrix of the result from the flat designed classifier that got the best score in precision and recall. Looking at the confusion matrix, one can get an overview of how the data was distributed over the classes and how it was predicted.


4 Discussion

This thesis examined how a classifier should be designed and what decisions to take when choosing text features for multiclass classification of short texts. The experiments showed that the 2-level hierarchical designed classifier scored higher than the flat design in 11 out of 13 cases, for both F1-score and AUC of the ROC curve.

The highest AUC value was received from the 2-level hierarchical designed classifier with feature settings: term count representation of bigram terms in BOW features; conversion of the input to lowercase letters; keeping the stop words of the text; no features added for counting characters or quotemarks in the texts. This classifier received an AUC value of 0.8872. The best F1-score, F1 = 0.8013, was achieved by a flat designed classifier with feature settings: term count representation of bigram terms in BOW features; conversion of the input to lowercase letters; keeping the stop words of the texts; and adding features for the number of characters and quotemarks in the texts.

The findings of this thesis act as an opening on the subject of how to implement short text classifiers. Since previous research on the subject was sparse, this thesis contributes scientific support for what previously were opinions and ideas.

One of the inspirations for the text features examined was a Kaggle competition [KAG13] where most solutions used TFIDF weighted terms in BOW feature vectors, but in this study, unexpectedly, it was more successful to use a term count representation of the BOW features.

To increase the reliability of the results, the choice was made to add the ROC curve to the quality measures, rather than only using the standard measures of classification: precision, recall and F1-score. ROC analysis was brought in to add balance to the results. However, the ROC curve is designed to analyze binary predictions, so a modification for multiclass ROC analysis was used in this project. Since it is an untried method, it should be used along with other measures, to see if it is a credible measure method for multiclass classification.
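One common way to extend ROC analysis to multiclass output is micro-averaging over binarized labels; this is a standard recipe, not necessarily the exact modification the thesis used, and the labels and scores below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

classes = [0, 1, 2]
y_true = np.array([0, 1, 2, 2, 1, 0])
# Hypothetical per-class scores from a classifier's decision function.
y_score = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
    [0.4, 0.5, 0.1],
    [0.6, 0.2, 0.2],
])

# Binarize the labels (one column per class), then treat every
# (sample, class) pair as one binary prediction: a micro-averaged ROC.
y_bin = label_binarize(y_true, classes=classes)
fpr, tpr, _ = roc_curve(y_bin.ravel(), y_score.ravel())
print(auc(fpr, tpr))
```

Because every sample contributes one positive and two negative pairs, micro-averaging also dampens the effect of imbalanced classes on the curve.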

4.1 Conclusions

The 2-level hierarchical designed classifiers showed to give significantly better ROC curves than the flat designed classifiers. With a p-value of p = 0.0006, the 2-level hierarchical designed classifier with the highest AUC value, 0.8872, showed to be significantly better than the flat designed classifier with the highest AUC value, 0.8837. The 2-level hierarchical design gave better results in 11 out of 13 total implemented classifiers. For OneVersusRest classifiers, each class learns its own binary classifier; therefore the complexity is reduced in terms of training time and memory load when using a hierarchical designed classifier.
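The 2-level design can be sketched with scikit-learn's OneVsRestClassifier: one top-level classifier routes a text to a main class, and one classifier per main class picks the subclass. Main class and subclass names follow Appendix A, but the training texts and the SGDClassifier settings are assumptions, not the thesis's actual configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Two of the main classes from Appendix A, with a subset of their subclasses.
subclasses = {
    "Business and Logistics": ["Business and Administration", "Transport Services"],
    "Tech and Physical Sciences": ["Computer Science", "Engineering"],
}

def make_clf():
    # One-vs-rest over linear classifiers trained with SGD, on bigram BOW.
    return make_pipeline(
        CountVectorizer(lowercase=True, ngram_range=(1, 2)),
        OneVsRestClassifier(SGDClassifier(random_state=0)),
    )

# Hypothetical training texts labeled (text, main class, subclass).
data = [
    ("how to ship freight by rail", "Business and Logistics", "Transport Services"),
    ("quarterly budget and invoicing", "Business and Logistics", "Business and Administration"),
    ("sorting a python list quickly", "Tech and Physical Sciences", "Computer Science"),
    ("stress analysis of a steel beam", "Tech and Physical Sciences", "Engineering"),
]
texts = [t for t, _, _ in data]

# Level 1: predict the main class; level 2: one classifier per main class.
top = make_clf().fit(texts, [m for _, m, _ in data])
level2 = {
    main: make_clf().fit(
        [t for t, m, _ in data if m == main],
        [s for _, m, s in data if m == main],
    )
    for main in subclasses
}

def predict(text):
    main = top.predict([text])[0]
    return main, level2[main].predict([text])[0]

print(predict("python code for sorting"))
```

Each second-level classifier only ever distinguishes the few subclasses under its main class, which is where the reduction in per-classifier training time and memory comes from.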

In the comparison of weighting methods of terms in BOW features, the ROC analysis resulted in a flat designed classifier with TFIDF weighting receiving the best AUC value of 0.8803. Yet, the 2-level hierarchical designed classifier with term count got an AUC value of 0.8797. With a p-value of p = 0.2212, the two classifiers' AUC results were too close to claim a difference of statistical significance. Examining the F1-scores, the term count received a 21.4% (for the flat design) and 20.5% (for the 2-level hierarchical design) better F1-score. The TFIDF weighting method has been shown successful for other text classification purposes, but the findings of this thesis indicate that term count was more suitable for classification of short texts.

Conversion to lowercase letters indicated better results than no case conversion. The ROC analysis showed that the flat designed classifier with lowercase conversion received a higher AUC value than with no case conversion. With a p-value of 0.0169 the difference was of statistical significance. Also for the 2-level hierarchical design, lowercase conversion gave a higher AUC value than no case conversion, with statistical significance. Moving on to the F1-scores, lowercase conversion got a 2.6% and 2.4% better F1-score than no case conversion, for the flat and 2-level hierarchical classifier designs respectively. Lowercase conversion is a simple adjustment of the input which, according to this study, improves classification performance for short text classification.

The results indicate that removing stop words was more successful than keeping stop words. The AUC values from the ROC curves have shown, for both classifier designs, that removing stop words is better with statistical significance than keeping the stop words, with a p-value of 0.0226 for the flat designed classifier, and less than 0.0226 for the 2-level hierarchical design. The F1-scores showed an increase when removing stop words in contrast to keeping stop words: an increase of 2.6% and 2.3% for the flat and 2-level hierarchical designs respectively. This result shows that even for short texts, stop words do not bring any information that contributes to the meaning of the texts, since removing the stop words gives better results for the classifiers.

The n-gram lengths of bigram and trigram showed to get better results than unigrams, for the length of terms in the BOW features. The AUC values from the ROC analysis pointed out that bigram BOW features were better than unigram with statistical significance. The flat designed classifier with bigram BOW features differed significantly from unigram BOW features, with a p-value of p = 0.0001, and the 2-level hierarchical difference for the same case gave a p-value less than 0.0001. The difference in AUC values between bigram and trigram did not present statistical significance for any of the classifier designs. Looking at the F1-scores, the flat designed classifier with bigram BOW features increased the F1-score by 1.6% against the unigram, while extending from bigram to trigram increased the F1-score by only 0.07%. The 2-level hierarchical F1-score increased by 1.3% and 0.02% respectively. As mentioned in the results (in section 3.1.3), the trigram hit the limit of maximum BOW features of 5000000 terms in the vocabulary, which should be kept in mind when examining the results of this experiment. Increasing the n-gram order in BOW features was a successful way to add information from the short texts used in this thesis.

Adding a feature for the number of characters in the texts did not show to be successful. For both AUC values and F1-score, the results were better without the text length feature.

The feature for counting quotemarks in the texts indicated a minimal improvement of the F1-score compared to not counting quotemarks. For the AUC values, though, no difference of statistical significance was identified: the flat designed classifier resulted in the same AUC value with and without counting quotemarks. For the 2-level hierarchical classifier design, the AUC value without counting quotemarks was higher than the value when counting them, but with a p-value of 0.2778 it was not of statistical significance. The F1-score increased by 0.03% for the flat designed classifier and 0.02% for the 2-level hierarchical design.

The results from the experimental study were used for a case study with Thingmap, for mapping natural language queries to users. This resulted in an improvement over earlier solutions of their system.

4.2 Future Work

Since the 2-level hierarchical design indicated to be successful, it would be interesting to see what a higher order of hierarchy can do, not only for short text classification but also for other classification tasks.

The dataset that was used was imbalanced and weighted heavily toward the computer science direction. It would be interesting to see similar experiments with a balanced dataset. However, large datasets for supervised training are not commonly found. Another interesting aspect is to test methods that deal with imbalanced datasets, such as: undersampling, which removes samples of the majority classes; oversampling, where samples are generated for minority classes; or cost sensitive learning, which evaluates the cost associated with misclassifying observations.
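Of the techniques listed, random undersampling is the simplest to sketch: each class is trimmed down to the size of the smallest class before training. This is purely illustrative and not part of the thesis's experiments.

```python
import random
from collections import defaultdict

def undersample(samples, labels, seed=0):
    """Randomly trim every class down to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for s, y in zip(samples, labels):
        by_class[y].append(s)
    n_min = min(len(group) for group in by_class.values())
    pairs = []
    for y, group in by_class.items():
        for s in rng.sample(group, n_min):  # drop surplus majority samples
            pairs.append((s, y))
    rng.shuffle(pairs)
    return [s for s, _ in pairs], [y for _, y in pairs]

# Hypothetical imbalanced labels, 7 vs 3.
texts = ["t%d" % i for i in range(10)]
labels = ["Computer Science"] * 7 + ["Arts"] * 3
x, y = undersample(texts, labels)
print(sorted(y))  # 3 'Arts' and 3 'Computer Science' labels remain
```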


This thesis contributed a method for doing ROC analysis on multiclass classifiers, for future research on multiclass classifiers. The ROC modification is recommended as an addition to other measures: partly to establish the credibility of the ROC modification, but also to add a balanced measure to a study, for example when dealing with imbalanced classes.


5 References

[Alp14] Ethem Alpaydin. Introduction to Machine Learning. MIT Press, 3rd edition, 2014.

[BKL09a] Steven Bird, Ewan Klein, and Edward Loper. Evaluation. In Julie Steele, editor, Natural Language Processing with Python, chapter 6.3, pages 237–241. O'Reilly Media, Inc, Sebastopol, 2009.

[BKL09b] Steven Bird, Ewan Klein, and Edward Loper. Figure 6.1. In Julie Steele, editor, Natural Language Processing with Python, chapter 6.1, page 222. O'Reilly Media, Inc, Sebastopol, 2009.

[BKL09c] Steven Bird, Ewan Klein, and Edward Loper. Lexical resources. In Julie Steele, editor, Natural Language Processing with Python, chapter 2.4, pages 59–66. O'Reilly Media, Inc, Sebastopol, 2009.

[BKL09d] Steven Bird, Ewan Klein, and Edward Loper. N-gram tagging. In Julie Steele, editor, Natural Language Processing with Python, chapter 5.5, pages 202–208. O'Reilly Media, Inc, Sebastopol, 2009.

[BKL09e] Steven Bird, Ewan Klein, and Edward Loper. Preface. In Julie Steele, editor, Natural Language Processing with Python, page ix. O'Reilly Media, Inc, Sebastopol, 2009.

[BKL09f] Steven Bird, Ewan Klein, and Edward Loper. Supervised classification. In Julie Steele, editor, Natural Language Processing with Python, chapter 6.1, pages 221–233. O'Reilly Media, Inc, Sebastopol, 2009.

[BMG10] Janez Brank, Dunja Mladenic, and Marko Grobelnik. Feature Construction in Text Mining, pages 397–401. Springer US, Boston, MA, 2010.

[Bot10] Leon Bottou. Large-scale machine learning with stochastic gradient descent. International Conference on Computational Statistics, pages 177–187, 2010.

[ES08] Bennett Eisenberg and Rosemary Sullivan. Why is the sum of independent normal random variables normal? Mathematics Magazine, 81(5):362–366, 2008.

[Faw06] Tom Fawcett. An introduction to ROC analysis. In Pattern Recognition Letters, volume 27, pages 861–874. Elsevier, Palo Alto, 2006.

[Fla10] Peter A. Flach. ROC analysis. In Claude Sammut and Geoffrey I. Webb, editors, Encyclopedia of Machine Learning, pages 869–875. Springer US, Boston, MA, 2010.

[HM82] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36, 1982. PMID: 7063747.

[HTF09] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer New York, 2009.

[KAG13] Facebook recruiting III - keyword extraction. https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction, 2013. Visited 2017-01-09.

[Mar11] John Markoff. Computer wins on 'Jeopardy!': Trivial, it's not. The New York Times, 2011.

[MRS08] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, Boston, MA, 2008.

[MS99] Christopher D. Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

[SCK+16] Mark Sammons, Christos Christodoulopoulos, Parisa Kordjamshidi, Daniel Khashabi, Vivek Srikumar, Paul Vijayakumar, Mazin Bokhari, Xinbo Wu, and Dan Roth. Edison: Feature extraction for NLP, simplified. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA), 2016.

[SE16a] Stack Exchange, Inc. About - Stack Exchange. http://stackexchange.com/about, 2016. Visited 2016-09-12.

[SE16b] Stack Exchange, Inc. Stack Exchange data dump. https://archive.org/details/stackexchange, 2016. Visited 2016-09-12.

[Sl16a] Scikit-learn. 1.5 Stochastic gradient descent. http://scikit-learn.org/stable/modules/sgd.html, 2016. Visited 2017-01-17.

[Sl16b] Scikit-learn. Feature extraction. http://scikit-learn.org/stable/modules/feature_extraction.html, 2016. Visited 2017-01-05.

[Sl16c] Scikit-learn. sklearn.metrics.roc_curve. http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html, 2016. Visited 2017-01-10.

[Sl16d] Scikit-learn. sklearn.multiclass.OneVsRestClassifier. http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#sklearn-multiclass-onevsrestclassifier, 2016. Visited 2017-01-29.

[Tin10] Kai Ming Ting. Confusion matrix. In Claude Sammut and Geoffrey I. Webb, editors, Encyclopedia of Machine Learning, page 209. Springer US, Boston, MA, 2010.

[Tur50] Alan M. Turing. Computing machinery and intelligence. Mind, 59:433–460, 1950.

[WZH] Ke Wang, Senqiang Zhou, and Yu He. Hierarchical classification of real life documents. In Proceedings of the 2001 SIAM International Conference on Data Mining, pages 1–16.

[Zha04] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In ICML 2004: Proceedings of the Twenty-First International Conference on Machine Learning, pages 919–926. Omnipress, 2004.

Page 42: DiVA portal1105415/FULLTEXT01.pdfSkynet in ”Terminator”, the machines of ”The Matrix” and HAL 9000 in ”2001: A ... any language used by humans to communicate [BKL09e]. The

Appendices

A Classes for the Classifier Designs

Flat classifier’s classes
Arts
Business and Administration
Computer Science
Education
Engineering
Environment
Health
Humanities
Journalism and Information
Law
Life Sciences
Mathematics and Statistics
Personal Services and Hobbies
Physical Sciences
Social Science
Tech
Transport Services

2-Level Hierarchical Classifier’s main classes
Business and Logistics
Health, Life and Earth
Human Sciences
Tech and Physical Sciences

Business and Logistics’s Subclasses
Business and Administration
Transport Services

Health, Life and Earth’s Subclasses
Environment
Health
Life Sciences
Social Science

Human Sciences’ Subclasses
Arts
Education
Humanities
Journalism and Information
Law
Personal Services and Hobbies

Tech and Physical Sciences’ Subclasses
Computer Science
Engineering
Mathematics and Statistics
Physical Sciences
Tech
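The two-level hierarchy above can be expressed as a plain mapping from each main class to its subclasses. The following is a minimal Python sketch (class names taken verbatim from the lists above; the variable names are illustrative, not from the thesis code):

```python
# Mapping of the 2-level hierarchical classifier's main classes
# to their subclasses, exactly as listed in this appendix.
HIERARCHY = {
    "Business and Logistics": [
        "Business and Administration",
        "Transport Services",
    ],
    "Health, Life and Earth": [
        "Environment",
        "Health",
        "Life Sciences",
        "Social Science",
    ],
    "Human Sciences": [
        "Arts",
        "Education",
        "Humanities",
        "Journalism and Information",
        "Law",
        "Personal Services and Hobbies",
    ],
    "Tech and Physical Sciences": [
        "Computer Science",
        "Engineering",
        "Mathematics and Statistics",
        "Physical Sciences",
        "Tech",
    ],
}

# The union of all subclasses is exactly the flat classifier's 17 classes.
FLAT_CLASSES = sorted(c for subs in HIERARCHY.values() for c in subs)
print(len(FLAT_CLASSES))  # 17
```

Note that the two designs partition the same label set: every flat class appears under exactly one main class.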


B Python Build Dependencies

Python version 3.5.2 with the following dependencies:
Cython 0.24.1
matplotlib 1.5.3
nltk 3.2.1
numpy 1.11.2
scikit-learn 0.18
scipy 0.18.1
stop-words 2015.2.23.1
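For reproducibility, the version list above can be pinned in requirements-file form. A small sketch that records these versions (copied verbatim from the list; the dictionary itself is illustrative, not part of the thesis code):

```python
# Dependency versions as listed above, for the Python 3.5.2 environment.
DEPENDENCIES = {
    "Cython": "0.24.1",
    "matplotlib": "1.5.3",
    "nltk": "3.2.1",
    "numpy": "1.11.2",
    "scikit-learn": "0.18",
    "scipy": "0.18.1",
    "stop-words": "2015.2.23.1",
}

# Render the mapping as requirements.txt-style pinned specifiers.
requirements = "\n".join(
    "{}=={}".format(name, version)
    for name, version in sorted(DEPENDENCIES.items())
)
print(requirements)
```

Writing the resulting string to a `requirements.txt` file would let `pip` recreate the environment with exact versions.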


C Confusion Matrix

Confusion matrix of the result from the best performing flat classifier design, using BOW features with term-counted bigrams, lowercase conversion, and features for text length and number of quotation marks. The bold-marked diagonal shows the correct predictions of the test. Rows and columns correspond to the 17 classes of the flat design (Arts, Business and Administration, Computer Science, Education, Engineering, Environment, Health, Humanities, Journalism and Information, Law, Life Sciences, Mathematics and Statistics, Personal Services and Hobbies, Physical Sciences, Social Science, Tech, Transport Services), with a final Total column per row.

[17 × 17 confusion matrix table; the individual cell counts are not legible in this extraction.]
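For reference, a confusion matrix like the one in this appendix is accumulated by counting (true class, predicted class) pairs, with diagonal entries holding the correct predictions. A minimal pure-Python sketch using hypothetical labels (three of the flat classifier's classes; the data and function are illustrative, not the thesis results):

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, classes):
    """Return a dict-of-dicts confusion matrix: counts[true_class][predicted_class]."""
    counts = Counter(zip(y_true, y_pred))
    return {t: {p: counts[(t, p)] for p in classes} for t in classes}


# Hypothetical example with three of the flat classifier's classes.
classes = ["Arts", "Health", "Tech"]
y_true = ["Arts", "Arts", "Health", "Tech", "Tech"]
y_pred = ["Arts", "Tech", "Health", "Tech", "Arts"]

cm = confusion_matrix(y_true, y_pred, classes)

# Diagonal entries count the correct predictions for each class.
print(cm["Arts"]["Arts"], cm["Health"]["Health"], cm["Tech"]["Tech"])  # 1 1 1
```

In the thesis's own stack, `sklearn.metrics.confusion_matrix` computes the same table as a NumPy array given `y_true`, `y_pred`, and an explicit `labels` ordering.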