
COMP8755 – Individual Computing Project

Text Classification: Producing explainable and performant solutions using hybrid lexicon-based approaches

Name: Qiutian Chen
University ID: u5789678
Supervisors: Priscilla Kan John, Kerry Tylor


Introduction

Background

Automatically classifying text documents in document-oriented information systems is an interesting and challenging problem across areas such as libraries and social media (Twitter, blogs, Facebook, reviews, etc.). It gives users an efficient way to browse massive amounts of data with ease (Clos, et al., 2017). Moreover, presenting why a classifier produces a given result helps humans understand its mechanism, better evaluate whether the results are as accurate as expected, and accept the classifications on the strength of reasonable explanations. This report describes the research procedure and current progress in re-implementing Clos et al.'s (2017) work.

Project Motivation

In real-world settings there is a massive amount of data to be categorized so that its features can be indicated more clearly, and automatic classification helps people find the information they want to retrieve. The collected and categorized data can be valuable in business: categorized customer information helps companies learn users' preferences and interests, from which they can also make recommendations that attract more consumers. In addition, the features extracted by the classifier explain why each item was classified as it was, allowing managers to validate whether the result is reasonable by human perception. As more industries come to require intelligent classification of data and documents, an efficient classifier with extracted features will have a wide range of applications.


Project Aim

In this project, the primary aims are as follows: 1) to study different kinds of classifiers, which may be specialized for distinct input datatypes (text, table entries, visual information, voice information, etc.); 2) to implement the classifier (RALAXNET) described in Clos et al.'s work, which automatically categorizes target data. The datasets involved cover two particular classification standards, stance and sentiment, with separate datasets for each. Furthermore, the features with the most influence on the classification should be extracted. After the classifier model has been created (trained), these extracted features should be used to mark the input text in a distinct way (colours, fonts, underlines, etc.) wherever the feature words appear in it.

Literature Review

Classifiers

Classifier construction depends on the algorithm underlying it. Fernández-Delgado et al. (2014) compared hundreds of classifiers across different datasets and showed that different types of classifiers have their own characteristics, target fields, and suitable datasets. It is therefore essential to examine the algorithm to ensure that the selected classifier suits the data to be processed.

Lexicon-based Classifiers

Overview

A lexicon-based classifier can be expressed as a dictionary-like table that stores a weight for each word in it. The weights measure, for example, how strongly a word affects a sentence, or in other words, how a term in a sequence influences which class that sequence tends to belong to.

Three different techniques are generally used to build a lexicon (Clos, et al., 2017). The first is the traditional hand-crafted lexicon, built manually, with every included word rated by different individuals. Some samples of weighted words from such a lexicon are listed below in table 1. The features in the table are distinctive in their sentiment; they are taken from VaderSentiment (Hutto, C.J. & Gilbert, E.E., 2014), which this project uses as its pure lexicon-based classifier.

Word       Mean score   Standard deviation   Scores (-4 to 4) manually rated by 10 individuals
Peace          2.5          1.0247             3  2  1  4  2  3  4  1  3  2
Poison        -2.5          0.92195           -4 -3 -2 -4 -2 -2 -2 -3 -1 -2
Secure         1.4          0.4899             1  2  1  1  2  1  1  2  2  1
slavery       -3.8          0.4               -4 -3 -4 -3 -4 -4 -4 -4 -4 -4
aboard         0.1          0.3                0  0  0  0  1  0  0  0  0  0

Table 1

In this table, a positive score implies that the corresponding word tends toward positive sentiment, and a negative score the opposite. Besides the mean of the ten scores, which is used as the weight, the lexicon also provides the standard deviation for each word, showing how much the ten individual scores fluctuate and therefore, to some extent, how reliable the mean score is. The second technique is the ontology-based lexicon, generated by propagating objects on a graph according to known seed words and the synonymy, antonymy and hypernymy relationships connected to them (Esuli & Sebastiani, 2006); as this technique is well outside the scope of this project, it is not discussed further. The third is the corpus-statistics-based lexicon, which mainly relies on conditional probability and pointwise mutual information, both frequently used in statistics. However, this technique is more likely to overemphasize the connections between terms and classes: at a simple statistical level, if a term such as "September" happens to occur very often within one class, its weight in the lexicon may be misled toward that class even though the term itself has no class tendency.
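As a minimal illustration (not the project's actual code), a hand-crafted lexicon and the way it scores a text can be sketched as follows, reusing the words and mean scores from table 1:

```python
# A hand-crafted lexicon: each word maps to its mean human rating (table 1).
LEXICON = {
    "peace": 2.5,
    "poison": -2.5,
    "secure": 1.4,
    "slavery": -3.8,
    "aboard": 0.1,
}

def lexicon_score(text):
    """Sum the weights of all lexicon words found in the text.

    A positive total suggests positive sentiment, a negative total the
    opposite; words missing from the lexicon contribute nothing.
    """
    return sum(LEXICON.get(word, 0.0) for word in text.lower().split())

print(lexicon_score("peace feels secure"))   # 2.5 + 1.4 = 3.9
print(lexicon_score("slavery is poison"))    # -3.8 + -2.5 = -6.3
```

Unknown words scoring zero is exactly the coverage limitation discussed in the next section: the lexicon can only judge words someone has rated.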

Advantages and Limitations

A hand-crafted lexicon can be viewed as an open dictionary whose detailed entries anyone can check at any time. In addition, because it is fundamentally human-built, it can easily be modified and tuned when required. On the other hand, building a complete lexicon is an enormous job: there are well over fifteen thousand English words in current use, many with multiple meanings, so construction can consume a great deal of time. Even once a lexicon is successfully built, testing and evaluating it on various types of data also takes time and effort, and its accuracy still cannot be guaranteed.

Classifiers Based on Machine Learning

Overview

Machine learning is a tool that allows systems to automatically learn programs from data (Domingos & University of Washington, 2012). In the last decade, the development of machine learning has accelerated significantly and spread into more areas, and its most widely used and mature application is classification. A classifier based on machine learning is generally a system that extracts features from many input samples and characterizes them to generate a model, which can then predict the class of new incoming inputs.

Advantages and Limitations

As the performance of machines such as personal computers and laptops keeps increasing, machine learning models become easier to support. Using a machine to learn from a large amount of data is far more efficient than manually building a lexicon or compiling statistics. Furthermore, it is possible to generate multiple models for distinct types of data without spending much time, which is flexible and practical, rather than building one lexicon that might not fit all the data. However, machine learning requires datasets pre-labelled with the classes (details are given below), so extra effort is needed to construct well-formatted datasets to feed the model. Overfitting is also a common problem for machine learning systems: the model fits the training data too closely, so it may fail badly at predicting new inputs or classify them with low accuracy.

Datasets

As mentioned above, the underlying reference work uses two classification standards, stance and sentiment. Investigating the original datasets involved, those extracted from the Internet Argument Corpus (IAC) and the CreateDebate forum (CD) were used for stance classification, while the Amazon, Yelp and IMDb reviews (AYI) and Amazon user reviews (AMZ) were used for sentiment classification. Considering the size and richness of the datasets, this project applies the CD and AYI datasets for stance and sentiment classification respectively.

The CD dataset contains forum posts from users on different topics and issues, each labelled with the user's stance. The AYI dataset contains reviews from three large online platforms: Amazon (online electronic commerce), Yelp (crowd-sourced reviews of local restaurants and businesses), and IMDb (the Internet Movie Database).

Potential Tasks

Beyond automatic classification, the technique can also be applied or extended to other areas. This project therefore has two potential extension tasks: 1) emotion detection from text (Bandhakavi, et al., 2017) and 2) re-processing search engine results (Chen & Dumais, 2000).

To apply the classifier to emotion detection, the datasets should first be categorized by emotion. Since emotion relates to both stance and sentiment, there is a chance the classifier will also work for emotion detection. Re-processing search engine results requires an interface that acts as a medium between the user and the results: the agent takes the user's input as the keywords to retrieve via a search engine, re-processes the returned results with the classifier, and then displays them to the user.


Project Procedure and Changes Made

Initial steps

Pure Lexicon-based Classifier

As the classifier to be implemented is lexicon-based, it is necessary to study how a lexicon-based classifier estimates its outcome. For this purpose, a pure lexicon-based classifier named VaderSentiment (Hutto, C.J. & Gilbert, E.E., 2014) is used as a reference. This classifier has a manually built lexicon of more than 7000 words; each word was rated from -4 to 4 by 10 individuals, and the mean of those scores is its weight. Given an input text, the classifier reports the tendency of the text to be negative, neutral, or positive, plus a compound score. Examples are shown in figure 1.

Figure 1
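To make the output format in the figure concrete, the shape of such a result can be sketched with a toy scorer. The tiny lexicon below and the details of the proportions are illustrative assumptions, not the library's real code; only the alpha = 15 normalisation of the compound score follows VaderSentiment's documented approach:

```python
import math

# Toy lexicon standing in for VaderSentiment's 7000+ rated words.
LEXICON = {"peace": 2.5, "secure": 1.4, "poison": -2.5}

def polarity_scores(text):
    """Return neg/neu/pos proportions and a compound score in [-1, 1]."""
    scores = [LEXICON.get(w, 0.0) for w in text.lower().split()]
    pos = sum(s for s in scores if s > 0)
    neg = -sum(s for s in scores if s < 0)
    neu = float(sum(1 for s in scores if s == 0))
    total = pos + neg + neu or 1.0
    raw = sum(scores)
    # Squash the raw weight sum into [-1, 1]; alpha = 15 as VADER uses.
    compound = raw / math.sqrt(raw * raw + 15)
    return {"neg": round(neg / total, 3), "neu": round(neu / total, 3),
            "pos": round(pos / total, 3), "compound": round(compound, 4)}

print(polarity_scores("peace is secure"))
```

A clearly positive sentence yields a positive compound score and a pos proportion larger than neg, matching the trend display described above.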

In addition, some modifications were made to the original code. To display more detailed weighting information, the classifier was also made to print how each word in the text affects the result, as shown in figure 2.


Figure 2

Tool Selection

Machine learning is currently supported in many development languages on different platforms. As Python is a friendly object-oriented programming language with many machine learning and data mining tools, it was selected as the language for implementing the target classifier in this project. Within Python, scikit-learn is a widely recommended tool for designing and implementing machine learning models, with many built-in classifiers and algorithms, so it was the first tool considered for this project.

Baseline Classifiers Selection

Before implementing the target classifier, several popular and widely used classifiers were taken as a starting point. This helps in understanding the mechanism of machine learning and the basic rules to observe during the later implementation.

The baseline classifiers selected are Naïve Bayes, SVM, and Decision Tree.

Naïve Bayes

This is a probabilistic classifier that applies Bayes' theorem. It is commonly used in fields such as email filtering and simple text classification (a few categories). Its core approach is to use the word frequencies extracted from the input data as features.

SVM

SVM is the abbreviation of support vector machine. It is a model that represents each sample as a point in space and uses a gap (margin) to separate the points into different classes.

Decision Tree

A decision tree can be imagined as a model whose branches extend like those of a real tree. It is a flowchart-like graph in which each node is a test whose branches represent the test's outcomes, eventually leading to the end of the tree with a consequence (class label). Decision trees are also popular in game theory for analysing strategies and tactics.


Classifier       Precision   Recall
Naïve Bayes        0.801      0.800
SVM                0.831      0.830
Decision Tree      0.640      0.640
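The three baselines can be reproduced in outline with scikit-learn. The toy reviews below are invented stand-ins for the real CD/AYI data, so the scores they yield are not the ones in the table above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Invented stand-in data: 1 = positive review, 0 = negative review.
texts = ["great movie", "loved the plot", "really enjoyable film",
         "awful food", "terrible service", "worst experience ever"]
labels = [1, 1, 1, 0, 0, 0]

for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("SVM", LinearSVC()),
                  ("Decision Tree", DecisionTreeClassifier(random_state=0))]:
    # Bag-of-words features (word frequencies) feed each classifier.
    model = make_pipeline(CountVectorizer(), clf)
    model.fit(texts, labels)
    print(name, model.predict(["great plot", "terrible film"]))
```

The pipeline pattern (vectorizer followed by classifier) is the same regardless of which of the three baselines is plugged in, which is what makes the comparison straightforward.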

Classifier Implementation

Structure description

The target classifier to be implemented is based on machine learning. Whenever a sequence of terms (a sentence, a review, a post, etc.) is fed into the model, it is split into single terms, and their summed weights (randomly initialized) combined with the scores from the context window are compared against the class label; the resulting error is applied to update the terms' weights, a process referred to as regression. After all inputs have been processed, a lexicon-like model containing the weights is generated, which predicts the class of a new input according to the terms it contains. The structure is shown in figure 3.

Figure 3
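The weight-update loop described above can be sketched in miniature. This sketch ignores the context window and modifiers, and the learning rate, epoch count, and initialization range are arbitrary assumptions:

```python
import random

def train_lexicon(samples, epochs=200, lr=0.1):
    """Learn a term-weight lexicon by regressing summed weights onto labels."""
    random.seed(0)
    weights = {}
    # Randomly initialize a weight for every term seen in the training data.
    for text, _ in samples:
        for term in text.split():
            weights.setdefault(term, random.uniform(-0.1, 0.1))
    for _ in range(epochs):
        for text, label in samples:          # label is +1 or -1
            terms = text.split()
            prediction = sum(weights[t] for t in terms)
            error = label - prediction       # compare against the class label
            for t in terms:                  # spread the correction over terms
                weights[t] += lr * error / len(terms)
    return weights

w = train_lexicon([("good fun", 1), ("bad boring", -1), ("good film", 1)])
print(w["good"], w["bad"])
```

After enough passes, terms that co-occur with positive labels drift to positive weights and vice versa, yielding exactly the lexicon-like model the paragraph describes.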


Design and implementation

This structure requires a convolutional neural network to configure the model's layers, which the scikit-learn tool does not support. Hence Keras in Python was used as the tool to implement this classifier.

a) Convert the input into vectors:

After loading all the data in the dataset, it is essential to convert it into a clear, simplified form that can be fed into the model. All words occurring in the input files are collected into a vocabulary, each with its own unique index, and each input sequence of terms is then represented by the corresponding indices of those terms. Similarly, the possible modifiers (adverbs and conjunctions) are converted into a vector of indices. Example vectors are shown in figure 4.

Figure 4
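Step a) can be sketched as follows; the indexing scheme (1-based indices, 0 for unseen words) is an assumption for illustration:

```python
def build_vocab(texts):
    """Assign each distinct word a unique integer index (1-based; 0 = unseen)."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode(text, vocab):
    """Represent a sequence of terms by the indices of those terms."""
    return [vocab.get(w, 0) for w in text.lower().split()]

texts = ["the dog runs quickly", "the cat sleeps"]
vocab = build_vocab(texts)
print(encode("the dog sleeps", vocab))   # → [1, 2, 6]
```

Modifiers would get their own vocabulary and index vector in exactly the same way.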


b) Train the model:

When a sentence such as "the dog runs quickly" is fed into the model (figure 5), the corresponding place in the matrix is set to 1 for each term. The context window has size 3, meaning the one word immediately before and the one word immediately after the current term are included. Since "quickly" is the modifier in this example, the context window for "runs" contains a modifier as well as the word "quickly" itself.

Figure 5
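The size-3 context window described above can be sketched as:

```python
def context_windows(tokens, size=3):
    """Yield each term with its context window: the word immediately
    before, the term itself, and the word immediately after."""
    half = size // 2
    for i, term in enumerate(tokens):
        yield term, tokens[max(0, i - half): i + half + 1]

for term, window in context_windows("the dog runs quickly".split()):
    print(term, window)
# for "runs", the window is ['dog', 'runs', 'quickly']
```

At the edges of the sentence the window is simply truncated, so the first and last terms see only one neighbour.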

c) In machine learning, some configurable hyperparameters can influence the training result. By tuning them, the generated model may achieve higher accuracy on new input. The hyperparameters used to tune the model in this project are listed below:


1. Epochs: one epoch is a single complete forward and backward pass of the ENTIRE dataset through the neural network. Multiple epochs generally let the model fit the training data more closely and reduce the influence of diverse data.

2. Batch size: the total number of training examples present in a single batch. In most cases it is impractical to pass the entire dataset through the neural network at once, so the dataset is divided into several batches of the declared size.

3. K-fold cross validation: k-fold cross validation is generally used to prevent overfitting and to pick the best model from the same training data. In 10-fold cross validation, for example, the training data is split into 10 groups. With one group held out each time, 10 different models are generated from the remaining nine groups, and each held-out group is used to test the corresponding model, yielding an accuracy score. The mean of these 10 scores is used to evaluate the approach, and the model with the highest score is kept for later use.
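The splitting logic behind k-fold cross validation can be sketched as plain index bookkeeping (a hand-rolled illustration; it assumes the sample count divides evenly into k folds):

```python
def kfold_indices(n_samples, k):
    """Yield (train, held_out) index lists for k-fold cross validation."""
    fold_size = n_samples // k
    indices = list(range(n_samples))
    for i in range(k):
        held_out = indices[i * fold_size:(i + 1) * fold_size]
        # Everything outside the held-out slice is used for training.
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, held_out

folds = list(kfold_indices(20, 10))
print(len(folds))        # 10 models, one per held-out group
print(folds[0][1])       # the first held-out group: [0, 1]
```

Each of the k models is trained on its `train` indices and scored on its disjoint `held_out` indices, and the mean of the k scores evaluates the configuration.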

Alternative solutions

During implementation, I tried to configure the convolutional kernel, which plays the role of the context window in the structure, so that it would consider ONLY the modifiers; due to time limitations I was unable to complete this part. An alternative solution was therefore applied: in the convolutional layer, the kernel is simply set to size 3 rather than manually configured. The final version of the classifier model is therefore as follows:


Layer 1: an embedding layer to map the term indices; the number of features is limited to the 400 most frequently occurring words, to reduce noise during training and improve speed.

Layer 2: a convolutional layer with kernel size 3 to apply the context window.

Layer 3: a hidden layer with 250 neurons.

Layer 4: an output layer using softmax to decide the class distribution.
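In Keras, the four layers above can be sketched roughly as follows. The embedding dimension, convolution filter count, pooling step between the convolution and the hidden layer, and binary output size are all assumptions not specified in the text:

```python
from tensorflow.keras.layers import (Conv1D, Dense, Embedding,
                                     GlobalMaxPooling1D)
from tensorflow.keras.models import Sequential

MAX_FEATURES = 400   # layer 1: top 400 most frequent words

model = Sequential([
    Embedding(MAX_FEATURES, 50),                   # layer 1: term embeddings
    Conv1D(64, kernel_size=3, activation="relu"),  # layer 2: context window of 3
    GlobalMaxPooling1D(),                          # assumed pooling step
    Dense(250, activation="relu"),                 # layer 3: 250 neurons
    Dense(2, activation="softmax"),                # layer 4: class distribution
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The pooling layer is an assumption needed to collapse the convolution's per-position outputs before the dense hidden layer.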

Running Result

Figures 6 and 7 chart the running result using 10-fold cross validation, with the following hyperparameters:

epochs = 30, batch_size = 10, kernel_size = 3, k-fold = 10


Figure 6


Figure 7

Interestingly, within a single one of the 10 folds (figures 8 and 9), the statistics change over the 30 epochs. As the epochs increase, the model fits the training data better, but the loss on the validation data keeps increasing. This indicates that more epochs do not lead to a better model: overfitting is always a risk when the same data keeps being used to train a model.

[Figure: 10-Fold Cross Validation Result — loss per fold: 1.57, 1.31, 1.27, 1.60, 1.76, 1.14, 1.34, 0.99, 1.34, 1.34 (mean 1.37); accuracy per fold: 76.38%, 76.38%, 79.70%, 77.12%, 76.75%, 78.81%, 74.72%, 82.53%, 81.78%, 76.21% (mean 78.04%)]


Figure 8

Figure 9

[Figure: Trend changes as epochs progress — train_loss, train_accuracy, validation_loss and validation_accuracy over 30 epochs]


Conclusion

Using machine learning to implement a classifier significantly improves the efficiency of generating a practical model for classifying text. The comparison with the three popular baseline classifiers shows that a fine-tuned model whose layers are configured for the target datasets can outperform existing widely used classifiers. However, a perfect model is currently out of reach: tuning a model into its best condition is long-term work, as the number of hyperparameter combinations grows combinatorially, on the order of O(n^n).


Sentiment

Classifier          Accuracy
Naïve Bayes          76.70%
SVM                  70.60%
Decision Tree        67.70%
Target Classifier    79.33%

Future Works

Configure Convolutional Layer

The unfinished code for the convolutional kernel configuration remains in the Python file and is the next task to complete. Once this part is successfully implemented, the effect of modifiers on the terms should become more apparent.

For Search Engine Result

The current version of the classifier can already be used to re-process search engine results, which will be tested and evaluated manually. If possible, a user interface acting as an intermediary between the search engine results and the client will also be implemented.

Improve the Output

The classifier's output is currently displayed in the command window, which is not visually polished. Making it more acceptable to humans is worth reasonable effort, and as described in the project aim, displaying the input sentence with distinct colours to indicate how each term influences the sentence will be done in the future.
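A minimal sketch of that planned colour marking, using ANSI terminal colours and an illustrative two-word lexicon (the weights and escape-code choice are assumptions, not the project's code):

```python
LEXICON = {"peace": 2.5, "poison": -2.5}        # illustrative weights
GREEN, RED, RESET = "\033[92m", "\033[91m", "\033[0m"

def highlight(sentence):
    """Wrap positively weighted words in green and negative ones in red."""
    out = []
    for word in sentence.split():
        weight = LEXICON.get(word.lower())
        if weight is None:
            out.append(word)                     # not a feature word
        elif weight > 0:
            out.append(GREEN + word + RESET)
        else:
            out.append(RED + word + RESET)
    return " ".join(out)

print(highlight("Peace beats poison"))
```

The same dispatch could select fonts or underlines instead of colours in a richer interface.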


References

Bandhakavi, A., Wiratunga, N. & Massie, S., 2017. Lexicon Generation for Emotion Detection from Text. IEEE Intelligent Systems, 13 February, pp. 102-108.

Chen, H. & Dumais, S., 2000. Bringing order to the Web: automatically categorizing search results. CHI '00: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1 April, pp. 145-152.

Clos, J., Wiratunga, N. & Massie, S., 2017. Towards Explainable Text Classification by Jointly Learning Lexicon and, Aberdeen, United Kingdom: Robert Gordon University.

Domingos, P. & University of Washington, 2012. A few useful things to know about machine learning. Communications of the ACM, 55(10), pp. 78-87.

Esuli, A. & Sebastiani, F., 2006. SENTIWORDNET: A Publicly Available Lexical Resource. In: Proceedings of LREC, volume 6, pp. 417-422.

Fernández-Delgado, M., Cernadas, E., Barro, S. & Amorim, D., 2014. Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?. Journal of Machine Learning Research, 15, pp. 3133-3181.

Hutto, C.J. & Gilbert, E.E., 2014. VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. In: Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media (ICWSM-14).