Semi-supervised review tweet classification · 2020. 7. 10. · JALT VOF Semi-supervised review...
JALT VOF
Semi-supervised review tweet classification
Bachelor thesis
Gijs van der Voort (6191053)
6/12/2012
Supervised by: Manos Tsagkias and Leen Torenvliet
In this paper a method for classifying review tweets and a method for semi-automatically gathering training sets are proposed. Using classic text classification features, quality prediction features, Twitter specific features and linguistic features, a Random Forest classifier is able to correctly classify approximately 83% of the dataset. The proposed method for kick-starting uses very basic search queries to gather data. Search queries consisting of a single hashtag are too broad and do not work. Extending the single hashtag query with two subject specific keywords creates a training set with which approximately 70% of the original dataset can be correctly classified, and the results of which can be used to continue training.
Table of Contents
1 Introduction
2 Related work
2.1 Text classification
2.2 Social context
2.3 Aggregating information from social media
2.4 Classification methods
3 Methods
3.1 Building base dataset
3.2 Preparing data for feature vector generation
3.3 Calculating word weights
3.4 Creating feature vectors for tweets
4 Experiments
4.1 Single feature classification
4.2 N-grams
4.3 Feature combinations
4.4 Classification method
4.5 Real life classification
4.6 Semi-automated training
5 Conclusion
6 References
1 Introduction
A common problem that one faces when buying a new product is determining which one of the many options to buy. A reasonable solution for this problem would be to ask a friend, a salesman, or somebody else you know, but are any of these people really able to give you custom tailored advice? Your friend does not know anything about the type of product you are interested in, we all know that the salesman is only thinking about his or her sales targets, and the rest of the people have no experience with the product whatsoever. If only we could ask somebody who actually has experience with and knowledge about the product and does not have personal intentions.
The internet has made it possible for everybody to exchange their experience with products and services on a global scale. Discussing the quality of products, “reviewing”, has become so popular that there are many online communities which solely consist of users exchanging reviews and ratings of products and services. These communities have made it possible for consumers to use “the wisdom of crowds” [1] as a guide when buying a new product.
Having reviews on a website can significantly boost the value of that website, which has increased the overall demand for reviews. Getting reviews on a website has proven to be difficult because people tend to write reviews only for websites which are already supported by an active review community, a “chicken or the egg” kind of dilemma. Are there other review sources which can be freely used?
Twitter is an online community where people can share short status updates, tweets [2], with the
rest of the world. Most of the tweets contain “pointless babble and self-promotion” [3], but people
also write about their experiences with people, events and with products and services: reviews.
Tweets posted to the public timeline, i.e. for everyone to see, are in the public domain and freely
available for all purposes and can therefore be used to enrich the contents of a website.
The Dutch yellow pages has asked Jalt, a Dutch social media consulting firm, to gather reviews of
restaurants, bars, diners, etc. from Twitter. Jalt requested to research the possibilities of
automatically gathering reviews from Twitter.
Manually filtering these reviews from all other tweets is either very time consuming, because all available tweets have to be read, or it limits the number of useful reviews because of aggressive filtering. To overcome these problems, the following research questions are addressed in this paper:
Is it possible to gather reviews from Twitter using supervised classification?
o Which features are most suitable for this task?
Is it possible to do semi-supervised classification?
o With data gathered using a single hashtag [4]?
o With data gathered using a very basic search query?
2 Related work
2.1 Text classification
Labeling documents is a topic that has existed for a long time and is considered “solved”. Most research on text classification combines a classifier with a BOW (bag-of-words) model of the documents, most of the time reducing the feature space with a form of word weighting and/or word stemming [5] [6] [7] [8] [9].
Two widely used methods for word weighting are the LLR (log-likelihood ratio or G) test [10] and the
tf*idf ratio test [11]. Both of these tests calculate numerical statistics describing the discriminative
value of words between two corpora.
The log-likelihood ratio test is especially interesting because it is one of the few statistical tests that
does not assume that words are distributed normally:
“When comparing the rates of occurrence of rare events, the assumptions on which these tests are
based break down because texts are composed largely of such rare events. For example, simple word
counts made on moderate-sized corpus show that words that have a frequency of less than one in
50,000 words make up about 20-30% of typical English language news-wire reports. This 'rare'
quarter of English includes many of the content-bearing words and nearly all the technical jargon.”
[12]
Word stemming is the process of reducing words to their stem or base and has proven to be effective
in spam detection [13] and information retrieval [14] [15].
Both methods can be used in combination with N-grams of a corpus [16] [17]. An N-gram is a sequence of N consecutive elements taken from a given list of elements, in this case the words of a corpus. The theory behind N-grams is that they contain more context than single words. Because of the simplicity and the scalability of the algorithm it is easy to increase the amount of context stored in the N-grams by increasing the N-gram size.
2.2 Social context
Because Twitter is a social medium, it is interesting to see whether other features are possible based on the social context of a tweet.
A topic that has been getting more attention because of the increasing popularity of social media is
quality assessment of social media content [18] [19] [20]. A technique that has been found effective
is exploiting the social graph of a user to predict the quality of the users’ content. By assuming four
social context consistency hypotheses [18] in combination with the users’ social graph, unsupervised
quality prediction is possible. A different approach to predict content quality is to use the authority
of a user as given by the community and the feedback that has been given on a question or answer
[19].
Both methods can be very effective and would have been used if not for the lack of social context and interaction [21] [22] on Twitter. It seems that undirected information sharing is the most common use of Twitter, and people mentioning their experience with a product or service very rarely have any connection at all.
Other methods of quality prediction that do not need social context are available. A commonly used metric for quality prediction is the lexical density metric [23] [24]. This metric indicates the amount of information inside a sentence or short piece of text by calculating the distance between non-stop words. The underlying assumption is that people who have something specific to say try to minimize the number of unnecessary words. This assumption is especially interesting because the limited length of tweets makes every wasted character count.
The way people write on social media is different from normal written text: content from social media is full of slang and colloquialisms. A typical example of a message from a very excited writer on Twitter:
“OMG!!!!! I saw many countries trend #Happy7thSS501
OMG!! FIGHTING!! we will be No1 for sure!!!! :D” [25]
Features that are interesting to explore are the number of repeating characters, the number of
uppercase/lowercase characters and the number of punctuation characters [26].
2.3 Aggregating information from social media
Gathering statistics from social media by aggregating content is another field that is getting increased attention because of the growth of social media. By harnessing the wisdom of crowds it has proven possible to predict the rating of movies [27], predict elections [28] and train a profitable stock market trader [29]. Trying to distil information from a single corpus tends to be harder [30] and leaves room for further research [31].
Aggregating information from social media is a topic that will not be discussed in this paper, but it gives insight into possible features that can be used for classification. Features that can be used from tweets are:
The number of mentions [32] in a tweet
The number of hashtags in a tweet
The number of URLs in a tweet
The number of geo-locations in a tweet
2.4 Classification methods
To be able to run the classification tests, some sort of classification environment is needed. For most
programming languages there are classification libraries available but most of these libraries seem to
focus on either a single classification method or on application integration instead of experimenting.
To be able to switch between classification methods instantly when needed and not having to build a
new experimentation environment, an existing classification experiment environment has been used:
Weka.
Weka is an experimentation environment for data mining tasks. It supports a large number of classification methods such as SVM, neural networks, trees, Bayesian models and many more, and also supports data clustering, meta-classification, association and feature selection. All of these methods and algorithms are available through the open Java API and the command-line environment, and also through a GUI environment that can help with importing datasets from all sorts of file formats and database providers and offers visualizations such as feature scatterplots and feature distributions [33] [34].
3 Methods
The methodology consists of the following steps:
1. Building base dataset
2. Data cleaning, parsing and normalizing
3. Preparing data for feature vector generation
4. Performance testing of classifier on base dataset
5. Semi-automatic collecting of training data
6. Comparing the semi-automatic classifier with the base classifier
Each of these steps will be discussed in more detail in the following subsections.
3.1 Building base dataset
The dataset used for the research consists of tweets collected over a period of three months. To
collect the tweets, the Twitter search API [35] has been used. Although freely available and fast, it
has four limitations:
It is impossible to get all messages in a given language
The search API is rate limited to 150 requests per hour
The allowed complexity of search queries is limited
You can only search a limited time back in history
To overcome these problems a tweet collecting system has been built.
At the core of the tweet collecting system lies the assumption that every review contains words that
have a judging tone. By using as many as possible of these words it is safe to assume that all Dutch
reviews are in the collected dataset.
Because the allowed complexity of search queries is limited, using one big search query is impossible. However, splitting up the queries is easy because the search queries consist of combinations of keywords.
By assigning every query a percentage of the maximum allowed number of requests, the API is used
in the most optimal way without risking penalties. This also solves the problem of the limited search
history by collecting the tweets as fast as possible.
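The budgeting idea can be sketched as follows. This is a minimal illustration and not the collecting system itself: the function name `allocate_requests`, the example queries and their percentages are all hypothetical. Only the limit of 150 requests per hour comes from the text.

```python
def allocate_requests(query_percentages, max_requests_per_hour=150):
    """Split an hourly request budget over search queries.

    query_percentages maps a query string to an integer percentage of the
    budget. The queries and percentages used below are made-up examples.
    """
    if sum(query_percentages.values()) > 100:
        raise ValueError("percentages must not add up to more than 100")
    # Integer division rounds down, so the total never exceeds the limit.
    return {query: percentage * max_requests_per_hour // 100
            for query, percentage in query_percentages.items()}


budget = allocate_requests({
    "lekker gegeten": 50,      # hypothetical query
    "slecht restaurant": 30,   # hypothetical query
    "aanrader": 20,            # hypothetical query
})
print(budget)
```

Rounding down is a deliberate choice in this sketch: a few unused requests per hour are cheaper than a penalty for exceeding the rate limit.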
Collecting tweets using this system over a period of twelve weeks resulted in approximately 15 million tweets, and manual filtering reduced this to 6512 review tweets. Not every review about a place where one could eat or drink has been included in this set. The Dutch yellow pages has no interest in reviews about large chains like McDonald’s and Kentucky Fried Chicken, so these reviews have been discarded.
For classification experiments an evenly distributed dataset is required. This means that for every review a non-review must be present in the dataset. Therefore 6512 non-review tweets have been added to the dataset.
3.2 Preparing data for feature vector generation
Before feature vectors can be generated from the tweets the data has to be prepared. For every collected tweet a record containing the following elements is generated:
The message of the tweet, with newlines replaced by spaces
A list of the normalized words of the message
The post date and time of the tweet
A flag indicating if the tweet is a review
The output of a lexical analyzer
The output of a location detection service
The replacement of newlines by spaces is necessary because the lexical analyzer used in this research interprets a newline as the end of a document. Frog [36], the lexical analyzer used, is a command-line tool that is able to tokenize and lemmatize Dutch documents.
To create the list of normalized words the entire message should be converted to the same letter
case and split into words. In this paper a word is defined by the following regular expression, which is
suitable for the Dutch language:
\w+
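The normalization step can be sketched in a few lines of Python. The regular expression is the one given above. The choice of lowercase as the common letter case and the name `normalize_words` are assumptions made for this illustration, as is the example message.

```python
import re

# The word definition from the text: one or more word characters.
WORD_PATTERN = re.compile(r"\w+")


def normalize_words(message):
    """Convert a message to one letter case and split it into words.

    Lowercase is an assumption; the text only requires a single letter case.
    """
    return WORD_PATTERN.findall(message.lower())


words = normalize_words("Heerlijk gegeten bij #DeKas, echt een AANRADER!")
print(words)
# ['heerlijk', 'gegeten', 'bij', 'dekas', 'echt', 'een', 'aanrader']
```

Note that punctuation and the leading `#` of a hashtag are dropped by this definition, so the hashtag text itself is kept as a normal word.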
To search for location information embedded in a message, a location detection service is necessary;
for this paper Yahoo! Placemaker [37] has been used. This service is capable of identifying places in
unstructured data and returning geographic metadata.
Collecting and generating the results of these two services for 13000 records is a slow process. Therefore the results of both services are stored in the record as is, so that no information is lost because of early filtering and the data has to be collected only once.
3.3 Calculating word weights
Before the feature vectors can be created, the word weights as defined by both the log-likelihood ratio test and the tf*idf test must be calculated for all words in the training set. This training set consists of an evenly distributed review/non-review fraction of the original set, which may not be used afterwards for training or testing the classifier.
By joining all tweets of a given class, two large corpora are created. From the two corpora all possible n-grams are generated, including overlap. To illustrate this, the following example shows the conversion of a string to all 2-gram combinations:
D = “apples are rather tasty”
2-gram(D) = {“apples are”, “are rather”, “rather tasty”}
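The same conversion can be written as a short Python function. This is a minimal sketch; the function name `ngrams` is chosen for this illustration, and splitting on whitespace stands in for the word normalization of section 3.2.

```python
def ngrams(words, n):
    """Return all overlapping n-grams of a list of words, joined by spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


bigrams = ngrams("apples are rather tasty".split(), 2)
print(bigrams)
# ['apples are', 'are rather', 'rather tasty']
```

A list of k words yields k - n + 1 overlapping n-grams, and an empty list when the text is shorter than n words.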
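Using the N-grams of both corpora the log-likelihood ratio and tf*idf weight can be calculated. The tf*idf weight is calculated using the following formulas:
$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$$
$$\mathrm{tf}(t, d) = |\{ w \in d : w = t \}|$$
$$\mathrm{idf}(t, D) = \log \frac{|D|}{|\{ d \in D : t \in d \}|}$$
Where $D$ is all documents combined, $|D|$ the number of documents in $D$, $t$ a given term and $d$ a document from $D$. For this report $D$ consists of both the review and non-review corpus and $d$ is the review corpus.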
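As an illustration, the tf*idf weight can be computed as follows. This sketch assumes the common definition in which tf is the raw count of the term in the document and idf is the logarithm of the number of documents divided by the number of documents containing the term; the exact variant used in this research may differ. The function name and the two tiny example corpora are made up.

```python
import math
from collections import Counter


def tf_idf(term, document, documents):
    """tf*idf of a term, assuming raw-count tf and logarithmic idf.

    document is a list of terms, documents a list of such lists.
    """
    term_frequency = Counter(document)[term]
    containing = sum(1 for d in documents if term in d)
    if containing == 0:
        return 0.0  # the term occurs nowhere, so it carries no weight
    return term_frequency * math.log(len(documents) / containing)


# Hypothetical miniature corpora, standing in for the joined tweets per class.
review_corpus = ["lekker", "gegeten", "lekker"]
non_review_corpus = ["file", "gegeten"]
corpora = [review_corpus, non_review_corpus]

print(tf_idf("lekker", review_corpus, corpora))   # only in the review corpus
print(tf_idf("gegeten", review_corpus, corpora))  # in both corpora
```

With only two documents, a term that occurs in both corpora always receives weight zero under this definition, so the weight mainly separates terms that are exclusive to one class.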
For calculating the log-likelihood ratio weight of a term, first the following table is constructed:

                           Corpus 1   Corpus 2   Total
Frequency of word          a          b          a + b
Frequency of other words   c - a      d - b      c + d - a - b
Total                      c          d          c + d
Table 1

The values a and b are called the observed values, the values c and d are the total number of words in their respective corpus. For both corpora the expected value can now be calculated using the table above and the following formulas:
$$E_1 = \frac{c \, (a + b)}{c + d}$$
$$E_2 = \frac{d \, (a + b)}{c + d}$$
Using the expected values for term $t$ in both corpora, the log-likelihood ratio weight can be calculated:
$$\mathrm{LLR}(t) = 2 \left( a \ln \frac{a}{E_1} + b \ln \frac{b}{E_2} \right)$$
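The calculation can be sketched in Python using the quantities of Table 1: a and b are the observed frequencies of the word in corpus 1 and corpus 2, and c and d are the total number of words in each corpus. The sketch assumes the usual two-corpus form of the log-likelihood statistic, in which each expected value is the corpus size times the pooled relative frequency of the word. The function name is chosen for this illustration.

```python
import math


def log_likelihood_ratio(a, b, c, d):
    """Log-likelihood ratio of a word observed a and b times in two corpora
    that contain c and d words in total."""
    expected_1 = c * (a + b) / (c + d)
    expected_2 = d * (a + b) / (c + d)

    def term(observed, expected):
        # By convention 0 * ln(0) is taken to be 0.
        return observed * math.log(observed / expected) if observed > 0 else 0.0

    return 2 * (term(a, expected_1) + term(b, expected_2))


# A word used equally often in two equally large corpora does not discriminate.
print(log_likelihood_ratio(10, 10, 1000, 1000))
# A word that only occurs in the first corpus scores much higher.
print(log_likelihood_ratio(10, 0, 1000, 1000))
```

The statistic is zero when the relative frequencies are identical and grows as they diverge, which is what makes it usable as a ranking of discriminative words.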
From both sets the X highest ranking N-grams and their corresponding values are selected, where X is a number suitable for the test being executed. The selected N-grams are used for building the feature vector.
3.4 Creating feature vectors for tweets
After normalizing the data and calculating the word weights it is time to create the actual feature vectors. In this section all the features will be defined and explained where necessary. When describing the features, “words” refers to the words as found in the data normalization phase and “message” refers to the content of a tweet in its original form.
3.4.1 Post time ratio
The tweet post time as time of day in milliseconds divided by the total amount of milliseconds in 24
hours.
3.4.2 Message length
The number of characters in the message. This includes all types of characters.
3.4.3 Number of words
The words counted are the words that are found in the data preparation phase.
3.4.4 Unique words ratio
$$\frac{|\mathrm{unique}(words)|}{|words|}$$
The unique words ratio is calculated by dividing the number of unique words by the total number of words.
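The tweet property features of sections 3.4.1 to 3.4.4 are simple enough to sketch together. The function names are chosen for this illustration, and the use of a Python `datetime` for the post time is an assumption about how the timestamp is stored.

```python
from datetime import datetime

MILLISECONDS_PER_DAY = 24 * 60 * 60 * 1000


def post_time_ratio(posted_at):
    """Time of day in milliseconds divided by the milliseconds in 24 hours."""
    seconds = (posted_at.hour * 60 + posted_at.minute) * 60 + posted_at.second
    milliseconds = seconds * 1000 + posted_at.microsecond // 1000
    return milliseconds / MILLISECONDS_PER_DAY


def unique_words_ratio(words):
    """Number of distinct words divided by the total number of words."""
    return len(set(words)) / len(words) if words else 0.0


message = "lekker gegeten, echt lekker"
words = ["lekker", "gegeten", "echt", "lekker"]

print(post_time_ratio(datetime(2012, 1, 15, 12, 0)))  # noon
print(len(message))                                   # message length
print(len(words))                                     # number of words
print(unique_words_ratio(words))
```

Returning 0.0 for a message without words is a choice made in this sketch to avoid a division by zero.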
3.4.5 Mentions ratio
$$\frac{|mentions|}{|words|}$$
The number of mentions in a message divided by the number of words of a message. Mentions are defined by the regular expression:
(?:\s|^)@\w+
Which translates to: whitespace or begin of line followed by an at sign (@) and word characters.
3.4.6 Hashtags ratio
$$\frac{|hashtags|}{|words|}$$
The number of hashtags in a message divided by the number of words of a message. Hashtags are defined by the regular expression:
(?:\s|^)#\w+
Which translates to: whitespace or begin of line followed by a number sign (#) and word characters.
3.4.7 URLs ratio
$$\frac{|URLs|}{|words|}$$
The number of URLs in a message divided by the number of words of a message. URLs are defined by the regular expression:
(?:\s|^)(http://|https://)(\S+)
Which translates to: whitespace or begin of line followed by “http://” or “https://” and one or more non-whitespace characters.
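The three Twitter token ratios can be sketched with the regular expressions given above. The function name `token_ratios`, the returned dictionary keys and the example message are assumptions made for this illustration. Words are counted with the `\w+` definition of section 3.2.

```python
import re

MENTION = re.compile(r"(?:\s|^)@\w+")
HASHTAG = re.compile(r"(?:\s|^)#\w+")
URL = re.compile(r"(?:\s|^)(http://|https://)(\S+)")
WORD = re.compile(r"\w+")


def count_matches(pattern, message):
    """Count non-overlapping matches of a pattern in a message."""
    return sum(1 for _ in pattern.finditer(message))


def token_ratios(message):
    """Mentions, hashtags and URLs per word of the message."""
    n_words = count_matches(WORD, message)
    if n_words == 0:
        return {"mentions": 0.0, "hashtags": 0.0, "urls": 0.0}
    return {
        "mentions": count_matches(MENTION, message) / n_words,
        "hashtags": count_matches(HASHTAG, message) / n_words,
        "urls": count_matches(URL, message) / n_words,
    }


# Hypothetical tweet: 9 words according to \w+ (the URL splits into 4 words).
ratios = token_ratios("@jan lekker gegeten bij #dekas http://t.co/abc")
print(ratios)
```

Because the patterns consume only the whitespace before a token, two tokens separated by a single space are both counted.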
3.4.8 Location name ratio
$$\frac{|locations|}{|words|}$$
The number of location names divided by the number of words in a message. The number of location names is found by counting the results of the location detection service.
3.4.9 Char class ratios
$$\frac{|\{ c \in message : c \in class \}|}{|message|}$$
The number of characters from a given class in a message divided by the length of the message. The character classes used are the following sets, as defined by the Python string constants [38]:
Digits
Whitespace
Punctuation
Uppercase
Lowercase
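A minimal sketch of the char class ratios is given below. It uses the Python 3 names of the string constants (`string.ascii_uppercase` and `string.ascii_lowercase`), which is an assumption about the Python version. The function name and the example message are made up.

```python
import string

CHAR_CLASSES = {
    "digits": string.digits,
    "whitespace": string.whitespace,
    "punctuation": string.punctuation,
    "uppercase": string.ascii_uppercase,
    "lowercase": string.ascii_lowercase,
}


def char_class_ratios(message):
    """Fraction of the characters of a message that fall in each class."""
    if not message:
        return {name: 0.0 for name in CHAR_CLASSES}
    return {
        name: sum(1 for char in message if char in chars) / len(message)
        for name, chars in CHAR_CLASSES.items()
    }


# 11 characters: 3 uppercase, 3 punctuation, 1 whitespace, 4 digits.
ratios = char_class_ratios("OMG!! 10/10")
print(ratios)
```

These constants only cover ASCII, so accented Dutch letters such as “é” fall outside every class in this sketch.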
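3.4.10 Repeating char ratio
$$\frac{1}{|m|} \sum_{i=2}^{|m|} [m_i = m_{i-1}]$$
The number of repeating, consecutive characters in a message, divided by the total number of characters in the message. Here $m_i$ is the i-th character of message $m$, and a character counts as repeating when it is identical to the character directly before it.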
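The repeating char ratio can be sketched as follows. The interpretation is an assumption: a character is counted as repeating when it equals the character directly before it, so a run of five exclamation marks contributes four repeats. The function name is chosen for this illustration.

```python
def repeating_char_ratio(message):
    """Characters equal to their predecessor, divided by message length."""
    if not message:
        return 0.0
    repeats = sum(1 for previous, current in zip(message, message[1:])
                  if previous == current)
    return repeats / len(message)


print(repeating_char_ratio("OMG!!!!!"))  # 4 repeats in 8 characters
print(repeating_char_ratio("abc"))       # no repeats
```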
3.4.11 Part of speech category ratio
$$\frac{|\{ w \in words : \mathrm{pos}(w) = category \}|}{|words|}$$
The number of words of a given category in a message divided by the number of words of the message. The part of speech categories used are the following:
Nouns
Verbs
Articles
Numerals
Prepositions
Adjectives
Adverbs
Conjunctions
The number of occurrences of each category can be extracted from the lexical analyzer output.
3.4.12 Lexical density
$$\mathrm{LD}(m) = \frac{1}{|K|} \sum_{k_i, k_j \in K, \; i < j} \frac{w(k_i) + w(k_j)}{\mathrm{dist}(k_i, k_j)^2}$$
Where $m$ is the message and $K$ the set of non-stop words (keywords) in $m$. The sum runs over the keywords: the sum of the weights of two keywords is divided by the square of the distance between them, so the contribution of a pair of keywords decreases quadratically as the distance increases. The result is then normalized by the number of keywords. The weight $w(k)$ of a keyword is defined by the log-likelihood ratio of the keyword.
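One possible reading of the lexical density feature is sketched below. Several points are assumptions: the sum is taken over all pairs of keywords, the distance between two keywords is the difference of their word positions, and every word without a weight is treated as a stop word. The function name, the example words and the weights are made up.

```python
from itertools import combinations


def lexical_density(words, weights):
    """Weighted keyword density of a message.

    words is the list of words of the message, weights maps a keyword to
    its log-likelihood ratio. Words without a weight count as stop words.
    """
    keywords = [(position, word) for position, word in enumerate(words)
                if word in weights]
    if not keywords:
        return 0.0
    total = sum(
        (weights[first] + weights[second]) / (j - i) ** 2
        for (i, first), (j, second) in combinations(keywords, 2)
    )
    # Normalize by the number of keywords.
    return total / len(keywords)


words = ["lekker", "gegeten", "bij", "de", "italiaan"]
weights = {"lekker": 2.0, "gegeten": 4.0, "italiaan": 6.0}  # hypothetical
density = lexical_density(words, weights)
print(density)
```

In this example the adjacent pair contributes 6, while the pairs at distance 4 and 3 contribute only 0.5 and roughly 1.1, which shows how strongly nearby keywords dominate the score.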
3.4.13 tf*idf score
$$\mathrm{score}(m) = \sum_{w \in m} \mathrm{tfidf}(w)$$
The sum of the tf*idf values of every word in message $m$.
3.4.14 Log-likelihood word ratios
After calculating the log-likelihood ratios, the top N words are selected. For every word $w$ in the top N, the occurrence ratio in message $m$ is calculated:
$$\mathrm{ratio}(w, m) = \frac{\mathrm{count}(w, m)}{|words|}$$
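The resulting part of the feature vector can be sketched as follows. The sketch assumes that the occurrence ratio of a word is its number of occurrences in the message divided by the number of words of the message, analogous to the other ratio features. The function name and the example words are made up.

```python
from collections import Counter


def llr_word_ratios(words, top_words):
    """One feature per selected word: its share of the words of the message."""
    if not words:
        return [0.0] * len(top_words)
    counts = Counter(words)
    return [counts[word] / len(words) for word in top_words]


top_words = ["lekker", "slecht"]  # hypothetical top 2 by log-likelihood ratio
features = llr_word_ratios(["lekker", "gegeten", "lekker", "bij"], top_words)
print(features)
# [0.5, 0.0]
```

Keeping the order of `top_words` fixed ensures that the same position in the vector refers to the same word for every tweet.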
4 Experiments
The experiments are set up to evaluate the performance of the classifier. The performance is defined as the percentage of correctly classified tweets. Because the dataset is evenly distributed, a performance of 50% indicates that the same results can be obtained by randomly classifying 50% of the tweets as review and 50% as non-review.
4.1 Single feature classification
Because of the diversity of the research areas from which features are extracted, testing each of these features individually gives a clear insight into its behavior, i.e. the ability to classify reviews based on only that feature.
The classification method used in this experiment is the Random Forest, using 10 trees.
4.1.1 Tweet properties
Figure 1: Tweet properties. Performance (%) per feature: Post time 57.0639, Length 52.6413, Words 55.5666, Unique words ratio 54.3535.
None of the features seems very discriminative between the two classes. The post time feature performs best with 57%. The length, words and unique words ratio features were expected to perform better because of the results seen in quality prediction research. A possible explanation is that the limited length of tweets limits the possible variation in length.
4.1.2 Textual statistics
Figure 2: Textual statistics. Performance (%) per feature: Punctuation ratio 52.2113, Digits ratio 53.6318, Whitespace ratio 60.4883, Uppercase ratio 58.5458, Lowercase ratio 51.0135, Letter ratio 56.8796, Repeating ratio 52.0117.
The textual features vary widely in performance. The punctuation and repeating char ratios were expected to perform better based on earlier research. The other features, implemented because they required little extra code, perform unexpectedly well, specifically the whitespace, letter and uppercase ratios. A possible reason that these features perform relatively well is that reviews tend to be more coherent messages without excessive uppercase usage, i.e. shouting.
4.1.3 Twitter tokens
Figure 3: Twitter tokens. Performance (%) per feature: Mentions ratio 55.0138, Hashtags ratio 61.4711, URLs ratio 51.6278.
Both the mentions and the hashtags ratios seem to perform rather well. The URLs ratio is almost non-discriminative. A reference to the service or product being discussed is considered normal on other media and would be expected on Twitter. It looks like people are not aware of the fact that they are writing a review and assume that the reader understands it or looks it up himself.
4.1.4 POS
Figure 4: POS. Performance (%) of the eight part of speech category ratios, in plotted order: 53.317, 57.1714, 53.125, 54.699, 63.9128, 56.8796, 54.8833, 53.2095.
The POS features show a wide range of results. The prepositions ratio is the best performing feature. Reviews about diners, bars, etc. typically contain a reference to where the dinner etc. took place. Those references are most of the time preceded by one of the following Dutch prepositions: “in”, “bij”. Twitter reviews also contain references to the people that accompanied the reviewer. The references to these people are often preceded by the Dutch preposition “met”. The performance of the verbs and the adjectives is as expected, because having dinner or having a drink is an activity and an adjective is needed to describe the experience.
4.1.5 Other
Figure 5: Other. Performance (%) per feature: Location ratio 54.0831, Lexical density 53.0467, tf*idf 52.3078.
These results are unexpected. Lexical density has proven to be a good indication of the quality of content, but it seems that lexical density, like the unique words ratio, suffers from the limited message length of tweets. The tf*idf feature is not as effective as expected; this may be because of the summing of all the tf*idf values. The location ratio does not perform well either. As with the URLs feature, it seems that people do not reference the establishment as well as one might expect from a review and very rarely include the location of the establishment. A different reason for the performance of the location ratio could be that the location name service used has difficulty with the Dutch language and can only filter out English location names.
4.1.6 LLR
Figure 6: LLR. Performance (%), axis range 77 to 81, against the number of words used as features (20 to 220), with one series per percentage of the dataset used for LLR calculation (10%, 20%, 30%, 40%, 50%).
The LLR ratio feature has been tested on a range of combinations of the following two variables:
The number of words taken as feature
The percentage of the dataset used for determining the LLR values of words
A general trend across all percentages of training data is the peak at the lowest number of words and the peak around 200 words. It seems that using more than 200 words decreases the performance across all sizes of training data. The peak at the beginning can possibly be explained by the small feature space and the high discriminative value of the top 20 words. The immediate decrease of performance between 20 and 80 words is more difficult to explain. It is possible that, because the most discriminative words are in the top 20, adding more words with possibly exponentially less discriminative value decreases the overall performance. By adding more and more words, more complex models of reviews and non-reviews can be built, resulting in an overall improvement in performance. The decrease in performance after 200 words is possibly caused by the very limited discriminative value of those extra words.
4.2 N-grams
To see whether N-grams can improve the performance of the LLR feature, this experiment compares the performance of different N-gram sizes. The number of N-grams used is 200 and 20% of the data has been used for LLR ratio calculation.
Figure 7: N-grams. Performance (%), axis range 30 to 80, for N-gram sizes 1 to 5.
We can see from the results in Figure 7 that using N-grams does not improve the classification performance. The bi-grams seem to very slightly outperform the uni-grams, i.e. the normal LLR feature, but not in a meaningful way. It is possible that the limited length of the tweets plays a large part in the observed performance. With bi-grams the number of possible combinations is squared compared to the uni-grams, which results in fewer shared word combinations across a single corpus.
4.3 Feature combinations
Now that we have seen how individual features perform, it is interesting to see how the groups perform when their features are combined. In this experiment we will also see how the classifier performs when combining all features from all groups. Although the LLR feature will not be combined in any way, it is interesting to compare the LLR feature with all the other groups of features.
Figure 8: Combinations. Performance (%) per feature group: Properties 61.7995, Statistics 62.905, Tokens 60.2641, POS 63.7494, Other 53.6417, LLR 79.5893, All 81.2014.
In these results it is clearly visible that even though single features do not perform well, increasing the feature space gives the classifier room to find more complex patterns. This goes for every group except “Other”. It seems that even combining the three very low performing features from this group does not give the classifier anything to work with. One interesting detail that stands out is that the LLR feature scores only about two percentage points lower than everything combined, suggesting that the classification that can be done using the other groups can mostly be done by the LLR feature.
4.4 Classification method
The classification method used in the previous experiments is the Random Forest. A more commonly used classifier for text classification is the Naïve Bayes classifier. There are many more classifiers available and the type of classifier can have a significant effect on the overall results. Therefore it is interesting to see how the different classifiers perform.
Figure 9: Classification methods. Performance (%) per classifier: C4.5 78.7736, Naive Bayes 74.7049, K* 67.9828, Random Forest 83.799, SVM 72.0294.
Because most classification methods have one or more parameters, the results in Figure 9 are those obtained with the best performing combination of parameters. Random Forest performs best of all the classification methods.
4.5 Real life classification
The dataset used in the previous experiments was gathered over the course of 12 weeks, from week 46 in 2011 to week 6 in 2012. The following two experiments are meant to find out how time affects the performance of the classifier.
4.5.1 Preceding week based classification
This experiment uses the two weeks preceding week n, one for training and one for LLR calculation, to classify week n. The idea behind this experiment is that the correlation in terms of content may be higher between consecutive weeks. When a big event occurs and people mention this event in their messages, it is possible that by using the preceding weeks for classification the classifier can take advantage of the correlation.
Figure 10: Preceding week as training set. Performance (%), axis range 75 to 90, per week number (48 to 52, then 0 to 6).
The results in Figure 10 are very irregular. From week zero onward there seems to be a promising growth in performance, only for it to fall back after week four. It is possible that, because two preceding weeks are actually used for training, the scope is too large to use events as extra information. A shorter period may be better, but the limited number of reviews gathered per week (approximately 200) already limits the performance.
4.5.2 Start of a new subject
Classification has only been done for reviews about restaurants, bars etc., but other clients of Jalt have already expressed their interest in reviews about other subjects. To find out how the classifier performs during the start of a new subject, this experiment classifies week n by using all weeks preceding week n for training.
Figure 11: All preceding weeks as training set. Performance (%), axis range 70 to 90, per week (46 to 52, then 0 to 5).
The results in Figure 11 are what one would expect from training with an increasing training set size. The first couple of weeks show a very irregular pattern, but overall the line increases until it stabilizes around 83%.
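The evaluation scheme of the second experiment, in which every week is classified using all preceding weeks for training, can be sketched as a generator of training and test splits. The function name and the week labels are hypothetical; training and testing the classifier itself is left out.

```python
def expanding_window_splits(weeks):
    """Pair every week, except the first, with all weeks that precede it.

    weeks must be in chronological order. Returns a list of
    (training_weeks, test_week) tuples.
    """
    return [(weeks[:index], weeks[index]) for index in range(1, len(weeks))]


# Hypothetical labels for the first four weeks of the collection period.
weeks = ["2011-W46", "2011-W47", "2011-W48", "2011-W49"]
splits = expanding_window_splits(weeks)
for training_weeks, test_week in splits:
    print(test_week, "<-", training_weeks)
```

The training set grows by one week per step, which matches the expectation that performance rises and then stabilizes as more data becomes available.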
4.6 Semi-automated training
In these experiments the possibility of semi-automated training has been explored. Instead of using a
part of the original dataset, new data has been collected for training. A classifier is trained
with this data and tested on the original dataset. If these experiments prove successful, the
effort required to train the classifier will be dramatically reduced, because no manual classification
is necessary when starting with a new review subject. In these experiments, both the number of LLR
terms and the percentage of the data used for LLR term generation have been taken into account and
tested in various combinations to find the best possible performance.
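As an illustration of what selecting a number of LLR terms could look like, the sketch below ranks terms with Dunning's log-likelihood ratio statistic (G²) over a 2x2 table of term counts. It rests on several assumptions: that the LLR features are based on this statistic, that tokens are obtained by simple whitespace splitting, and that the function names and example sentences are hypothetical. The parameter values in the grid are the ones shown on the axes and legends of the figures in this section.

```python
import math
from collections import Counter
from itertools import product
from typing import List, Sequence


def g2(k11: int, k12: int, k21: int, k22: int) -> float:
    """Log-likelihood ratio (G^2) of a 2x2 contingency table of counts."""
    n = k11 + k12 + k21 + k22
    cells = (
        (k11, k11 + k12, k11 + k21),
        (k12, k11 + k12, k12 + k22),
        (k21, k21 + k22, k11 + k21),
        (k22, k21 + k22, k12 + k22),
    )
    total = 0.0
    for observed, row_sum, col_sum in cells:
        if observed > 0:  # a zero cell contributes nothing to the sum
            expected = row_sum * col_sum / n
            total += observed * math.log(observed / expected)
    return 2.0 * total


def top_llr_terms(reviews: Sequence[str], others: Sequence[str], n_terms: int) -> List[str]:
    """Return the n_terms tokens whose counts differ most between both sets."""
    review_counts = Counter(tok for doc in reviews for tok in doc.lower().split())
    other_counts = Counter(tok for doc in others for tok in doc.lower().split())
    review_total = sum(review_counts.values())
    other_total = sum(other_counts.values())
    scores = {}
    for term in set(review_counts) | set(other_counts):
        k11 = review_counts[term]
        k12 = other_counts[term]
        scores[term] = g2(k11, k12, review_total - k11, other_total - k12)
    # Highest score first; ties are broken alphabetically for reproducibility.
    return sorted(scores, key=lambda t: (-scores[t], t))[:n_terms]


# Parameter combinations: number of LLR terms x fraction of data used for LLR.
N_TERMS = (10, 50, 100, 150, 200)
LLR_FRACTIONS = (0.1, 0.2, 0.3, 0.4, 0.5)
GRID = list(product(N_TERMS, LLR_FRACTIONS))

# Hypothetical example sentences, not taken from the dataset.
example_reviews = ["lekker gegeten bij de italiaan", "heerlijk gegeten vandaag"]
example_others = ["de trein is weer te laat", "vandaag regen"]
best = top_llr_terms(example_reviews, example_others, 1)
```

Note that G² is two-sided: a term that is typical for non-reviews scores just as high as one that is typical for reviews, so a real feature extractor may want to keep track of the direction as well.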
4.6.1 Single hashtag
Because Twitter uses hashtags to group content together, using a single hashtag for gathering
training data would be ideal. The first hashtag taken into consideration was “#review”.
For the English language this would have been a very usable hashtag, but unfortunately it is unusable
for Dutch content. A common Dutch hashtag that is often used for recommending something to others is
“#aanrader”. Using only “#aanrader” as search query, a new dataset has been gathered and used for
classification.
Figure 12
Looking at the results in Figure 12 we can see that the overall performance only barely exceeds 53%.
Looking at the tweets gathered using this query we see that recommendations for restaurants, bars,
etc. only make up a tiny fraction of the entire dataset, which probably makes the classifier too broad
for the specific type of reviews it is required to classify.
4.6.2 Single hashtag extended
Because of the poor results of the single hashtag, it is interesting to see whether extending the
single hashtag search query with a limited number of extra keywords can increase the performance of
the classifier. The number of keywords is limited to two, because the idea of semi-automated
classification is that starting a classifier for a new subject should take as little manual input
as possible.
Because most of the tweets are about eating (it is hard to review the quality of drinks), the keywords
chosen are “eten” (“to eat” or “food”) and “gegeten” (“eaten”), resulting in the search query:
#aanrader AND eten OR gegeten
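The query above can be read in more than one way, depending on how AND and OR are grouped. The sketch below applies the reading that seems intended, namely the hashtag combined with at least one of the two keywords. This grouping is an assumption, as are the function names and the example tweets; how the search service actually parsed the query is not stated here.

```python
import re

HASHTAG = "#aanrader"
KEYWORDS = {"eten", "gegeten"}


def tokenize(text: str) -> set:
    """Lowercase the text and return its words, keeping a leading '#' on hashtags."""
    return set(re.findall(r"#?\w+", text.lower()))


def matches_query(text: str) -> bool:
    """Assumed reading: #aanrader AND (eten OR gegeten)."""
    tokens = tokenize(text)
    return HASHTAG in tokens and bool(KEYWORDS & tokens)


# Hypothetical tweets, made up for this example.
examples = [
    "Heerlijk gegeten bij Luigi! #aanrader",  # hashtag and keyword
    "Mooi boek gelezen #aanrader",            # hashtag only
    "Gisteren gegeten bij Luigi",             # keyword only
]
selected = [tweet for tweet in examples if matches_query(tweet)]
```

Under the alternative grouping, (#aanrader AND eten) OR gegeten, the third example would also be selected, so it is worth checking which reading a search service applies before relying on the gathered data.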
[Plot for Figure 12, “#aanrader”: x-axis: LLR terms (10, 50, 100, 150, 200); y-axis: performance (%), 50 to 54; series: 10%, 20%, 30%, 40%, 50% (percentage used for LLR term generation)]
Figure 13
The extended search query outperforms the single hashtag query significantly. Again we see that
200 LLR terms perform best. The best performance seems to be reached when around 20 to 30 percent
of the data is used for LLR training. That the optimum lies at these lower percentages probably has
to do with the size of the dataset. Using the extended search query, only around 6 tweets per day
were found. The set used in this experiment therefore consists of only 200 tweets, in contrast to
the 2000 tweets in the single hashtag dataset.
5 Conclusion
In this thesis a method for classifying review tweets has been discussed, using various types
of features such as the textual contents, Twitter-specific tokens and parts of speech. With such a
system it is possible to automatically filter out reviews in a certain category, which webmasters
can use to enrich their website content. A method for kick-starting the classifier has also been
proposed, so that building a classifier for a new category takes very little human effort.
Using the proposed method it is possible to classify review/non-review tweets with 83% accuracy.
The LLR method has proven to be the most useful in classifying, but only in combination with the other
features was it possible to get beyond 80%. N-gramming of the words in a message has no effect
on the performance of the classifier.
The best classification method for this problem is the Random Forest. Although other classification
methods can possibly be improved by feature selection, the Random Forest has the advantage that
training can be very fast in comparison to other methods like SVM.
Kick-starting the classification process by using very simple search queries has also proven
possible. The performance of the classifier drops when a search query is too broad, but making it
more specific using only two extra keywords can significantly improve performance, even when a
training set of only 200 tweets has been gathered.
Further possible research would be to see what effect the size of the dataset has on the performance
of the kick-started classifier. The original dataset is specifically tailored to the wishes of the Dutch
yellow pages, e.g. not having big chains. It would be interesting to see how the classifier performs when
big chains are annotated as review instead of non-review. The influence of time on classification has
been tested over a period of three months. It would be interesting to do experiments over a longer
period of time with longer intervals to see if there are seasonal events etc., or with shorter intervals
to see the influence of more short-term events.
[Plot for Figure 13, “#aanrader AND eten OR gegeten”: x-axis: LLR terms (10, 50, 100, 150, 200); y-axis: performance (%), 62 to 72; series: 10%, 20%, 30%, 40%, 50% (percentage used for LLR term generation)]
6 References
[1] J. Surowiecki, The Wisdom of Crowds: Why the Many are Smarter Than the Few and how Collective Wisdom Shapes Business, Economies, Societies, and Nations, Doubleday, 2004.
[2] "About Tweets (Twitter Updates)," Twitter INC., [Online]. Available: http://support.twitter.com/groups/31-twitter-basics/topics/109-tweets-messages/articles/127856-about-tweets-twitter-updates. [Accessed 06 04 2012].
[3] R. Kelly, "Twitter Study - August 2009," Pear Analytics, San Antonio, 2009.
[4] Twitter INC., "What Are Hashtags ("#" Symbols)?," Twitter INC., [Online]. Available: http://support.twitter.com/articles/49309-what-are-hashtags-symbols. [Accessed 6 June 2012].
[5] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many relevant Features," Universiät Dortmunt, Dortmunt, 1998.
[6] I. Androutsopoulos, J. Koutsias, K. Chandrinos, G. Paliouras and C. Spyropoulos, "An evaluation of naive bayesian anti-spam filtering," Arxiv, 2000.
[7] L. Manevitz and M. Yousef, "One-class SVMs for document classification," The Journal of Machine Learning Research, vol. 2, pp. 139-154, 2002.
[8] H. Ragas and C. Koster, "Four text classification algorithms compared on a Dutch corpus," Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 369-370, 1998.
[9] Y. Matsuo and M. Ishizuka, "Keyword extraction from a single document using word co-occurrence statistical information," International Journal on Artificial Intelligence Tools, vol. 13, no. 1, pp. 157-170, 2004.
[10] M. Weintraub, "LVCSR log-likelihood ratio scoring for keyword spotting," in Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on, 1995.
[11] T. Tokunaga and I. Makoto, "Text categorization based on weighted inverse document frequency," in Special Interest Groups and Information Process Society of Japan (SIG-IPSJ), 1994.
[12] T. Dunning, "Accurate methods for the statistics of surprise and coincidence," Computational linguistics, vol. 19, pp. 61-74, 1993.
[13] S. Ahmed and F. Mithun, "Word stemming to enhance spam filtering," in the Conference on Email and Anti-Spam (CEAS’04), 2004.
20
[14] T. Sembok, "Word Stemming Algorithms and Retrieval Effectiveness in Malay and Arabic Documents Retrieval Systems," in Proceeding of World Academy of Science, Engineering and Technology, 2005.
[15] J. Carlberger, H. Dalianis, M. Hassel and O. Knutsson, "Improving precision in information retrieval for Swedish using stemming," in the Proceedings of NODALIDA, 2001.
[16] J. Fürnkranz, "A study using n-gram features for text categorization," Austrian Research Institute for Artifical Intelligence, 1998.
[17] W. Cavnar and J. Trenkle, "N-gram-based text categorization," Ann Arbor MI, vol. 48113, no. 2, pp. 161-175, 1994.
[18] Y. Lu, P. Tsaparas, A. Ntoulas and L. Polanyi, "Exploiting social context for review quality prediction," Proceedings of the 19th international conference on World wide web, pp. 691--700, 2010.
[19] J. Bian, Y. Liu, D. Zhou, E. Agichtein and H. Zha, "Learning to recognize reliable users and content in social media with coupled mutual reinforcement," in Proceedings of the 18th international conference on World wide web, 2009.
[20] M. Bosma, E. Meij and W. Weerkamp, "A Framework for Unsupervised Spam Detection in Social Networking Sites".
[21] H. Kwak, C. Lee, H. Park and S. Moon, "What is Twitter, a social network or a news media?," in Proceedings of the 19th international conference on World wide web, 2010.
[22] S. Perez, "Twitter is NOT a Social Network, Says Twitter Exec," ReadWriteWeb, 14 September 2010. [Online]. Available: http://www.readwriteweb.com/archives/twitter_is_not_a_social_network_says_twitter_exec.php. [Accessed 5 June 2012].
[23] G. Lee, J. Seo, S. Lee, H. Jung, B. Cho, C. Lee, B. Kwak, J. Cha, D. Kim and J. An, "SiteQ: Engineering high performance QA system using lexico-semantic pattern matching and shallow NLP," in Proceedings of the Tenth Text REtrieval Conference (TREC 2001), 2001.
[24] M. a. S. A. a. L. E. Hu, "Comments-oriented document summarization: understanding documents with readers’ feedback," in Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, 2008.
[25] @BlackRose50101, 7 June 2012. [Online]. Available: http://twitter.com/BlackRose50101/status/210752151255924736.
[26] B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu and M. Demirbas, "Short text classification in twitter to improve information filtering," in Proceeding of the 33rd international ACM SIGIR conference on research and development in information retrieval, 2010.
[27] A. Oghina, M. Breuss, M. Tsagkias and M. de Rijke, "Predicting IMDb Movie Ratings using Social Media," in 34th European Conference on Information Retrieval (ECIR 2012). Springer-Verlag, 2012.
21
[28] A. Tumasjan, T. Sprenger, P. Sandner and I. Welpe, "Predicting elections with twitter: What 140 characters reveal about political sentiment," in Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media, 2010.
[29] E. Ruiz, V. Hristidis, C. Castillo, A. Gionis and A. Jaimes, "Correlating financial time series with micro-blogging activity," in Proceedings of the fifth ACM international conference on Web search and data mining, 2012.
[30] G. Mishne, "Experiments with mood classification in blog posts," in Proceedings of ACM SIGIR 2005 Workshop on Stylistic Analysis of Text for Information Access, 2005.
[31] A. Go, L. Huang and R. Bhayani, "Twitter sentiment analysis," Final Projects from CS224N for Spring, vol. 2009, 2008.
[32] Twitter INC., "What are @Replies and Mentions?," Twitter INC., [Online]. Available: http://support.twitter.com/articles/14023. [Accessed 6 June 2012].
[33] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, Witten and I. H., "The WEKA data mining software: an update," SIGKDD Explor. Newsl., vol. 11, no. 1, pp. 10-18, 2009.
[34] Machine Learning Group at University of Waikato, "Weka 3: Data Mining Software in Java," Machine Learning Group at University of Waikato, [Online]. Available: http://www.cs.waikato.ac.nz/~ml/weka/. [Accessed 7 June 2012].
[35] Twitter INC., "GET search," Twitter INC., 18 April 2012. [Online]. Available: https://dev.twitter.com/docs/api/1/get/search. [Accessed 5 June 2012].
[36] A. v. d. Bosch, "Frog Dutch morpho-syntactic analyzer and dependency parser," ILK Research Group, 24 May 2012. [Online]. Available: http://ilk.uvt.nl/frog/. [Accessed 5 June 2012].
[37] Yahoo!, "Yahoo! Placemaker™ Beta," Yahoo!, [Online]. Available: http://developer.yahoo.com/geo/placemaker/. [Accessed 5 June 2012].
[38] The Python Software Foundation, "7.1. string - Common string operations," The Python Software Foundation, [Online]. Available: http://docs.python.org/library/string.html. [Accessed 6 June 2012].