Mining the Wisdom of the
Crowds
“Detecting new product ideas by text mining and machine
learning techniques”
by Kasper Christensen
Number of characters: 125.853
May 2013
Supervisor: Professor Joachim Scholderer
Quantitative Analytics Group (QUANTS)
Department of Business Administration
School of Business and Social Sciences
Aarhus University
This, of course, is great idea, but these days we come to except such things
from Brian.
I've got to believe that some simple modules could be priced very affordably,
perhaps in the $15-$30 range. More complex ones, of course, could be priced
higher.
But the hook-them-together-ness would be a great selling point. It would be
most cool.
Comment: Message detected by the means of machine learning and text mining
List of tables
List of figures
List of equations
List of appendices
Abstract
1 - Introduction
1.1 - Idea generation within online communities
1.2 - Objectives
1.3 - Methodology
1.4 - Delimitations and assumptions
1.5 - Structure
2 - Idea generation in online communities
2.1 - Online communities
2.2 - Collective intelligence
2.3 - Creativity
2.4 - Summary
3 - Detecting ideas
3.1 - The nature of data in online communities
3.2 - Text mining and natural language processing
3.3 - Machine learning in the text classification domain
3.3.1 - The imbalanced learning problem
3.3.2 - Feature selection
3.3.3 - Topic modelling
3.3.4 - Classification algorithms
3.3.5 - Performance measures
3.4 - The power of different text classification methods
3.5 - Summary
4 - Aims of study
5 - Method
5.1 - Mining the online community of Lugnet
5.2 - Construction of target variable
5.3 - Modelling the concept of an idea
5.3.1 - Data exploration
5.3.2 - Data partitioning
5.3.3 - Classification algorithms
5.3.4 - Term weighting scheme
5.3.5 - Data processing steps
5.3.6 - Feature selection methods
5.3.7 - Choice of final model
5.4 - Effect of seasonality and historical events on idea generation
6 - Results
6.1 - Reliability of the manual classification of target variable
6.2 - Detecting ideas
6.2.1 - Data partitioning
6.2.2 - Exploratory analysis
6.2.3 - Classifier performance given term weighting and processing steps
6.2.4 - Classifier and term weighting scheme given processing steps
6.2.5 - Assessing performance given varying feature selection methods
6.2.6 - Assessing candidate models
6.3 - Effect of seasonality on idea generation
6.3.1 - Exploratory analysis
6.3.2 - Creating dataset and handling missing data
6.3.3 - Variables exploration and variable transformations
6.3.4 - Defining model and assessing model assumptions
6.3.5 - Parameter estimates and goodness-of-fit
7 - Discussion & Conclusion
8 - Bibliography
List of tables
Table 1 - Overview over studies reviewed
Table 2 - Twenty discriminative terms and four positive and negative topics from training set
Table 3 - Results term weighting, processing and feature selection assessment
Table 4 - Results of candidate models performance
Table 5 - Twenty discriminative terms and four positive and negative topics from prediction set
Table 6 - Regression results

List of figures
Figure 1 - ROC chart of candidate models
Figure 2 - ROC chart of support vector machines with an under- and oversampled training set
Figure 3 - Fluctuations in ACTIVITY and IDEA given YEAR
Figure 4 - Histograms of ACTIVITY and IDEA
Figure 5 - Histogram of ERPM and box plot of ERPM from 1999 to 2012
Figure 6 - Histogram of LN.ERPM
Figure 7 - Fluctuations in LN.ERPM given MONTH and YEAR
Figure 8 - Residuals plot and histogram of residual distribution

List of equations
Equation 1 - Information gain
Equation 2 - Chi-square
Equation 3 - Optimal margin classifier objective function
Equation 4 - Objective function of support vector machine with slack variables
Equation 5 - Radial basis transformation
Equation 6 - Bayes theorem
Equation 7 - Accuracy
Equation 8 - Recall
Equation 9 - Precision
Equation 10 - F-measure
Equation 11 - Event rate per month
Equation 12 - Logit transformed event rate per month
Equation 13 - Regression model

List of appendices
Appendix A - Message view at www.lugnet.com
Appendix B - Message in .eml format
Appendix C - Message in .txt format
Appendix D - Descriptive statistics of regression data
Abstract

The rise of Web 2.0 coupled with the availability of information technology like
computers, tablets and smartphones has created increasing opportunities for
consumers and organizations to interact. This development is predicted to
revolutionize how we understand and utilize innovation. In product innovation, idea generation has long been a topic of interest, and from this perspective the concepts of collective intelligence and crowdsourcing have attracted attention in the literature. One drawback of crowdsourcing is that it often requires special software and a large number of people dedicated to the crowdsourcing task. In this thesis we propose an alternative
method for utilizing the wisdom of the crowds, based on text mining and machine
learning. Our study shows how a classification algorithm can be trained to detect
ideas generated inside an online community. Furthermore, we study the effect of
seasonality and historical events on idea generation inside a Lego online community.
Our results suggest that holiday seasons have an impact on idea generation in online
communities in our particular case of toys. The primary implication of our results is
that organisations can use our method to tap into online sources and detect ideas even
though the source was never designed for crowdsourcing.
1 - Introduction

1.1 - Idea generation within online communities
The rise of Web 2.0 has led to new opportunities for organizations to interact with
consumers. The increasing availability of information technology like computers,
tablets and smartphones (Tapscott & Williams, 2008) has created “the era of big
data” (Hsinchun Chen, Chiang, & Storey, 2012, p. 1185). A key characteristic of the
big data age is that information on the Web is not structured in tables. Rather, data is often stored in the form of simple text. Sources estimate that approximately 80% of organizations’ information is stored as text (Tan, 1999). This being the case, the Web offers a huge potential for analytical approaches such as machine learning and
text mining, which are geared towards big data (Hsinchun Chen et al., 2012). Several
authors have pointed out the potential these massive amounts of data might offer to
organisations, suggesting that businesses will benefit if they can find ways to manage
the massive amounts of data (Johnson, 2012; LaValle, Lesser, Shockley, Hopkins, &
Kruschwitz, 2011; T. W. Malone, Laubacher, & Dellarocas, 2010; Eric Bonabeau,
2009; Argamon & Olsen, 2006). Some even go so far as claiming that information
technology is about to revolutionize innovation in all of its facets, by allowing
organisations to tap into these huge amounts of data and exploit them for innovation
purposes (Brynjolfsson, 2010).
Idea generation and innovation management are sources of value creation for
an organisation, due to the low success rate of new product developments
(Goldenberg, Lehmann, & Mazursky, 2001; Di Gangi, Wasko, & Hooker, 2010).
Although idea generation has primarily been the task of company professionals,
crowdsourcing has proven to be a suitable method for generating new ideas (Poetz &
Schreier, 2012). Some even suggest that crowdsourced ideas can in fact outperform ideas created by company professionals on the attributes of novelty and customer benefit (Poetz & Schreier, 2012).
Crowdsourcing is an open innovation model, and a way for the firm to “enrich
the company’s own knowledge base through the integration of suppliers, customers,
and external knowledge sourcing” (Enkel, Gassmann, & Chesbrough, 2009, p. 312).
From an open innovation perspective, one can be misled into believing that no
distinction exists between “open source” and “crowdsourcing” (Albors, Ramos, &
Hervas, 2008). However, this is not true, because for something to become open
source, the crowdsourcer must give everybody permission to modify the product.
Examples of open source include Linux and the programming language R, whereas
examples of crowdsourcing are Threadless, iStockphoto, Innocentive, Amazon’s
mechanical Turk, Youtube, etc. (Brabham, 2008; Estellés-Arolas & González-
Ladrón-de-Guevara, 2012). A recent successful example of this is the Dell IdeaStorm
community, which generates new product ideas from crowdsourcing. This community
is based on a collaborative filtering system where users are suggesting and voting on
the ideas they like. In this way Dell will be able to identify the most popular ideas,
without assigning corporate staff to create and/or assess the ideas. When the most
popular ideas have been identified, the ideas can be distributed to the relevant
departments in Dell and be used as input for developing new products or services
(Poetz & Schreier, 2012; Di Gangi et al., 2010).
1.2 - Objectives

Sufficient research documents the potential of ideas generated by the crowd; however, our review of the academic literature did not reveal whether these ideas can be detected by means other than collaborative filtering, as used in the Dell case.
define our main research question as:
• How are ideas generated in online communities and how can one detect these
ideas by applying text mining and machine learning?
The main objective of this thesis is to assess if one can successfully detect ideas inside
an online community. We consider this relevant as it will allow organizations and
researchers to detect ideas from other sources than crowdsourcing communities.
Detecting ideas without asking people to vote will widen the scope of the
crowdsourcing concept to include all types of online communities. The data source
which we mined was a Lego community called Lugnet1. Lugnet is a company-independent online community, where any person with an interest in Lego can post a
1 http://www.lugnet.com
message. The goal of Lugnet is to unite Lego fans from all over the world, and
anybody can read the messages posted on Lugnet. However, in order to post messages,
one needs to pay a membership fee of $10. The forum consists of 247 sub-forums,
categorized by a variety of brands, products, countries, etc. Lugnet is a company-independent website, and so one can debate whether this is actually a good case of
crowdsourcing. However, Lego can read what happens on the forum, and there will
assumedly be individuals in the forum writing about ideas. This makes Lugnet a good
case for this thesis.
The second objective of this thesis is to assess the extent to which seasonality and
historical events influence idea generation inside online communities; we therefore define our second research question as:
• To what degree do seasonality and historical events influence idea generation
inside online communities?
If one can show that idea generation is not dependent on e.g. seasonal factors, it
leaves room for investigating other factors, e.g. marketing spending, that might
influence idea generation inside online communities.
To answer these two questions we first must address how ideas are generated inside
online communities, in particular how collective intelligence and consumer creativity
influence the development of ideas in online communities. Both fields are included
because collective intelligence theory focuses on group creativity, while consumer
creativity explains the creative ability of individuals. We view collective intelligence
and consumer creativity as the clockwork that generates ideas, so in order to help
answer our main question, one needs a basic understanding of these two concepts.
Therefore we ask as our first sub-question:
• What defines idea generation inside online communities and how are these
ideas generated from a perspective of collective intelligence and consumer
creativity?
Furthermore, our two research questions will benefit from a discussion about the
nature of the data contained in online communities, as well as how one can detect
ideas hiding inside these communities through text mining and machine learning. This
topic is relevant as no academic literature reports on how to detect ideas inside online communities by the means we propose; we therefore ask as our second sub-question:
• What is the nature of data created inside online communities and how can
techniques from text mining and machine learning be combined to detect ideas
generated inside online communities?
1.3 - Methodology
Our approach to answering the main research question is inductive. We argue that if
we can determine which factors influence idea generation inside a single online
community, then this may apply to other online communities of a similar nature.
We rely on quantitative approaches in most facets of our study, in particular text mining and machine learning. We do, however, also rely on qualitative assessments, as we use human judges to construct our target variable.
1.4 - Delimitations and assumptions
We must first acknowledge that we do not have access to what is considered a
crowdsourcing community in its most applied shape, since crowdsourcing
communities are often initiated and owned by a firm that proposes a task. Rather the
online community we use has existed since the mid-1990s, and so allows us to look at
idea generation over a rather long time period. To use our results from a
crowdsourcing perspective, we assume that crowdsourcing communities are
influenced by the same factors as online communities in general. We consider this to
be a reasonable assumption, and one might actually debate whether crowdsourcing and simple online communities are alike (Estellés-Arolas & González-Ladrón-de-Guevara, 2012; Vukovic & Bartolini, 2010; Buecheler, Sieg, Füchslin, & Pfeifer, 2010; Brabham, 2008).
Furthermore, our secondary research question will only assess seasonality and
historical events. We consider this a natural delimitation, due to the nature of the data
we have available.
1.5 - Structure

The theoretical foundations of the thesis will be addressed in chapters two and three.
Chapter two will address collective intelligence and creativity from the perspective of
idea generation within online communities. Chapter three outlines text mining and
machine learning, which are the tools necessary to detect ideas in online communities.
Chapter four defines the aim of our study, chapter five covers methods used and the
study setup, chapter six reports the results and in chapter seven we discuss and
conclude on our results.
2 - Idea generation in online communities

The objective of this chapter is to understand how collective intelligence and
consumer creativity lead to the generation of ideas inside online communities. In
particular, we wish to investigate what defines idea generation inside online
communities, and how these ideas are generated from the perspective of collective
intelligence and consumer creativity. As this can be somewhat theoretical, we will
relate the theory covered to examples of ideas from the online community of Lugnet,
which we will later use as our data source.
2.1 - Online communities
The term “online community” covers several genres of online networks. Personal homepages, message boards, e-mail lists and newsletters, chat groups, weblogs and directories, as well as wikis, are genres that fit under the online community umbrella (Bishop, 2009). The common drivers of these communities are
that they allow people to exchange goods or information, by interacting with other
people in the network (Wilson & Peterson, 2002). Online communities can be seen as
what one would associate with normal communities, where people gather based on a
common interest. However, one of the differences between online communities and
communities in their traditional shape is that online communities are virtual and that
people do not always know each other in real life. Another difference is that since
people do not always know each other, they are not aware of who they might share
interests with inside the community (Faraj, Jarvenpaa, & Majchrzak, 2011). Even
though people inside online communities might not know each other's skills and
interests, online communities have already proved useful for collective creativity and
innovation purposes (Dahlander, Frederiksen, & Rullani, 2008; Tapscott & Williams,
2008).
2.2 - Collective intelligence

The benefit of a crowd’s collective intelligence can be illustrated in terms of the
simple inequality of collective outcome ≥ sum of individual efforts (Fischer,
Giaccardi, Eden, Sugimoto, & Ye, 2005). Collective intelligence is by no means a
new idea (Leimeister, 2010), but what is new is the potential scale of collective
intelligence enabled by Web 2.0. To gain some perspective, one might see traditional
research techniques, like the application of surveys and focus groups, as an attempt to
tap into the collective intelligence of existing or potential customers. If we stay with
this comparison, surveys can be seen as a way of averaging the intelligence of a group
(Segaran, 2007), whereas focus groups provide a different form of collective intelligence, as they allow people to interact and thereby create solutions (Eric
Bonabeau, 2009).
Taking the concept of collective intelligence in its two separate terms of
“collective” and “intelligence”, we achieve a deeper understanding of what the
concept actually contains. The term “collective” describes a group of individuals who
are not required to have the same attitudes or viewpoints (Leimeister, 2010). The term
“intelligence” refers to the ability to learn, to understand, and to adapt to an
environment by using one's own knowledge (Leimeister, 2010). Collective intelligence is
defined as (T. Malone et al., 2009, p. 2)2:
“A group of individuals doing things collectively that seems intelligent.”
This definition captures the essence of collective intelligence, which is that we are
dealing with a group of people who collaborate in order to solve a problem. The
definition does not state that a group of individuals performing collective intelligence
is solving a problem, but this is assumed to be reasonable in most cases (Burroughs,
Morreau, & Mick, 2008).
2.3 - Creativity

An important facet in problem solving is an individual’s creative ability. Especially
when designing new products and being innovative, creativity is important (Sarkar &
Chakrabarti, 2011). According to Burroughs et al. (2008), the concept of creativity
can be separated into the three perspectives of the creative process, the creative
person and the creative outcome.
i. The creative process - the generation of an idea is not a two-step
process, where a task is proposed in the first step and a perfect solution
is proposed in the next. One can instead see the generation of an idea
2 Please note that the source is only a working paper, but we do consider the main author Thomas W. Malone as reliable.
as a process containing four stages: Exploration, incubation,
illumination and verification. In the explorative phase, the individual
searches for known solutions to the problem. In the fixation phase, the
individual decides on a given path that is likely to lead to a solution. In
the incubation phase, the individual becomes unfocussed due to mental
exhaustion leading to new ways of looking at the problem. In the final
phase of insight, the individual reaches for a solution to the problem
(Burroughs et al., 2008).
ii. The creative person - the creative intelligence of an individual can
primarily be defined in terms of abilities, motivations and affect
(Amabile, 1983). Two factors of special importance are the knowledge
background of the person and the person’s motivation. Knowledge
background is domain specific in the sense that an individual might be
very knowledgeable within one domain, but not know anything about
another domain. Motivation can be intrinsic or extrinsic in nature.
Intrinsic motivation can be thought of as the degree an individual is
truly motivated from within to participate in a given creative activity.
Extrinsic motivation is the opposite, as the individual is motivated by
some kind of reward, often of monetary nature (Ryan & Deci, 2000;
Hennessey & Amabile, 2010).
iii. The creative outcome - An individual engaged in a creative process
should at some point generate a creative outcome, or in our case an
idea. The creative outcome can be of varying nature, which implies
that a creative outcome does not have to be, for example, the work of
Einstein or Mozart, but can also include a man getting an idea to fix
his car using nothing other than a hairpin (Hennessey & Amabile,
2010). All are valid examples of individuals in creative processes,
producing a creative outcome, product or an idea (Reisberg, 2010).
The three text passages below are examples extracted from a pool of messages from the online community we pulled our data from. We had two human judges classify pieces of text or messages from within the online community. The
first text displayed below is an example where two out of two judges classified the
text as an idea.
“Actually, I think Winnie the Pooh will be a big hit in the Duplo Market.
My wife is certainly looking forward to the Pooh LEGO.”
The second text passage displayed below is an example where one out of the two
judges classified the text as an idea.
“Long term, the true solution is to move to a development tool that
delivers smaller application-footprints than VB. Which is just about
anything.
I don't know of any time that would definitely be better than any other.
I'd suggest trying in the morning, your time. Most everyone around here
is asleep at that time.”
The final text passage showed below is an example where two out of two judges
classified the text as a non-idea.
“Can I build a Historic Site replica building in this scale
to show more accurate details?”
Returning to our theoretical discussion, defining the degree of creativity is a central problem in idea generation and creativity research, because setting the boundaries for what constitutes an idea is difficult (Kaufmann, 2004). The fact that our judges could not agree on categorizing the second text passage is an example of this problem. The first text passage is a straightforward case of a product idea, because it suggests that there might be a need for “Winnie the Pooh” Lego. The third text passage is also straightforward, as it can be categorized as a question rather than anything else. Our personal opinion about the second text message is that it is an idea, as the terms “solution” and “suggest” occur in the text passage. However, we do recognize that it is not a straightforward case compared to the two other cases.
2.4 - Summary
Online communities can take many different shapes. Common for all online
communities is that they provide a platform for people to exchange goods or
information, by interacting with other people in the network. Within online
communities, idea generation is a type of problem solving, where the collective
outcome is often bigger than the sum of individual efforts. To be more specific, ideas
are often the product of an individual or group’s creativity. Creativity requires the
individual or a group to go through several phases before an idea or a solution is
reached. Creative ability is mainly determined by the two factors - domain knowledge
and motivation (both intrinsic and extrinsic). Finally, the creative outcome, or the idea, can take many shapes and can be difficult to assess.
3 - Detecting ideas

In this chapter we investigate how idea generation inside online communities can be detected through text mining and machine learning. We will do this by answering what the nature of data created inside online communities is, and which text mining and machine learning techniques need to be combined in order to detect ideas generated in online communities.
3.1 - The nature of data in online communities
When we apply the term data, we are dealing with information stored digitally in a structured or an unstructured format, and as stated in the introduction we are operating in a Web 2.0 domain. Operating in a Web 2.0 domain also means that one will be handling large amounts of data (Bughin, Chui, & Manyika, 2010). One might say that data becomes big data when it “is too big for conventional systems to handle” (Gobble, 2013, p. 64). How data becomes big depends on three dimensions - volume, frequency and variety (Gobble, 2013). In this context online communities score high
on these three dimensions, as the data quantity is often large, it changes every time
people use the community, and it is to a high degree unstructured (as most of the data
is text-based). To analyse this type of data requires the use of big data analytics, like
text mining and machine learning (Hsinchun Chen et al., 2012).
3.2 - Text mining and natural language processing
To better understand the concept of text mining, one can turn towards the concept of natural language processing (NLP). NLP was originally a mixture of artificial intelligence and linguistics, taking its beginning in the 1950s. Word-for-word machine translation provides a good example of homography, a common problem in NLP, as one word can have different meanings depending on context. The complex nature of NLP led to a shift of focus in the 1980s, towards extracting semantics, or meaning, to a higher degree. This shift ultimately meant the birth of statistical NLP, including the use of machine learning and statistics for NLP purposes, as well as the idea of annotated corpora for use in machine learning (Nadkarni, Ohno-Machado, & Chapman, 2011). An annotated corpus can be seen as
the equivalent of a training set in traditional data mining terms.
Text mining can be defined as the “knowledge-intensive process in which a
user interacts with a document collection over time, by using a suite of analysis tools”
(Feldman & Sanger, 2006, p. 1). The aim of text mining is to “extract useful
information from data sources through the identification and exploration of
interesting patterns” (Feldman & Sanger, 2006, p. 1). This definition is similar to the
definition of data mining (see Linoff & Berry, 2011 for further explanation of data
mining), but whereas data mining is based on data stored in database records,
normally structured by rows and columns, text mining is based on unstructured text
data, stored inside collections of documents.
The task of turning unstructured text into rows and columns results in a bag-
of-words for a given document (Erk & Padó, 2008). This allows one to apply
conventional machine learning methods to model a given concept, which is often represented by a target variable within a training set, depending on whether one is doing
supervised or unsupervised machine learning (Dharmadhikari, Ingle, & Kulkarni,
2011).
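To make the bag-of-words representation concrete, the following minimal sketch (our own illustration, not code from the thesis; it assumes the Python scikit-learn library and a few invented example messages) turns a small collection of messages into a document-term matrix with one row per document and one column per term.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented example messages standing in for posts from an online community.
messages = [
    "I think a Winnie the Pooh set would be a big hit in the Duplo market",
    "The true solution is a development tool with a smaller footprint",
    "Can I build a replica building in this scale",
]

# CountVectorizer tokenizes each message and counts term occurrences,
# turning unstructured text into the rows-and-columns bag-of-words form.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
dtm = vectorizer.fit_transform(messages)

print(vectorizer.get_feature_names_out())  # terms = columns
print(dtm.toarray())                       # one row per document, one count per term
```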
The bag-of-words concept can be seen as one extreme end of a scale, while the
NLP approach is at the other end of the scale. Bag-of-words is very naive in its
nature, as it simply ignores grammar and any relations between the words. NLP on
the other hand, tries to capture semantics, represented by a set of words within a
document. The relationship between the bag-of-words approach and the NLP
approach is a trade-off, and it is important to stress that it is not necessarily a decision
about using one or the other, as there are techniques that allow one to capture more
meaning than the simplest bag-of-words approach will allow (Linoff & Berry, 2011).
Simple text processing can be seen as the removal of numbers, punctuation marks, stopwords and whitespace, stemming, tokenization, pruning, the conversion of upper case letters to lower case letters, the use of n-grams, and the choice of term weighting scheme. Tokenization refers to the splitting of a document into smaller segments, which are often just single terms, and is an implicit step in the bag-of-words approach (Feinerer, Hornik, & Meyer, 2008; Zanasi, 2007). Some of these tasks are self-explanatory, but pruning, stemming, n-grams and the choice of term weighting scheme require elaboration.
i. Pruning: the task where a word is deleted based on the number of
times it occurs within a document collection, be it extremely few or
many times. This means that one sets an upper and lower threshold for
how many times a given term should occur in order to be a part of the
analysis. As an example, if one sets the upper pruning level to 0.99, this
means that all terms occurring in more than 99% of the messages will
be omitted from the analysis. Pruning is a very aggressive way of
deleting features, and therefore one needs to be careful when setting
pruning boundaries. Pruning can be seen as a necessary evil, as even a
very small collection of documents will create many unique terms,
leading to a lot of noise as well as increased computation time.
ii. Stemming: The cutting of a word down to its stem, with a minimal loss
of information. Stemming can be seen as a dimensionality reduction
method (Linoff & Berry, 2011).
iii. N-grams: A more advanced processing method that extracts ordered
sets of terms or characters. Instead of having tokens that only contain
a single term, a token can contain a single term as well as several terms. One
can look at this step as taking the bag-of-words approach a step in the
direction of the NLP approach (Zanasi, 2007).
iv. Term weighting: refers to the numerical representation of terms in
the bag-of-words. In general, a good term weighting metric should
discriminate one unstructured text source from another. There are a
variety of weighting schemes, such as binary term occurrences, term
occurrences, normalized term frequency and term frequency inverse
document frequency. The binary weighting scheme assigns the value of
either one or zero to a term, regardless of how many times it occurs in
a particular document. Term occurrences can take integer values
ranging from zero to the total number of terms inside each document.
Term frequency comes in several variants, one version takes the length
of the document into account by normalizing the frequency count with
the square root of the total number of terms in that document (Salton &
Buckley, 1988). Term frequency inverse document frequency is an
expression of how discriminating terms are in comparison to the whole
document collection (Zanasi, 2007; Feinerer et al., 2008).
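The small sketch below illustrates how the weighting schemes just described change the numerical representation of the same documents. It is our own illustration under the assumption that scikit-learn is available; the toy messages are invented and the thesis itself does not prescribe this code.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented toy messages; real input would be the preprocessed community posts.
messages = [
    "new lego castle set idea for a castle theme",
    "new train set released this year",
    "question about shipping costs",
]

# Binary term occurrences: 1 if the term appears in the document, 0 otherwise.
binary_dtm = CountVectorizer(binary=True).fit_transform(messages)

# Term occurrences: raw counts, bounded only by the length of the document.
count_dtm = CountVectorizer().fit_transform(messages)

# Term frequency-inverse document frequency: counts reweighted so that terms
# occurring in many documents receive less weight (they discriminate less).
tfidf_dtm = TfidfVectorizer().fit_transform(messages)

print(binary_dtm.toarray())
print(count_dtm.toarray())
print(tfidf_dtm.toarray().round(2))
```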
3.3 - Machine learning in the text classification domain
Machine learning can be seen as a combination of mathematics, statistics, software
engineering, and computer science (Conway & M. White, 2012). It is “concerned
with turning data into something practical and usable” (Conway & M. White, 2012,
p. 1). In particular, one can see machine learning as the problem of making inferences
about a given concept (Witten & Frank, 2005). The strength of machine learning is
that it allows us to infer patterns over large quantities of data, as one lets machines do what they do best, namely calculations (Linoff & Berry, 2011). A common weakness of machine learning is that the models learned tend to overfit or overgeneralize the concept.
3.3.1 - The imbalanced learning problem
A general problem in machine learning arises when the class distribution of the target variable is severely skewed. One can refer to this problem as the imbalanced learning problem. Defining when a dataset becomes imbalanced between classes is relative, but thresholds of 100:1, 1,000:1 and 10,000:1 are reported as common (He & Garcia,
2009). Although skewness happens for several reasons, it commonly happens because the classes are skewed by nature and/or it is very costly to collect data from one of the classes. A consequence of a skewed distribution of the target variable is typically a bias towards the majority class (Kao & Poteet, 2007). This does not mean that one cannot use skewed datasets, and it has long been the assumption that classification algorithms should be trained on data with a distribution similar to the one that occurs naturally (Weiss & Provost, 2001). It has, however,
been shown that balancing the dataset has a positive effect on classifier performance
(Weiss & Provost, 2003). Several methods exist to solve the problem of class
imbalance. Two of the most basic and widely described techniques are random
undersampling and random oversampling (Kotsiantis, Kanellopoulos, & Pintelas,
2006). Random undersampling removes cases from the majority class at random,
whereas random oversampling resamples cases from the minority class at random, in
order to create a more balanced relationship between classes. A disadvantage of
random undersampling is that one risks throwing away valuable information (Xu-
Ying Liu, Jianxin Wu, & Zhi-Hua Zhou, 2009), whereas a disadvantage of random
oversampling is that one risks overfitting the data (Chawla, 2010). In the context of
undersampling against oversampling, some research findings indicate that
undersampling performs best (Drummond & Holte, 2003). Other sources report that a
combination of the two techniques can be used (Estabrooks, Jo, & Japkowicz, 2004).
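As a rough illustration of random undersampling and oversampling, the sketch below balances an invented, skewed label vector using plain NumPy. It is a sketch of the general idea only, not the resampling procedure used later in the thesis.

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented imbalanced target: 1 = "idea" (minority), 0 = "non-idea" (majority).
y = np.array([0] * 950 + [1] * 50)
X = rng.normal(size=(len(y), 10))   # placeholder feature matrix

minority = np.where(y == 1)[0]
majority = np.where(y == 0)[0]

# Random undersampling: keep only as many majority cases as there are minority cases.
kept_majority = rng.choice(majority, size=len(minority), replace=False)
under_idx = np.concatenate([minority, kept_majority])
X_under, y_under = X[under_idx], y[under_idx]

# Random oversampling: resample minority cases with replacement up to the majority size.
extra_minority = rng.choice(minority, size=len(majority), replace=True)
over_idx = np.concatenate([majority, extra_minority])
X_over, y_over = X[over_idx], y[over_idx]

print(np.bincount(y_under))   # balanced, but with fewer cases in total
print(np.bincount(y_over))    # balanced, with minority cases repeated
```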
3.3.2 - Feature selection
Feature selection is a way of reducing dimensionality. The concept falls into two
categories - statistical feature selection and arbitrary feature selection. Statistical
feature selection is the technique whereby the distribution of feature counts between classes is used to assign weights to each feature. Based on these weights, the features with the highest weights are used to train the classifier. Arbitrary feature selection refers to the processing steps of stopword removal, stemming and pruning (Yu, 2008), which were
discussed earlier. Statistical feature selection chooses features that discriminate best,
based on the distribution of terms between the classes of the target variable. Feature
selection can lead to increased model performance, both in terms of accuracy and the
generalizability of a model, because too many features can increase the likelihood of overfitting (Yang & Pedersen, 1997). Even though we have not yet presented the support vector machine classifier, it deserves to be mentioned here that it is debated whether support vector machines actually benefit from feature selection (Guyon, Weston, Barnhill, & Vapnik, 2002), as some claim that support vector machines are so robust to overfitting that they do not need any feature selection technique (Forman, 2003). The
outcome of applying feature selection is highly dependent on the nature of the data,
meaning that one will need to experiment, as there is no generic rule for applying
feature selection (Kao & Poteet, 2007). We will restrict ourselves to the two feature
selection methods of information gain and chi-square. We choose these two methods,
as both have been used within text classification with good results (Zhang, Zhu, &
Yao, 2004; Zheng, Wu, & Srihari, 2004; Forman, 2003; Yang & Pedersen, 1997).
Information gain and chi-square are functions of the four measures of A, B, C
and D. A is an integer count of messages belonging to a given class where a given term occurs at least once. B is an integer count of messages not belonging to the class where the term occurs at least once. C is an integer count of messages belonging to the class where the term does not occur. D is an integer count of messages not belonging to the class where the term does not occur. N is the total number of documents.
Information gain can be calculated by the following equation (Kao & Poteet, 2007):
Equation 1 - Information gain
$\mathrm{Information\ gain} = -\frac{A+C}{N}\log\frac{A+C}{N} + \frac{A}{N}\log\frac{A}{A+B} + \frac{C}{N}\log\frac{C}{C+D}$ (1)
Chi-square can be calculated by the equation (Kao & Poteet, 2007):

Equation 2 - Chi-square
$\chi^2 = \frac{N(AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)}$ (2)
Based on a given feature selection technique, one can then calculate how good a
discriminator a given term is for a given class. In order to sort out noise one can then
order the features by the respective weights calculated, and then only use the best
features for modelling.
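A small sketch of how Equations 1 and 2 can be computed per term from a binary document-term matrix is given below. The implementation and the toy data are our own assumptions, not code from the thesis; a small constant is added inside logarithms and denominators purely to avoid division by zero.

```python
import numpy as np

def feature_scores(X_binary, y):
    """Per-term information gain and chi-square (Equations 1 and 2) computed
    from a binary document-term matrix X_binary and binary class labels y."""
    X = np.asarray(X_binary, dtype=float)
    y = np.asarray(y)
    N = float(len(y))

    A = X[y == 1].sum(axis=0)          # in class, term present
    B = X[y == 0].sum(axis=0)          # not in class, term present
    C = (y == 1).sum() - A             # in class, term absent
    D = (y == 0).sum() - B             # not in class, term absent

    eps = 1e-12                        # guards against log(0) and division by zero
    ig = (-(A + C) / N * np.log((A + C) / N + eps)
          + A / N * np.log(A / (A + B + eps) + eps)
          + C / N * np.log(C / (C + D + eps) + eps))
    chi2 = N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D) + eps)
    return ig, chi2

# Invented toy data: 5 documents, 3 terms, labels 1 = idea, 0 = non-idea.
X = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0], [1, 0, 1]])
y = np.array([1, 1, 0, 0, 1])
ig, chi2 = feature_scores(X, y)
print(ig.round(3), chi2.round(3))      # rank terms by either score and keep the best
```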
3.3.3 - Topic modelling
In the domain of machine learning and natural language processing, probabilistic
topic modelling can be seen as the task of assigning topic labels to text documents. This labelling should not be confused with the labelling one does when doing classification, because the label has no meaning before a human being has
done the labelling of the topics. As an example one can perform topic modelling on a
collection of text documents and the result of the analysis could then be ten topics
named topic one to topic ten. Each topic would then contain a list of words assigned
with probability estimates. The probability estimates can then be interpreted as the
likelihood that a word is contained in the given topic. One can order the list by the
probability estimates and use the words with the highest probability estimates to
assign meaningful labels to the topic. One would then have to look at each list of
words and use reasoning to manually assign topics to each list of words. To a topic containing the top five words “song”, “guitar”, “sound”, “play” and “drums”, one could then assign the label “music”. Similarly, to a topic where the top five words are “ball”, “player”, “play”, “attacker” and “goal”, one could assign the label “football”. One may notice that the word “play” occurs in both topics. This is referred to as the problem of polysemy, which means that one word can have multiple meanings depending on the topic. This means that one will have to look at the other words in order to decide on the meaning of the word with multiple meanings (Steyvers & Griffiths,
2007).
Topic models come in several variants. Common to all topic models is that they apply term frequencies as term weights, which also implies the bag-of-words assumption. Two variations are the latent Dirichlet allocation and the correlated topics model. The main difference between the two types of models is that the latent Dirichlet allocation assumes no correlation between topics. The correlated topics model allows topics to be correlated, which one can argue is more realistic (Grün & Hornik, 2011). We choose the latent Dirichlet allocation because we consider the results of this model to be the easiest to interpret in our case.
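The sketch below fits a latent Dirichlet allocation model to a toy corpus and prints the highest-probability words per topic, which a human would then label manually as described above. It assumes scikit-learn's LatentDirichletAllocation and invented documents; the thesis's own analysis (which cites Grün & Hornik's topicmodels package) may well have been set up differently.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented toy corpus; real input would be the bag-of-words of community messages.
docs = [
    "guitar song sound play drums",
    "song play guitar sound band",
    "ball player play attacker goal",
    "player goal ball match attacker",
]

# Topic models take term frequencies (counts) as input, per the bag-of-words assumption.
counts = CountVectorizer()
X = counts.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = counts.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {top_terms}")   # a human still assigns labels such as "music"
```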
3.3.4 - Classification algorithms
Several algorithms are available for text classification, including k-nearest
neighbours, naïve Bayes, decision trees, decision rules and support vector machines
(Dharmadhikari et al., 2011). For this paper, we choose to focus on support vector
machines and naïve Bayes algorithms.
Support vector machines
Support vector machines were developed by Corinna Cortes and Vladimir Vapnik in
1995. Originally support vector machines were named support vector networks.
Support vector machines build upon the optimal margin classifier algorithm (Boser,
Guyon, & Vapnik, 1992). The optimal margin classifier identifies the separating hyperplane with the widest margin between two classes. The wider the margin, the better the generalization ability of the classifier. In mathematical terms this can be defined as the problem of minimizing the objective function (Ben-Hur & Weston, 2010; Scholderer, 2013):

Equation 3 - Optimal margin classifier objective function
$f(\mathbf{w}) = \frac{1}{2}\lVert\mathbf{w}\rVert^2$ (3)
In their 1995 paper, Cortes and Vapnik introduce the support vector machine and the notion of soft margins, also called slack variables. A soft margin allows for class
overlap. If the cases cannot be perfectly separated, the support vector machine
algorithm seeks to identify a hyper plane able to minimize the overlap of the cases
(Hastie, Tibshirani, & Friedman, 2008). If the objective function is rewritten to
include slack variables, we get a new objective function able to deal with non-linearly separable cases. We still want to minimize the objective function (Ben-Hur & Weston, 2010; Scholderer, 2013):
Equation 4 - Objective function of support vector machine with slack variables
$f(\mathbf{w}) = \frac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_{i=1}^{n}\xi_i$ (4)
One can apply the linear variant of a support vector machine, whose objective function we have just defined. In its linear form, the support vector machine can be
seen as the equivalent of a linear discriminant function (Ben-Hur & Weston, 2010;
Hastie et al., 2008). One can also use a kernel. Support vector machines utilizing a
kernel transformation, work in the way that in the case of non-linearly separable data,
the data are transformed into a higher dimensional feature space (Cortes & Vapnik,
1995). The transformation is chosen beforehand, and the final result is a linearly
separable decision boundary. This transformation is also known as the “kernel trick”.
A kernel can take many different shapes, such as the Gaussian kernel, which is also known as the radial basis function (Conway & M. White, 2012). We will limit
ourselves to the linear version of the support vector machine and the radial basis
version. The radial basis transformation is mathematically defined as (Ben-Hur &
Weston, 2010; Scholderer, 2013):

Equation 5 - Radial basis transformation
$k_{\mathrm{RBF}}(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \lVert\mathbf{x}_i - \mathbf{x}_j\rVert^2\right)$ (5)
An advantage of the radial basis kernel is that with properly defined tuning
parameters, it can take the shape of the linear support vector machine (Keerthi & Lin,
2003). What, however, makes the linear kernel worth considering is that it only requires
the optimization of one parameter, whereas the radial basis kernel requires the
optimization of two parameters, adding to the computational costs (Hsu, Chang, &
Lin, 2010). A way to work with these two versions is to use the linear support vector
machine as a baseline, and use the radial basis kernel to try and improve performance
of the model at the end of the modelling phase (Ben-Hur & Weston, 2010).
The linear support vector machine depends on the parameter of the soft
margin constant C, whereas the radial basis kernel depends on both the soft margin
constant C and the hyperparameter γ (Varewyck & Martens, 2011). The function of the C parameter does not differ between the linear and the radial basis version. Decreasing C widens the margins to provide more generalized results and less overfitting of the classifier (Ben-Hur & Weston, 2010). The C parameter plays the same role
regardless of kernel or no kernel, but the radial basis kernel also takes into account γ.
When optimizing γ one allows flexible margins, but as with the C parameter, this
might also lead to overfitting. The effect of changing γ follows the same rule as the C parameter, which is that the lower γ, the better the algorithm is at generalizing (Ben-
Hur & Weston, 2010).
The training of an optimal support vector machine classifier requires one to
make several decisions about how to train the classifier. These decisions concern how to
prepare the data, what kernel to use, and the parameters of the support vector machine
and the kernel. Data preparation is not different from the normal steps one undertakes
in order to prepare data for analysis (Ben-Hur & Weston, 2010). Deciding on what
kernel to use depends on the nature of the data and the underlying relationships between
the features.
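As an illustration of the tuning decisions just described, the sketch below compares a linear support vector machine (only C tuned) with a radial basis kernel (C and γ tuned) on invented data, using scikit-learn. It is a generic sketch, not the model configuration used in the thesis.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                   # placeholder document-term features
y = (X[:, 0] + X[:, 1] ** 2 > 0.5).astype(int)   # invented, non-linearly separable labels

# Linear baseline: only the soft-margin constant C needs to be tuned.
linear = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1, 10]}, cv=5)
linear.fit(X, y)

# Radial basis kernel: both C and gamma are tuned, at additional computational cost.
rbf = GridSearchCV(SVC(kernel="rbf"),
                   {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
rbf.fit(X, y)

print(linear.best_params_, round(linear.best_score_, 3))
print(rbf.best_params_, round(rbf.best_score_, 3))
```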
The naïve Bayes classifier
The naïve Bayes algorithm is an alternative approach to classification compared to
support vector machines. Naïve Bayes is considered a solid approach to text
classification tasks because it can handle large amounts of features, much like support
vector machines (Dharmadhikari et al., 2011). The naïve Bayes classifier is based on
Bayes theorem (Han & Kamber, 2006):

Equation 6 - Bayes theorem
$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$ (6)
The naïve Bayes classifier works in the way that when given a training set labelled
with class occurrences, it first calculates the prior probability of a given document
belonging to a given class. The prior probability, expressed by P(H), is the probability
that a document will belong to a class regardless of the contents of the document.
Secondly the classifier calculates the posterior probability of X conditioned on H
expressed by P(X|H). This posterior probability can be thought of as the probability that
a document contains certain terms, given that we know that the document belongs to a
given class. Thirdly the prior probability P(X) is calculated. The prior probability can
be thought of as the probability that a document from our corpus contains a set of
terms. P(X) is assumed to be constant for all classes. Performing these calculations
and applying Bayes formula enables one to calculate the posterior probability of H
conditioned on X. This can be thought of as the probability that a given document belongs to a
certain class, based on the terms the document contains (Han & Kamber, 2006).
The naïve Bayes classifier comes in two different variations - the multivariate
Bernoulli model and the multinomial model. The multivariate Bernoulli model
applies binary word vectors as representation of the document. This means that a
given term is represented either as zero or one. Zero means that the term is not
contained inside a given document, one means that a term appears at least once inside
a given document, e.g. if a term appears five times inside a document, it will be
assigned the value of one. The multinomial model uses word frequencies as input.
This means that instead of assigning only values of zero and one to a given term, a
term can be assigned an integer value from zero to the length of the document. In this
way, if a given term occurs five times in a document, it is assigned the value of five.
Naïve Bayes rests on the assumption that terms inside a document are
independent of one another. This assumption is unrealistic in real world settings.
Despite this, the naïve Bayes classifier often performs well (McCallum & Nigam,
1998). It has been shown that both versions of the naïve Bayes classifier perform equally well with small vocabularies, whereas the multinomial model performs better as the size of the vocabulary increases. We choose naïve Bayes over k-nearest neighbours and the other learning algorithms because of its ability to perform well on training sets with few cases (Dharmadhikari et al., 2011).
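The sketch below contrasts the multivariate Bernoulli and multinomial naïve Bayes variants on a few invented training messages, assuming scikit-learn; it only illustrates the difference in input representation described above and is not the thesis's own model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Invented training messages: label 1 = idea, 0 = non-idea.
messages = [
    "i suggest a new pooh duplo set",
    "my solution is a tool with a smaller footprint",
    "when will my order ship",
    "what scale is this model",
]
labels = [1, 1, 0, 0]

counts = CountVectorizer().fit(messages)
X = counts.transform(messages)

# Multivariate Bernoulli model: BernoulliNB binarizes the counts internally,
# so a term is only ever "present" or "absent" in a document.
bernoulli = BernoulliNB().fit(X, labels)

# Multinomial model: uses the raw term frequencies as input.
multinomial = MultinomialNB().fit(X, labels)

test = counts.transform(["i suggest a new train set"])
print(bernoulli.predict_proba(test))     # posterior P(class | document) for each class
print(multinomial.predict_proba(test))
```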
3.3.5 - Performance measures
As an introduction to model performance measures, we start with the confusion
matrix, as this is at the heart of model assessment. In short, the confusion matrix
displays the combination of true positives (TP), false positives (FP), true negatives
(TN) and false negatives (FN) in a two by two matrix (for binary classification). The
values on the main diagonal are the correct predictions, whereas the upper right and lower left quadrants are misclassifications (Chawla, 2010).
The receiver operating characteristic curve (ROC curve) is created by plotting
the true positive rate against the false positive rate, which is the same as sensitivity
plotted against 1-specificity. Assessing the ROC curve visually enables one to assess
how well a model performs, but as a check on performance it cannot stand alone
(Chawla, 2010).
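A minimal sketch of how the ROC curve is obtained from true labels and classifier scores is shown below, assuming scikit-learn and invented values; plotting the resulting false positive rate against the true positive rate yields the ROC chart.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Invented true labels and classifier scores (e.g. predicted probabilities).
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])

# The ROC curve is the true positive rate (sensitivity) plotted against the
# false positive rate (1 - specificity) over all classification thresholds.
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(np.column_stack([fpr, tpr]))
print(roc_auc_score(y_true, scores))   # area under the curve as a single summary
```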
Accuracy can be used as an overall measure of performance, and is simply the
number of correctly classified cases as a ratio of total cases. In the case of evenly
distributed target classes, accuracy is a solid measure of model performance, but it
should not stand alone because it neglects to assess a model’s ability to predict
positive and negative cases. Accuracy is defined as (Witten & Frank, 2005):

Equation 7 - Accuracy
$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ (7)
Recall can be interpreted as how good the model is at finding all the cases within a
given class. Precision can be interpreted as how big a part of the cases within a given
class actually belong to that class. Precision and recall are of special importance, as
this tells us to what degree we can trust our model to classify positive cases correctly,
and depending on the task, one might want to prefer higher recall or higher precision.
Recall and precision are defined as (Witten & Frank, 2005):
Equation 8 - Recall
Recall = TP / (TP + FN)    (8)
Equation 9 - Precision
Precision = TP / (TP + FP)    (9)
The F-measure seeks to balance precision and recall, and serves as a good
performance measurement in the case of an unbalanced dataset. If one trains a model
without getting an increase in performance on both recall and precision, one is just
adjusting the trade-off between precision and recall (Forman, 2003), whereas the F-
measure provides a single measurement of this trade-off. The F-measure is defined as
(Witten & Frank, 2005):
Equation 10 - F-measure
F = (2 × Recall × Precision) / (Recall + Precision)    (10)
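As a small, self-contained illustration of equations 7 to 10, the R sketch below computes the four measures directly from confusion-matrix counts; the function name is ours, and the example counts are taken from the final-model column of Table 4 further below.

# Performance measures from confusion-matrix counts (equations 7-10)
performance_measures <- function(tp, fp, fn, tn) {
  accuracy  <- (tp + tn) / (tp + tn + fp + fn)
  recall    <- tp / (tp + fn)
  precision <- tp / (tp + fp)
  f_measure <- (2 * recall * precision) / (recall + precision)
  c(accuracy = accuracy, recall = recall, precision = precision, F = f_measure)
}

# Example with the counts later reported for the final model in Table 4
performance_measures(tp = 31, fp = 20, fn = 20, tn = 379)
# accuracy 0.911, recall 0.608, precision 0.608, F 0.608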
3.4 - The power of different text classification methods
This section will identify and comment on binary text classification task literature.
Until now, we have identified and discussed different text mining and machine
learning processing and modelling tools. It is important to mention to our readers that
the aim of this literature study was never to make perfect guesses about the choice of
modelling parameters from the very beginning. Rather, we wished to set boundaries
for the approach we will later take to modelling the concept of an idea. In the studies
chosen we will identify the aim of study, processing methods, term weighting
schemes, feature selection methods, performance assessment measures and choice of
classification algorithm. We will rank the methods according to performance in the
given study. At the end of the section we comment on any shortcomings of the
studies and summarize our findings.
Starting with the most recent study, Yu (2008) showed that support vector
machines and naïve Bayes perform equally well classifying sentimental novels, while
naïve Bayes outperforms the support vector machines on an erotic poem task. The
setup was binary term weighting schemes for both classifiers, term frequency for the
naïve Bayes, normalized term frequency for the support vector machine and term
frequency inverse document frequency for the support vector machine. For
processing, stopword removal and stemming were used. For feature selection, the
support vector machine used support vector weights and naïve Bayes used the odds ratio.
The author argued that stemming eliminated highly discriminative terms, reducing
performance of the support vector machine.
Lai (2007) compared the support vector machine, naïve Bayes and k-nearest
neighbours for text classification, where the support vector machine outperformed all
other methods, while naïve Bayes performed consistently. The setup was variations of
stemming and stopword removal; these showed no significant effect of stemming and a
small improvement from stopword removal. Two weighting schemes were applied: one
weighting scheme was used as baseline and was not reported, whereas the term
frequency inverse document frequency score was reported together with the support
vector machine. This combination gave a slight improvement in performance.
Zhang, Zhu, and Yao (2004) investigated the performance of the five
classification methods of support vector machines, maximum entropy model,
memory-based learning, naïve Bayes and ada boost. Maximum entropy models,
nearest neighbours and ada boost we will not comment on further, but we
acknowledge that they were a part of the study. The results showed that the support
vector machine outperformed the other methods, while naïve Bayes underperformed.
The setup reported was document frequency, information gain and chi-square for
feature selection.
Sculley and Wachman (2007) did a spam classification study with support
vector machines, trying to prove that focussing too much attention on tuning C is
unnecessary, as it does not increase performance to such a degree that it can make up
for the extra computational costs incurred. The setup was no feature selection and a
binary weighting scheme. This paper is relevant because the character versions of 3-
grams and 4-grams were a part of the setup. 4-grams gave the best performance,
supporting the use of n-grams in the modelling process.
Another spam classification task was undertaken by Webb, Chitti, and Pu (2005).
Here support vector machine, naïve Bayes and logit boosting algorithms were
compared against a well-known spam filter called Spam Probe. The support vector
machine and logit boosting generally performed better than naïve Bayes, but only by
a small margin. The setup was information gain for feature selection, and no
stemming or stopword removal was applied. We noticed that only one feature
selection method was applied and no weighting scheme was reported.
Critical assessment
Taking a critical stance on the reviewed studies, we noticed the lack of information
about feature selection methods in the Sculley and Wachman (2007) and Lai (2006)
papers. Even though we feel it is a downside of both studies, we cannot claim that this
missing information is problematic, as one does not need to apply a feature selection
technique. However, missing information about the term weighting scheme in the
Webb et al. (2005) paper and the missing information about processing steps in Webb
et al. (2005) and Zhang et al. (2004) does deserve criticism, as the choice one makes
on these parameters can have a significant impact on classifier performance. It also
deserves critique that Yu (2008), Sculley and Wachman (2007) and Lai (2006) apply
accuracy as a metric of performance without stating the distribution of the target
classes in their datasets. An overview of the applied setup with regards to
classification algorithm, processing steps, choice of term weighting scheme, feature
selection method and performance measures is shown in Table 1.
Summary
From these studies we learned that the classifiers of support vector machines and
naïve Bayes seem to be widely applied with good results. There seems to be no
consistency in the text processing steps, whereas the term frequency inverse document
frequency score and the binary weighting scheme seem to be the preferred choices of
term weighting scheme. There seems to be a minor preference towards information
gain for feature selection, whereas the choice of the chi-square statistic for feature
selection is more doubtful because it is supported by only one study. For
performance measurement, the F-measure, accuracy, precision and recall are also
widely used. We realize that other performance measurements were also applied in
the studies, but we limit ourselves to the ones chosen.
Table 1 - Overview of the studies reviewed

Paper | Aim of study | Classification algorithms | Processing | Term weight | Feature selection | Performance measures | Performance ranking
Yu (2008) | Classification of novels and erotic poems | SVM, NB | Stemming, stopword removal | BIN, TF, NTF, TF-IDF | SVM weights, OR | Accuracy | NB, SVM
Sculley & Wachman (2007) | Spam | SVM | Words, 2-grams, 3-grams, 4-grams | BIN | None | Accuracy, Precision, Recall, F | SVM
Lai (2006) | Spam | SVM, NB, K-NN | Stemming, stopword removal | TF-IDF | Not reported | Accuracy | SVM, NB, K-NN
Webb et al. (2005) | Spam | SVM, NB, LB | NA | Not reported | IG | WAcc | SVM, LB, NB
Zhang et al. (2004) | Spam | SVM, MEM, ADA, K-NN, NB | Not reported | BIN | DF, IG, CHI | Precision, Recall, F, WAcc, TCR | SVM, MEM, ADA, NB, MBL

Abbreviations: SVM = Support vector machine, NB = Naïve Bayes, K-NN = K-nearest neighbours, LB = Logit boosting, MEM = Maximum entropy model, ADA = Ada boosting, BIN = Binary weighting scheme, TF = Term frequency, NTF = Normalized term frequency, TF-IDF = Term frequency inverse document frequency, OR = Odds ratio, IG = Information gain, DF = Document frequency, CHI = Chi-square, WAcc = Weighted accuracy, TCR = Total cost ratio
3.5 - Summary
In answer to sub question two, we state that the data created inside online
communities is characterized as being in an unstructured format, and is often big in
terms of volume, frequency and variety. This means that one must apply text mining
and machine learning in order to model the concepts of interest. One can organize
unstructured data by means of a bag-of-words model. If one wishes to transform
unstructured textual data into structured data, then a variety of processing steps must
be undertaken, including pruning, tokenization, stemming, creation of n-grams and
choice of term weighting scheme. The results within the reviewed literature are
inconclusive when it comes to a choice between stemming, stopword removal and n-
grams, but the weighting schemes of term frequency inverse document frequency and
binary weighting scheme are well supported.
When applying machine learning one needs to be aware of imbalance in the
target variable. Class imbalance can be partially solved by random oversampling
and/or random undersampling. One must choose which feature selection methods to
include, as they can enhance performance and prevent overfitting. Methods supported by
the literature include information gain and the chi-square statistic. Also, one must choose
a classification algorithm, where we pointed out support vector machines and naïve
Bayes as alternatives. To assess the performance of the trained classifier one must
decide which measures to assess, especially when assessing an unbalanced training or
test set. The F-measure especially makes a good performance measurement, but one
might also use accuracy, recall, precision and ROC assessment.
4 - Aims of study
In light of the reviewed theory as well as our main and secondary research questions,
we have set up a study to assess how the concept of an idea can be captured inside a
target variable. Based on this target variable, we will use text mining and machine
learning to detect ideas generated in our online community of Lugnet. Having
detected the ideas, we will assess to what degree seasonality and historical events
influence idea generation inside our online community.
Our study is divided into three parts. Firstly, we will create a training set that
captures the concept of an idea within its target variable. Secondly, we will use the
training set to train a classifier to detect ideas from messages contained within our
online community. In order to do so, we will choose the best settings for two
particular types of classifiers (support vector machines and naïve Bayes), and finally
choose the best classifier given the settings chosen. Lastly, we will use the trained
model to detect ideas generated in the online community of Lugnet and determine to
what degree idea generation inside the online community can be explained by
seasonality and historical events.
5 - Method
The first section of this chapter will describe how we mined the data from Lugnet.
The second section describes how the training set and target variable were created. As
mentioned already, ideas are complex and therefore it becomes relevant to assess the
reliability of the concept we extract. The importance of reliability should be seen
relative to the importance of having a training set of sufficient balance. We created
our target variable through several rounds of manual classification, and assessed
reliability through the measure of Cohen's kappa. The third section outlines how the
concept of an idea was modelled; that is, how the right combination of weighting
schemes was selected, and how the processing steps and feature selection method in
combination with a classification algorithm were chosen. This process is important in
order to create the best possible model. The fourth and final section describes how we
used the model to filter the entire document collection and build a dataset for assessing
the variations of idea generation in the online community, and how and by which
means this was done.
5.1 - Mining the online community of Lugnet
At the time we downloaded the forum, it had a volume of 529,040 messages, and each
individual message contains a variety of information. An example of a message
displayed by a regular internet browser is shown in Appendix A, whereas the same
message in .eml format and .txt format is shown in Appendix B and Appendix C.
Each message contains the text together with the metadata belonging to that text. We
were only interested in the text, the unique identifier, and the date of each post. The
unique identifier is the Message-ID, which for a random post is of the form
<[email protected]>, and the date is of the format Tue, 29 Sep 1998
18:51:35 GMT for the same random message. We would need the unique identifier in
order to merge our predictions from our model with the dates.
As mentioned, the forum had a total of 529,040 messages, which we stored as .eml
files in different folders according to their sub-forum. Lugnet's messages were
downloaded as separate .eml files and handled by the software package R. The files
were stored under the subject of the individual message. In order to give all messages
their own unique identifier, we created a piece of code in R that loads all .eml files
from each folder and moves them into a single folder. This process assigned each
message a name corresponding to its Message-ID instead of its subject title
(Feinerer & Kurt, 2012; Feinerer, 2012). Some of the messages posted contain
citations, which are pieces of text that refer to a topic in an earlier message. We
decided to remove citations, as we did not consider them valid information. The
removal of the citations was accomplished through text mining software in R
(Feinerer & Kurt, 2012; Feinerer, 2012). This left us with a total of 440,036 messages,
a reduction of 16.8%, which happened automatically as each duplicate was overwritten
in the process. The additional pieces of information, shown in Appendix C, were
removed during this step, leaving only the Message-ID and the text without citations
for further analysis.
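As an illustration of this step, the sketch below shows one way the renaming and citation removal could be done in base R. The folder names, the assumption that citation lines start with ">", and the regular expressions are ours; this is not the routine actually used in the study.

# Minimal sketch (assumed folder layout): rename .eml files by Message-ID and
# strip quoted citation lines, writing the cleaned text to a single folder
eml_files <- list.files("lugnet", pattern = "\\.eml$", recursive = TRUE, full.names = TRUE)
dir.create("clean", showWarnings = FALSE)

for (f in eml_files) {
  msg     <- readLines(f, warn = FALSE)
  id_line <- grep("^Message-ID:", msg, value = TRUE)[1]
  msg_id  <- gsub("^Message-ID:\\s*<|>\\s*$", "", id_line)   # keep only the identifier
  body    <- msg[!grepl("^>", msg)]                          # drop citation lines
  writeLines(body, file.path("clean", paste0(msg_id, ".txt")))
}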
5.2 - Construction of target variable
The concept of an idea is a complex one. In light of this, the task of creating a
reliable target variable becomes important for the performance of the classifier. We
define our classification task as a binary classification task, as most of the reviewed
research has been within this particular type of task, and we consider it reasonable to
handle our problem in this way.
We used human judges to create the training set, and Cohen's kappa statistic
(κ) to assess the reliability of our target variable. This measure was chosen because it
corrects for the agreement that occurs simply by chance: κ adjusts for this chance
agreement and provides a measure of reliability for the concept assessed. κ relies on
three assumptions: (1) that units of observation are independent; (2) that the categories
of the scale are independent, mutually exclusive and exhaustive; and (3) that the judges
operate independently (Cohen, 1960). We used the benchmark scale suggested by
Landis and Koch (1977) and define κ < 0 as poor, 0 < κ ≤ 0.20 as slight,
0.20 < κ ≤ 0.40 as fair, 0.40 < κ ≤ 0.60 as moderate, 0.60 < κ ≤ 0.80 as substantial, and
0.80 < κ ≤ 1 as almost perfect (Landis & Koch, 1977). The process of constructing the
target variable would result in a training set containing the classifications from each
judge and the Message-ID as unique identifier.
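The sketch below shows how the κ point estimate can be computed in R for two judges; the helper function and the toy classification vectors are ours, and the confidence bounds reported with κ in the results section are not computed here.

# Cohen's kappa for two judges' binary classifications
cohens_kappa <- function(judge1, judge2) {
  tab <- table(judge1, judge2)
  n   <- sum(tab)
  p_o <- sum(diag(tab)) / n                      # observed agreement
  p_e <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  (p_o - p_e) / (1 - p_e)
}

# Toy example: ten messages coded as idea (1) or non-idea (0) by each judge
judge1 <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)
judge2 <- c(1, 0, 0, 0, 0, 0, 0, 1, 0, 1)
cohens_kappa(judge1, judge2)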
The process of constructing the target variable was divided into five steps. In
the first step, we extracted 300 posts and classified them manually with the help of three
judges. In the second step, we built an intermediate model from the 300-case training
set we had created. In the third step, we extracted 3,000 posts with the help of the
intermediate model; these posts were more likely than randomly sampled posts to
contain an idea. In the fourth step, we first had three judges classify 300 messages in
order to pick the two judges with the highest pairwise κ, and these two judges would
then manually classify the 3,000 messages. Finally, in the fifth step, we assessed κ for
the classifications of the two judges. Each step is explained in detail below.
Step 1: Initial assessment
In the first step we extracted 300 documents from a sub forum within Lugnet called
“Dear Lego”. The purpose of this sub-forum is to allow people to send open letters to
Lego, implying that there might be a higher likelihood that people will express ideas
within this sub forum, compared to the likelihood of extracting an idea outside the sub
forum. We had three judges (people connected to the project) classify the
messages. We assessed κ in order to identify whether it is possible to extract ideas
with at least a slight degree of reliability.
Step 2: Intermediate model
Based on the 300 classifications, we created an intermediate model to help increase
the likelihood of attaining a training set with a higher degree of balance. This model
was created with a binary weighting scheme, 3-grams, stopword removal, pruning
above 0.99 and below 0.01, information gain for feature selection and a linear support
vector machines optimized with regards to the soft-margin hyper parameter C.
Step 3: Extract 3,000 cases with improved event frequency
In this step we applied the intermediate model to the entire corpus of 440,036
messages and selected the 1,500 messages that were most likely to contain an idea,
together with 1,500 messages chosen at random. Before extracting the 3,000 messages,
we discarded all messages that were more than 2.5 standard deviations away from the
mean with regard to token count, which corresponds to the size of the messages. The
exclusion based on message length was done because some messages were extremely
short and some were extremely long. The argument for taking message length into
consideration was that reading long posts was more likely to tire our judges out, while
very short messages would contain no information.
Step 4: Manually classify 300 cases to assess classification reliability
In this step, three student helpers were recruited, of whom only two would do the
classifications of the 3,000 cases. First we created a dataset of 300 messages and
distributed it to each student helper. Each student helper was instructed to do the
classifications independently. We used κ to assess reliability between judges, and
the two judges with the highest pairwise κ then did the rest of the classifications.
Step 5: Manually classify 3000 cases to construct the training set
Having chosen two of the three judges, these two judges then classified the 3,000
messages. Again both judges were instructed to do the classifications independently,
and again κ was used to assess the reliability of the judges. As this was the final step
in the process of creating our target variable, we would after this step have our
training set, which we could use for modelling the concept of an idea. We decided to
use the ideas identified by at least one judge as positive cases in our training set.
5.3 - Modelling the concept of an idea
Having created our training set, we used it to train our model. This section describes
the steps we performed in order to build the model. We did this in seven steps, which
are described in detail in the following subsections.
5.3.1 - Data exploration
In order to get a feel for our data, we used the feature selection method of
information gain to identify the twenty most discriminative terms between the two
classes of positives and negatives. This gave us an idea of how our judges had made
their decisions, and of what terms are good predictors of whether a message contains an
idea. For the task of assessing the twenty most discriminative terms, we removed stop
words, extracted 3-grams and pruned above 0.99 and below 0.01. Hereafter, we
looked at the topics discussed in the extracted cases. We looked at the positive cases
and negative cases separately in order to assess whether we could identify a pattern in
what people discuss within ideas and non-ideas. We extracted four topics and five
terms for each topic. We used the topicmodels package in R to perform this analysis,
removing stop words, using stemming and using term occurrences as term weighting
scheme (Grün & Hornik, 2011). We used the entire sample of positive cases and
negative cases.
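A minimal sketch of this topic extraction with the topicmodels package is shown below; the document-term matrix dtm (built with term occurrences, stemming and stopword removal as described above) and the fixed seed are assumed inputs, not the original objects.

library(topicmodels)

# dtm is assumed to be a DocumentTermMatrix of the positive (or negative) cases
lda_fit <- LDA(dtm, k = 4, control = list(seed = 1234))
terms(lda_fit, 5)   # the five most probable terms for each of the four topics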
5.3.2 - Data partitioning
As a first step, we partitioned our data and created two datasets, a training set and a
test set. Included in the training set was a validation set, and the test set was to be put
aside for assessing our candidate models. We did a 70/15/15 split, meaning that the
training set consisted of 70% of the total cases, the validation set of 15% of the cases,
and the test set of 15% of the cases. We used the two
validation techniques of 10-fold cross validation and split validation.
When we used cross validation, we undersampled the majority class to fit the
minority class and assessed performance on the validation set. When choosing among
candidate models, we undersampled as well, but we used split validation instead and
assessed performance on the test set, rather than the validation set.
As debated earlier, it is not obvious whether undersampling or oversampling
produces the best results; we therefore also created a training set where we
oversampled. We did not use this training set until we had our final model. We stress
that the training set and validation set were kept separate from the test set during the
entire process, and the test set was only used at the end for choosing among
candidate models.
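The sketch below illustrates one way the 70/15/15 split and the random under- and oversampling could be implemented in base R; the data frame dat and its binary column idea are hypothetical names.

set.seed(42)
n  <- nrow(dat)
ix <- sample(n)                                   # random permutation of the rows
train <- dat[ix[1:floor(0.70 * n)], ]
valid <- dat[ix[(floor(0.70 * n) + 1):floor(0.85 * n)], ]
test  <- dat[ix[(floor(0.85 * n) + 1):n], ]

pos <- train[train$idea == 1, ]
neg <- train[train$idea == 0, ]

# Undersampling: shrink the majority class to the size of the minority class
train_under <- rbind(pos, neg[sample(nrow(neg), nrow(pos)), ])

# Oversampling: resample the minority class with replacement up to the majority size
train_over  <- rbind(neg, pos[sample(nrow(pos), nrow(neg), replace = TRUE), ])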
5.3.3 - Classification algorithms
Based on the results presented in Table 1 as well as the arguments presented earlier,
we used the support vector machine and naïve Bayes as classification algorithms. We
did not expect the naïve Bayes to perform better than the support vector machine, but
as stated earlier, naïve Bayes can provide a baseline as well as achieve a high
performance relative to a small training set. We used the linear support vector
machine to choose modelling settings, and in the end we added a support vector
machine with a radial basis kernel. All our support vector machine classifiers were
trained using a grid search for the optimal C and γ values.
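A minimal sketch of such a grid search with the e1071 package is given below; the candidate grids and the objects x_train and y_train are illustrative assumptions, not the values or data used in the study.

library(e1071)

# Cross-validated grid search over the soft-margin parameter C (cost) and gamma
tuned <- tune.svm(x_train, y_train,
                  kernel = "radial",
                  cost   = 10^(-1:3),
                  gamma  = 10^(-4:0))
tuned$best.parameters   # the C and gamma combination with the lowest CV error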
5.3.4 - Term weighting scheme
We could not assume that the problems mainly researched in the existing literature
(classifying spam and poems) are similar to classifying ideas, so we needed to assess
which weighting scheme performs better on the particular task of detecting ideas. Our
setup included four different weighting schemes - binary weighting scheme, term
occurrences, normalized term frequency and term frequency inverse document
frequency. Neither could we expect the support vector machine and the naïve Bayes
algorithm to have the same preferences with regards to weighting schemes, so we
applied all four weighting schemes to both classifiers. We chose to apply stopword
removal, 3-grams and pruning above 0.99 and below 0.01 as the setup for each
individual weighting scheme, giving us a total of eight different scenarios and 2,241
features in all the settings. We assessed the performance of the weighting schemes by
comparing accuracies using a Bonferroni-adjusted paired t-test. We denoted the mean
accuracies x̄1 and x̄2 respectively. In order to extract the accuracy measures we applied
ten-fold cross validation and, based on these measures, we calculated the mean
difference, denoted d, for each pair of weighting schemes and tested whether d was
significantly different from zero (Yu, 2008). We then chose the single best weighting
scheme for each classifier.
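The sketch below illustrates one such comparison in R; the two vectors of per-fold accuracies are hypothetical, and the number of comparisons used for the Bonferroni adjustment (twelve, as in the term-weighting block of Table 3) is our reading of the setup.

# Paired t-test on ten-fold cross-validation accuracies with Bonferroni correction
acc_tfidf <- c(0.83, 0.81, 0.84, 0.82, 0.80, 0.85, 0.83, 0.82, 0.81, 0.84)  # hypothetical folds
acc_bin   <- c(0.78, 0.77, 0.80, 0.76, 0.78, 0.79, 0.77, 0.78, 0.76, 0.79)

res <- t.test(acc_tfidf, acc_bin, paired = TRUE)
d   <- mean(acc_tfidf - acc_bin)                               # mean accuracy difference
p_adj <- p.adjust(res$p.value, method = "bonferroni", n = 12)  # adjust for 12 comparisons
c(d = d, p.value = p_adj)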
5.3.5 - Data processing steps
For choosing among processing steps we applied the best term weighting scheme
from the previous step to each classifier. We focused on the two processing steps of
n-grams and stemming. We used n-grams to extract more semantics, and we decided
to use 3-grams, which also include 2-grams and unigrams. We removed stop words
to get a lower number of terms and thereby reduce computational costs, as well as
filter out noise. This led to three setups, which were only stemming, only 3-grams and
neither of them. We pruned above 0.99 and below 0.01 and removed stopwords in
all settings. We applied a paired t-test with Bonferroni correction, as with the term
weighting schemes, and we picked one processing setup out of the three for each
classifier. This left us with two different classifiers, with an individual weighting
scheme and an individual processing setup to continue with.
An important point about this approach is that we also applied a feature
selection technique in order to keep the number of features constant. We have earlier
discussed that classifier performance can be influenced by the number of features, and
as we faced different numbers of features given 3-grams (2,241 features), stemming
(1,018 features) and none of these (1,234 features), we needed to keep the feature
level constant in order to avoid the potential effect of feature number on classifier
performance. We did this by applying information gain for feature selection in all
settings and using only the top 1,018 features.
5.3.6 - Feature selection methods
For feature selection we used information gain and the chi-square statistic. We
assessed performance in terms of accuracy, recall and precision, and set up a grid
search to pick the best percentage threshold of features for each classifier. This gave
us two setups for each classification algorithm, from which we picked the best
performing feature selection technique for each classifier.
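A minimal sketch of these two feature selection methods with the FSelector package is given below; the data frame train_df, with one column per feature and a factor column idea as the target, is a hypothetical stand-in for the weighted document-term data.

library(FSelector)

ig_weights  <- information.gain(idea ~ ., data = train_df)   # information gain per feature
chi_weights <- chi.squared(idea ~ ., data = train_df)        # chi-square statistic per feature

# Keep the features above a 90% threshold, as in the grid search described above
top_ig  <- cutoff.k.percent(ig_weights, 0.9)
top_chi <- cutoff.k.percent(chi_weights, 0.9)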
5.3.7 - Choice of final model
Having assessed the performance given term weighting scheme, processing steps, and
feature selection methods, we had the settings for training two classifiers. As it has
been debated whether the support vector machine actually benefits from feature
selection, we also set up a linear support vector machine without any feature selection
technique, and a support vector machine with a radial basis kernel, also without any
feature selection technique. This gave us a total of four classifier setups: the first was
a naïve Bayes with a given feature selection method, the second was a linear support
vector machine with a given feature selection method, and the third and fourth were a
linear support vector machine and a support vector machine with a radial basis kernel,
both with no feature selection technique. We assessed ROC, F-measure, recall,
precision and accuracy on the test set, and we used split validation with the
undersampled training set and applied the optimal feature threshold from the previous
step to the classifiers that utilize feature selection.
Based on the assessment of the four candidate models, we picked our final
model. Instead of using the undersampled training set, we used the oversampled
training set, and assessed whether there was any improvement in performance. Having
chosen our final model we applied the model to all messages from the forum, giving
us a dataset with a date variable and a prediction variable, allowing us to determine to
what degree variability in idea generation within our online community can be
explained by month and/or year over a given time period.
5.4 - Effect of seasonality and historical events on idea generation
How can idea generation within crowdsourcing communities be explained by
seasonality and historical events? In this section we will answer this question by (1)
exploring the data in the same manner as earlier; (2) creating our dataset and handling
missing data; (3) assessing the nature of our variables and performing the necessary
variable transformations; (4) defining the regression model; and (5) assessing
parameter estimates and goodness-of-fit.
We performed the same data exploration as we did after extracting the training
set. In particular we extracted an even number of positive and negative cases and
assessed the twenty most discriminative terms. This second comparison is relevant for
two reasons. Firstly we could assess if our model actually extracted the same pattern
as our judges, and secondly we could assess the difference between messages
containing ideas and non-ideas. We extracted a sample of 2,500 ideas and 2,500 non-
ideas and used the same setup as earlier.
Having trained a model, we applied the model to the entire corpus of 440,036
messages in order to detect the ideas in the entire forum. Based on the dates of the
particular messages, a dataset was created containing a count of ideas and a count of
total messages posted within a given month and year. This gave us a dataset with the
four variables, MONTH, YEAR, IDEA and ACTIVITY, where IDEA and
ACTIVITY are counts of messages containing ideas and total message count
respectively. In this step we also assessed missing data and omitted years with too
many missing months.
In order to adjust for the relationship between number of ideas created and
number of messages posted on the forum, we derived a variable which we named
event rate per month (ERPM), by calculating IDEA as a ratio of ACTIVITY for each
row in our dataset. To create a dependent variable that fulfilled the requirements of
linear regression, we did a logit transformation of ERPM (LN.ERPM), giving us a
continuous dependent variable and two categorical independent variables, MONTH
and YEAR.
Since the forum has been inactive for the last four years, we omitted these
years from our analysis and then assessed whether our model fulfilled the assumptions
of linear regression. We assessed the model assumptions by inspecting how the error terms
were distributed in a standard residuals plot and a histogram of the residuals. After
this we defined our model.
Finally we assessed the results of our regression model, where we reported the
total amount of variance explained (R²). We reported coefficient estimates and their
corresponding p-values.
6 - Results
6.1 - Reliability of the manual classification of target variable
The initial assessment of the coding scheme was performed on 300 randomly selected
posts from the ‘Dear Lego’ sub-forum. Judge one classified 10% of the posts to be
containing an idea, judge two classified 6.7%, and judge three classified 3%. The
average number of ideas extracted was 6.6%. Calculating κ, we found that κ = 0.391 ±
0.18 at α = 0.05 for judge one and judge two, κ = 0.382 ± 0.193 at α = 0.05 for judge
one and judge three, and κ = 0.604 ± 0.21 at α = 0.05 for judge two and judge three,
which on our benchmark scale ranges from fair to substantial agreement.
Based on the classifications from the initial assessment, we constructed an
intermediate model to extract 3,000 messages with higher likelihood of containing an
idea than if sampled at random. Of these 3,000 messages, 300 were selected at
random and distributed to three new judges. Of the 300 cases, judge one classified
13.3%, judge two classified 6.7%, and judge three classified 11% as containing an
idea. (Average = 10.33%). κ = 0.451 ± 0.161 at α = 0.05 for judge one and judge two,
κ = 0.444 ± 0.173 at α = 0.05 for judge two and judge three and κ = 0.486 ± 0.15 for
judge one and judge three at α = 0.05, which we consider moderate reliability.
We received the remaining cases with a minor note from both judges, saying
that two posts were duplicates of other posts, despite having different Message-IDs.
These two posts were excluded, leaving us with a training set of 2,998 cases. From
the 3,000 cases, judge one classified 8.7% of the cases as containing an idea, and
judge two classified 6.9% as containing an idea. The average share of ideas extracted
was 7.84%, compared to 6.56% for the messages extracted in the initial assessment,
which were taken from a sub-forum with a higher likelihood of ideas occurring. For
the two remaining judges κ = 0.548 ± 0.056 at α = 0.05, which we consider moderate
agreement. As a final comment we note that both judges agreed on 137 positive cases,
whereas a further 200 cases were considered an idea by only one of the two judges.
This gave us a training set of 337 positive cases and 2,661 negative cases. We decided
to continue with as many positive cases as possible, as we considered a training set
with 137 positive cases too small.
Summary of results
The important results for this part of the study include the training set of 2,998 cases,
with 337 of them positive, and a κ statistic of 0.548 ± 0.056 at α = 0.05, which we
consider moderate reliability.
6.2 - Detecting ideas
Having created a training set with a reliable target variable, we built our model by (1)
partitioning the data based on the distribution of our target class; (2) performing an
exploratory analysis to assess potential patterns in the data; (3) assessing the
consequences of varying the term weighting scheme, (4) assessing the consequences
of varying the processing steps; (5) assessing the consequences of varying the feature
selection methods; and finally (6) assessing the performance of our candidate models
and picking the best performing model.
6.2.1 - Data partitioning
Our training set contained 2,998 cases, with a distribution of 337 positive cases and
2,661 negative cases. This gave us an unbalanced dataset, as the positive ratio was
approximately 11:100. Based on a 70/15/15 split we achieved a training set with 236
positive cases and 1,863 negative cases. For the validation and test sets this left
51 positive cases and 399 negative cases each. Applying random undersampling, our
cross validation training set had 286 positive cases and 286 negative cases, while
using random oversampling it had 2,262 positive cases and 2,262 negative cases. The
test set we left untouched no matter what technique we applied in order to balance the
dataset.
6.2.2 - Exploratory analysis
Before doing the actual modelling, we explored the data extracted. Table 2 shows the
twenty most discriminative terms or n-grams, as well as four topics extracted from the
pool of positive cases and four topics extracted from the pool of negative cases.
Table 2 - Twenty discriminative terms and four positive and four negative topics from the training set

Top 20 discriminative terms
1 to 5: would_be, lego, idea, could_be, sets
6 to 10: that_would, you_could, i_would, see, nice
11 to 15: it_would, com, ideas, etc, theme
16 to 20: tlg, like_to_see, to_see, that_would_be, be_a

Idea topics
Topic 1 (Idea): lego, build, idea, product, castl
Topic 2 (Robotics): sensor, robot, motor, control, program
Topic 3 (Trains): train, model, build, track, look
Topic 4 (Lego): brick, lego, color, build, piec

Non-idea topics
Topic 1 (Lego): peopl, dont, time, lego, that
Topic 2 (Lego): lego, build, post, time, dont
Topic 3 (Robotics): robot, train, build, program, lego
Topic 4 (Construct): brick, build, piec, lego, plate
From Table 2 we can see that many of the discriminative terms are n-grams. This
makes sense because when people have an idea or are being creative they use words
like “could be”, “would be”, “you could”, “like to see” and obviously “idea”. The text
piece below shows an example of a message containing some of the identified terms.
That's an awesome idea!
You could make a Buddha, or a Renaissance man!
That would be totally cool.
Assessing the results of the topic modelling, we labelled the idea topics "Idea",
"Robotics", "Trains" and "Lego". The non-idea topics we labelled "Lego", "Lego",
"Robotics" and "Construct". We will not claim that there is a clear distinction between
idea topics and non-idea topics. For example, the topic of robotics seems to be widely
debated in both an idea and a non-idea domain.
6.2.3 - Classifier performance given term weighting and processing steps
When we assessed the results of the support vector machines we found that the
support vector machine with normalized term frequency performed significantly
better than the support vector machine with binary weighting scheme. We discovered
that the support vector machine with term frequency inverse document frequency
performed significantly better than the support vector machine with binary weighting
scheme. The support vector machine with term frequency inverse document
frequency performed significantly better than the support vector machine with term
occurrences. From these results we could exclude the support vector machine with
binary weighting scheme and the support vector machine with term occurrences, but
we decided to proceed with the support vector machine with term frequency inverse
document frequency. Our reason for this was that the support vector machine with
normalized term frequency failed to achieve a significant difference when compared to
the support vector machine with term occurrences.
When assessing the results for the naïve Bayes classifiers we found that the
naïve Bayes with binary weighting scheme performed significantly better than the
naïve Bayes with normalized term frequency. We also discovered that the naïve
Bayes with binary weighting scheme performed significantly better than the naïve
Bayes with term frequency inverse document frequency. From these results we
excluded the naïve Bayes with normalized term frequency and the naïve Bayes with
term frequency inverse document frequency, and we decided to proceed with the
naïve Bayes with binary weighting scheme, due to the fact that we failed to obtain any
significant performance differences with regard to the naïve Bayes with term
occurrences. The results of the mean comparisons for the ten-fold cross validation are
reported in Table 3.
6.2.4 - Classifier and term weighting scheme given processing steps
Assessing the results of the support vector machine with term frequency inverse
document frequency, we found that the support vector machine with term frequency
inverse document frequency and 3-grams performed significantly better than the
support vector machine with term frequency inverse document frequency and
stemming. Moreover, we determined that the support vector machine with term
frequency inverse document frequency with no processing step performed
significantly better than the support vector machine with term frequency inverse
document frequency and stemming. From these results we excluded the support
vector machine with term frequency inverse document frequency and stemming, but
our results could not distinguish between the support vector machine with term
frequency inverse document frequency with 3-grams and the support vector machine
with term frequency inverse document frequency and no processing step. However,
we decided to proceed with the support vector machine with term frequency inverse
document frequency and 3-grams. The argument for choosing the support vector
machine with term frequency inverse document frequency with no processing step
would have been a lower likelihood of overfitting due to the lower number of features,
but as we considered feature selection and feature selection thresholds in the next step,
we preferred to keep as much information as possible at that point.
Assessing the results of the naïve Bayes with binary weighting scheme, we
found that naïve Bayes with binary weighting scheme and stemming performed
significantly better than the naïve Bayes with binary weighting scheme and 3-grams.
We also found that the naïve Bayes with binary weighting scheme and stemming
performed significantly better than the naïve Bayes with binary weighting scheme and
no processing step. From these results we excluded the naïve Bayes with binary
weighting scheme and 3-grams and the naïve Bayes with binary weighting scheme
and no processing step, and we decided to proceed with the naïve Bayes with binary
weighting scheme and stemming. We discovered that the naïve Bayes classifier
seemed to perform better with a dimensionality reduction method, which stemming
can be considered to be. The results of the mean comparisons for the 10-fold
cross validation are reported in Table 3.
6.2.5 - Assessing performance given varying feature selection methods
Given the results so far we decided that the support vector machine should be
modelled with the term frequency inverse document frequency as a weighting scheme
and with 3-grams as a processing step. With regards to the naïve Bayes algorithm, we
decided that this classifier should be modelled with binary weighting scheme and
stemming as a processing step.
Assessing the results of the support vector machine with term frequency
inverse document frequency and 3-grams, given information gain for feature selection,
we found that the optimal feature percentage threshold was 90% of the features. At this
feature threshold the support vector machine with term frequency inverse document
frequency and 3-grams achieved accuracy = 0.892, recall = 0.902 and
precision = 0.885. When using the chi-square statistic for feature selection we
calculated accuracy = 0.873, recall = 0.863 and precision = 0.880. We decided to
proceed with information gain for feature selection, as this method performed better
than the chi-square statistic.
Assessing the results for the naïve Bayes with binary weighting scheme and
stemming, given information gain for feature selection, we discovered that the optimal
feature percentage threshold was 90% of the features. At this feature threshold the naïve
Bayes with binary weighting scheme and stemming achieved accuracy = 0.833,
recall = 0.882 and precision = 0.804. Given the chi-square statistic for feature
selection we calculated accuracy = 0.833, recall = 0.882 and precision = 0.804. We
note that these results were identical, which might seem strange. But if we recall that
the dataset extracted by means of stemming gave 1,018 features, it is not unreasonable
that the two feature selection methods (information gain and the chi-square statistic)
rank the same 10% of features as contributing the least to performance, and thereby
give similar results. As the chi-square statistic is less computationally expensive, we
decided to continue with this feature selection method.
Table 3 - Results of the term weighting, processing and feature selection assessments

Term weighting
Comparison | x̄1 | x̄2 | d | p-value
SVM_NTF vs. SVM_BIN | 0.811 | 0.778 | 0.033 | 0.048
SVM_TF-IDF vs. SVM_BIN | 0.824 | 0.778 | 0.046 | 0.019
SVM_TF-IDF vs. SVM_NTF | 0.824 | 0.811 | 0.012 | 1.000
SVM_TO vs. SVM_BIN | 0.766 | 0.778 | 0.012 | 1.000
SVM_TO vs. SVM_NTF | 0.766 | 0.811 | 0.045 | 0.139
SVM_TO vs. SVM_TF-IDF | 0.766 | 0.824 | 0.058 | 0.015
NB_NTF vs. NB_BIN | 0.645 | 0.694 | 0.049 | 0.015
NB_TF-IDF vs. NB_BIN | 0.643 | 0.694 | 0.051 | 0.023
NB_TF-IDF vs. NB_NTF | 0.643 | 0.645 | 0.002 | 1.000
NB_TO vs. NB_BIN | 0.689 | 0.694 | 0.005 | 1.000
NB_TO vs. NB_NTF | 0.689 | 0.645 | 0.044 | 0.064
NB_TO vs. NB_TF-IDF | 0.689 | 0.643 | 0.046 | 0.150

Processing steps
Comparison | x̄1 | x̄2 | d | p-value
SVM_TF-IDF_NONE vs. SVM_TF-IDF_3G | 0.820 | 0.820 | 0.000 | 1.000
SVM_TF-IDF_STEM vs. SVM_TF-IDF_3G | 0.771 | 0.820 | 0.049 | 0.022
SVM_TF-IDF_STEM vs. SVM_TF-IDF_NONE | 0.771 | 0.820 | 0.049 | 0.004
NB_BIN_NONE vs. NB_BIN_3G | 0.706 | 0.692 | 0.014 | 0.454
NB_BIN_STEM vs. NB_BIN_3G | 0.736 | 0.692 | 0.044 | 0.007
NB_BIN_STEM vs. NB_BIN_NONE | 0.736 | 0.706 | 0.030 | 0.004
Comments: These tables display the results of the different assessments regarding term weighting schemes,
processing steps and feature selection methods. Regarding term weighting scheme and processing step,
one can read the comparison column as x̄1 vs. x̄2, and a positive d value corresponds to x̄1 achieving a
higher accuracy than x̄2. One can interpret the p-value as the degree to which one can be certain that
the true mean accuracy difference is d.
Abbreviations: SVM = Support vector machine, NB = Naïve Bayes, NTF = Normalized term frequency,
TF-IDF = Term frequency inverse document frequency, BIN = Binary, TO = Term occurrences, NONE = No
processing step, 3G = 3-grams, STEM = Stemming
6.2.6 - Assessing candidate models
Having excluded and chosen term weighting schemes, processing steps and feature
selection methods for our two classifiers, we trained the classifiers and assessed the
classifiers on our test set. Recall that we decided to add a support vector machine with
a radial basis kernel, as well as a linear support vector machine with no feature
selection method. This gave us four setups - a linear support vector machine, a
support vector machine with a radial basis kernel, a linear support vector machine
with information gain for feature selection and a naïve Bayes with the chi-square
statistic for feature selection. The processing setups for the support vector machines
were the term frequency inverse document frequency weighting scheme and processing
by 3-grams and stopword removal (yielding 2,241 features), and for the naïve Bayes it
was a binary weighting scheme with stemming (yielding 1,018 features). A ROC chart
of the four classifiers is displayed in Figure 1.
Figure 1 - ROC chart of candidate models
From the ROC chart one can see that, especially in the beginning, the naïve Bayes
classifier with the chi-square statistic as the feature selection technique performed the
worst. The two linear support vector machines performed more or less equally,
whereas the support vector machine with the radial basis kernel performed the best.
The numerical assessment of the classifiers based on the undersampled training
set is displayed in Table 4. We decided to proceed with the linear support
vector machine, due to the fact that this model had the best F-measure. We recognized
that the ROC assessment is in support of the support vector machine with the radial
basis kernel, but this model was not very precise, and as we weighed the combination
of recall and precision higher than the true positive rate, we decided to continue with
the linear support vector machine.
We then applied our oversampled training set, in order to see if we could
enhance performance of the linear support vector machines by adding more
information in terms of the excessive negative cases we had available. We compared
the model trained on the oversampled training set and the original undersampled
training set. The ROC assessment of these two models is displayed in Figure 2.
Figure 2 - ROC chart of support vector machines with an under- and oversampled training set
The above ROC assessment shows that the linear support vector machine with the
oversampled training set performed slightly better than the linear support vector
machine with the undersampled training set. We do not claim that the difference in
performance is large, but it is noteworthy. The numerical assessment of the
undersampled and oversampled models is displayed in Table 4, together with the counts
of true positives, false positives, false negatives and true negatives. The model
performed only moderately on recall and precision, but as these two measures are
balanced, the model achieved a relatively high performance on the F-measure, which is
noteworthy compared to the earlier candidate models. Finally, we note that this
classifier achieved accuracy = 0.911, which is also the highest among the models we
trained. Therefore we selected the linear support vector machine with an oversampled
training set as our final model.
Summary of results
The most important result of this section is that we trained a model with a linear
support vector machine with term frequency inverse document frequency as term
weighting scheme and 3-grams. We used no feature selection method and we used
oversampling. This particular model performed F = 0.608, recall = 0.608,
precision = 0.608, and accuracy = 0.911. All results are displayed in Table 4.
Table 4 - Results of candidate model performance
Performance measure | SVM - Linear | SVM - RBF | SVM - Linear IG | NB - CHI | SVM - Over
F-measure | 0.515 | 0.385 | 0.490 | 0.411 | 0.608
Recall | 0.824 | 0.922 | 0.745 | 0.726 | 0.608
Precision | 0.375 | 0.244 | 0.365 | 0.287 | 0.608
Accuracy | 0.824 | 0.667 | 0.824 | 0.764 | 0.911
TP # | 42 | 47 | 38 | 37 | 31
FP # | 70 | 146 | 66 | 92 | 20
FN # | 9 | 4 | 13 | 14 | 20
TN # | 329 | 253 | 333 | 307 | 379
Abbreviations: SVM - Linear = Linear support vector machine with undersampled training set, SVM -
RBF = Support vector machine with radial basis kernel with undersampled training set, SVM - Linear
IG = Linear support vector machine with information gain as feature selection method with undersampled
training set, NB - CHI = Naïve Bayes with chi-square statistic for feature selection with undersampled
training set, SVM - Over = Linear support vector machine with oversampled training set, TP # = True
positive count, FP # = False positive count, FN # = False negative count, TN # = True negative count
6.3 - Effect of seasonality on idea generation
This section will be divided into five sections. In the first section, we describe the
results of the exploratory analysis of the cases classified by our model. In the second section, we
describe how we created our regression data set and how we handled missing data. In
the third section, we explore our regression variables and explain how we transformed
the dependent variable. In the fourth section, we define the regression model and in
the fifth section, we assess the results of our regression analysis.
6.3.1 - Exploratory analysis
Table 5 shows the twenty most discriminative terms and n-grams as well as the four
topics extracted from the pool of ideas and four topics extracted from the pool of non-ideas.
Table 5 - Twenty discriminative terms and four positive and four negative topics from the prediction set

Top 20 discriminative terms
1 to 5: would_be, lego, idea, could_be, sets
6 to 10: that_would, you_could, i_would, see, nice
11 to 15: it_would, com, ideas, etc, theme
16 to 20: tlg, like_to_see, to_see, that_would_be, be_a

Idea topics
Topic 1 (Robotics): motor, sensor, robot, control, lego
Topic 2 (Lego): lego, brick, piec, color, castl
Topic 3 (Color): black, space, piec, white, reduc
Topic 4 (Idea): idea, lego, build, wrote, post

Non-idea topics
Topic 1 (Writing): set, brick, wrote, lego, write
Topic 2 (Writing): wrote, write, look, lego, time
Topic 3 (Robotics): dat, wrote, motor, file, dont
Topic 4 (Lego): lego, lugnet, site, post, version
We see the same pattern in the discriminative terms as displayed earlier in Table 2,
where n-grams such as "would_be" and "could_be" are the most discriminative terms.
Regarding the results of the topic modelling, topic one of the idea topics is robotics.
The robotics topic also appears among the non-idea topics.
6.3.2 - Creating dataset and handling missing data
In order to predict which posts contain an idea, we applied the linear support vector
machine with the oversampled training set to our entire document collection. In the
process of merging the predictions with the date information based on Message-ID,
624 messages were removed because they did not contain the necessary information
to merge (see footnote 3). This left us with 439,412 observations, which we collapsed
by month and year, yielding a dataset of 206 observations from January 1995 to
November 2012. We noted that the time period from January 1995 to November 2012
contains more than 206 months, but in the start-up period of the forum there were
several months with no activity. This was a problem which we dealt with in the next
step. Besides the two variables YEAR and MONTH, the dataset contained a variable
named ACTIVITY, which was a count of how many messages were posted in a given
month and year. The dataset also contained a fourth variable named IDEA, which was
a count of how many messages posted in a given month and year contained an idea.
This resulted in a total of four variables in the initial dataset. An example of an
observation in this dataset: 5,349 posts (ACTIVITY) were written during February
(MONTH) 2002 (YEAR), and 184 of these posts (IDEA) were classified as containing
an idea.
Footnote 3: When we searched for the errors causing the 624 missing messages, we discovered that the structure of the .eml files is not consistent. Therefore the R routine created to extract the meta-information was not able to detect the Message-ID for the missing 624 posts.
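The sketch below shows how such a monthly aggregation could be computed in R; the merged data frame preds, with one row per message, a date column and a 0/1 idea prediction, is a hypothetical name.

# Collapse per-message predictions into counts per month and year
preds$YEAR  <- format(preds$date, "%Y")
preds$MONTH <- format(preds$date, "%b")   # abbreviated month name (English locale assumed)

monthly <- aggregate(cbind(ACTIVITY = rep(1, nrow(preds)), IDEA = preds$idea),
                     by = list(YEAR = preds$YEAR, MONTH = preds$MONTH),
                     FUN = sum)
head(monthly)   # one row per month/year: total posts and posts containing an idea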
6.3.3 - Variable exploration and variable transformations
We started by exploring the YEAR variable. There were twelve observations for each
year from 1996 to 2011, equivalent to twelve months per year. For the year 1995 there
were only three observations. For 2012, there were eleven observations because the
messages were downloaded in November 2012. As we could not explain the low
number of observations in the year 1995, we omitted this year from the analysis. Our
argument for doing so was that so few observations for a given unit of analysis would
be too sparse to model. For the years 1996, 1997 and 1998 there were only very few
posts, and even fewer ideas, which meant that we also omitted these years from the
analysis. This left us with 167 observations evenly distributed between the years 1999
and 2011, whereas we had only 11 observations for 2012. Figure 3 illustrates the
activity and idea generation inside the forum between January 1999 and November
2012.
The histograms in Figure 4 display the distributions of ACTIVITY and IDEA.
Both are dominated by low values, which we consider reasonable, especially in light
of the relative length of the period in which the forum has been considered inactive.
As it is very reasonable to assume that there is a relationship between ACTIVITY and
IDEA, we derived a variable accounting for this relationship, the event rate per
month (ERPM), which is the IDEA count for a given month within a given year as a
ratio of ACTIVITY for the corresponding month within that year. We define
ERPM as:
Equation 11 - Event rate per month
ERPM = IDEA / ACTIVITY    (11)
Figure 5 shows that ERPM is right-skewed, and that the variability of ERPM within
each year seems to grow over time. We decided to omit the years 2012, 2011, 2010
and 2009. Our argument for doing so was that we assumed the large variability
within these years to be a consequence of too few posts (i.e. the community is dead).
This meant that instead of 167 observations from 1999 to 2012, there were 120
observations from 1999 to 2008. In order to normalize our dependent variable we
applied a logit transformation to ERPM. This gave us a new variable that we named
LN.ERPM, which we define as:
Equation 12 - Logit transformed event rate per month
LN.ERPM = ln(ERPM / (1 − ERPM))    (12)
Figure 6 shows that LN.ERPM is approximately normally distributed. As a final note,
all of the descriptive statistics of our variables are displayed in Appendix D.
6.3.4 - Defining model and assessing model assumptions
We created our regression model with YEAR and MONTH as predictor variables:
Equation 13 - Regression model
y = a + b1·MONTH_Jan + b2·MONTH_Feb + … + b11·MONTH_Dec + b12·YEAR_1999 + b13·YEAR_2000 + … + b20·YEAR_2008 + ε    (13)
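A minimal sketch of how equations 11 to 13 could be fitted in R is shown below; the data frame monthly and its column names are assumptions carried over from the aggregation sketch above, and the March/2005 baseline follows the description given with Table 6.

# Derive ERPM and its logit transform, then fit the dummy-coded regression model
monthly$ERPM    <- monthly$IDEA / monthly$ACTIVITY
monthly$LN.ERPM <- log(monthly$ERPM / (1 - monthly$ERPM))

# Month labels are the abbreviated names produced above; baselines set to March and 2005
monthly$MONTH <- relevel(factor(monthly$MONTH), ref = "Mar")
monthly$YEAR  <- relevel(factor(monthly$YEAR),  ref = "2005")

fit <- lm(LN.ERPM ~ MONTH + YEAR, data = monthly)
summary(fit)   # coefficient estimates, p-values and R squared as reported in Table 6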
From Figure 7 we see that the mean of LN.ERPM given month was stable, although
January, February and July had higher means than the rest of the months. With
regard to LN.ERPM for a given year, we noticed large fluctuations in the mean
of LN.ERPM for the years 1999, 2006 and 2008. There was high variability in
the year 2006, low variability in the year 2007 and then again large variability in
2008. From the two plots in Figure 8 we see that the residuals had approximately
equal variance and were normally distributed.
Figure 3 - Fluctuations in ACTIVITY and IDEA given YEAR
Figure 4 - Histograms of ACTIVITY and IDEA
Figure 5 - Histogram of ERPM and box plot of ERPM from 1999 to 2012
Figure 6 - Histogram of LN.ERPM
Figure 8 - Residuals plot and histogram of residual distribution
Figure 7 - Fluctuations in LN.ERPM given MONTH and YEAR
6.3.5 - Parameter estimates and goodness-of-fit
The results of the regression are shown in Table 6. We used March and 2005 as the
baseline, meaning that the intercept can be interpreted as the predicted value of
LN.ERPM if the month is March and the year is 2005.
Table 6 - Regression results
Coefficient | Estimate | Std. Error | t value | Pr(>|t|)
(Intercept) | -3.588 | 0.131 | -27.447 | 0.000 ***
MONTHJan | 0.381 | 0.140 | 2.726 | 0.008 **
MONTHFeb | 0.252 | 0.140 | 1.803 | 0.074 .
MONTHApr | 0.074 | 0.140 | 0.529 | 0.597
MONTHMay | 0.212 | 0.140 | 1.520 | 0.132
MONTHJun | -0.006 | 0.140 | -0.046 | 0.963
MONTHJul | 0.368 | 0.140 | 2.632 | 0.010 *
MONTHAug | 0.200 | 0.140 | 1.429 | 0.156
MONTHSep | 0.217 | 0.140 | 1.550 | 0.124
MONTHOct | 0.126 | 0.140 | 0.899 | 0.371
MONTHNov | 0.207 | 0.140 | 1.480 | 0.142
MONTHDec | 0.180 | 0.140 | 1.287 | 0.201
YEAR1999 | 0.446 | 0.128 | 3.494 | 0.000 ***
YEAR2000 | -0.031 | 0.128 | -0.245 | 0.807
YEAR2001 | -0.169 | 0.128 | -1.325 | 0.188
YEAR2002 | -0.059 | 0.128 | -0.465 | 0.643
YEAR2003 | -0.174 | 0.128 | -1.361 | 0.177
YEAR2004 | 0.004 | 0.128 | 0.028 | 0.978
YEAR2006 | -0.283 | 0.128 | -2.220 | 0.029 *
YEAR2007 | -0.268 | 0.128 | -2.098 | 0.038 *
YEAR2008 | -0.241 | 0.128 | -1.886 | 0.062 .
Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.313 on 99 degrees of freedom
Multiple R²: 0.401; F-statistic: 3.376 on 20 and 99 DF; p-value: 0.0001
The R² indicates that our model explains 40.1% of the overall variance in the dependent variable (p < 0.0001), which we consider a noteworthy amount. Compared to the baseline, the coefficients of January and July are significant at the 5% level and February at the 10% level. With regard to years, the coefficients of 1999, 2006 and 2007 are significant at the 5% level and 2008 at the 10% level.
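To put these coefficients on a more interpretable scale, they can be mapped back to event rates with the inverse logit. The following is a worked illustration based on the estimates in Table 6, not output from our analysis:

```r
# Predicted ERPM for the baseline (March 2005): inverse logit of the intercept
plogis(-3.588)          # ~0.027, i.e. roughly 2.7 idea messages per 100 posts

# Adding the January coefficient raises the predicted event rate
plogis(-3.588 + 0.381)  # ~0.039
```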
Based on these observations, we assume that Christmas and summer holidays have an effect on idea generation in this domain. The reason might be that people generate ideas when they get new toys and have time to play with them.
7 - Discussion & Conclusion

Main research question
• How are ideas generated in online communities and how can one detect these
ideas by applying text mining and machine learning?
Idea generation in online communities can be seen as a type of problem-solving process. Online communities allow large groups of people to interact, which sometimes results in novel ideas being generated. Ideas are the product of a creative process that requires the individual to go through several phases before the individual or group generates an idea or a solution. The creative ability of the individual is mainly determined by domain knowledge and by intrinsic and extrinsic motivation, whereas the creative outcome, the idea, can take many shapes and can therefore be quite difficult to assess.
Data created in online communities are typically unstructured. The data are often "big" in terms of volume, frequency and variety, necessitating the use of text mining and machine learning techniques. One can organize the unstructured data by means of a bag-of-words model, and in order to transform the unstructured textual data into structured data, it is necessary to perform a variety of processing steps (pruning, tokenization, stemming, the creation of n-grams and the choice of a term weighting scheme). When applying machine learning techniques one needs to be aware of imbalance in the target variable, which can be addressed by random oversampling and/or random undersampling. The choice of feature selection method can influence performance and prevent overfitting. One will also have to choose an appropriate classification algorithm; the support vector machine is a state-of-the-art algorithm, but naïve Bayes is also a reasonable alternative. To assess the performance of the trained classifier, one will have to decide which measures to use. The F-measure in particular is a good performance measure for skewed datasets, but one might also apply accuracy, recall, precision and ROC assessment.
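As an illustration of how these steps fit together, the sketch below outlines such a pipeline in R with the tm and e1071 packages. It is a condensed example under assumed object names (messages, labels) and default parameter choices, not a reproduction of the exact configuration used in this thesis; n-gram tokenization, resampling and feature selection are omitted for brevity.

```r
library(tm)      # text mining infrastructure
library(e1071)   # support vector machine

# 'messages' is assumed to be a character vector of forum posts and 'labels'
# a factor marking each post as idea / non-idea.
corpus <- Corpus(VectorSource(messages))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)   # stemming requires the SnowballC package

# Bag-of-words representation with tf-idf weighting; rare terms pruned away
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
dtm <- removeSparseTerms(dtm, 0.99)

# Linear support vector machine trained on the (possibly oversampled) data
fit <- svm(as.matrix(dtm), labels, kernel = "linear")
```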
In order to create a training set for machine learning, one can capture idea generation by extracting messages from an online forum in textual format and have judges manually classify these messages as containing an idea or not. In our particular study we extracted 337 ideas out of 2,998 cases, with a reliability measure of 0.548 ± 0.056 at α = 0.05, which we turned into a classification model: a linear support vector machine using term frequency-inverse document frequency (tf-idf) as the term weighting scheme, 3-grams, stopword removal and an oversampled training set. In our particular study this resulted in an F-measure of 0.608, a recall of 0.608, a precision of 0.608 and an accuracy of 0.911 for our final model.
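As a small consistency check on these figures, the F-measure is the harmonic mean of precision and recall, so identical precision and recall necessarily give the same F-measure:

```r
# Harmonic mean of precision and recall (F1)
f_measure <- function(precision, recall) 2 * precision * recall / (precision + recall)
f_measure(0.608, 0.608)  # 0.608, matching the reported values
```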
Secondary research question
• To what degree do seasonality and historical events influence idea generation
inside online communities?
From our regression study we learned that 40.1% of the variation in our target variable could be explained by seasonality and historical events. In particular, we noticed the significant deviations of January, February and July, implying that Christmas and summer holidays enhance idea generation in our Lego case. We also learned that the effect on idea generation of the years 1999, 2006, 2007 and 2008 was significantly different from the baseline. This implies that factors other than seasonality might have played a role in the forum's ability to create ideas over a period of years.
Implications of method for creating training set
Our approach to creating the training set gave us a reasonably sized and balanced training set. However, the method of building an intermediate model and using it to extract a higher number of positive cases has implications. The intermediate model was trained on 12 positive cases and 288 negative cases, a rather small and skewed data set. This may have created a bias towards a certain type of idea, which is a downside of our method. However, we did adjust for this problem, as we only used the intermediate model for extracting 1,500 cases, whereas the remaining 1,500 cases were chosen at random.
Another issue related to the training set is our decision to classify a message as a positive case if at least one judge had classified the message as an idea. Alternatively, we could have set the threshold higher by requiring both judges to agree. This would have yielded even higher skewness in our target variable and a problem of fewer event cases.
Business implications of study
From our study we learned that ideas created inside an online community can be detected through text mining and machine learning. People tend to use certain expressions (n-grams) when being creative. The fact that people write these expressions enables us to detect the ideas by the means we propose. We identified this pattern in our explorative studies by applying an information gain criterion to filter out the twenty most discriminative terms. Many of these terms were n-grams, which confirms that n-grams are more informative than single words when one seeks to extract meaning from text. Reflecting upon this, we believe that the task of detecting ideas in online communities via machine learning is primarily a task of extracting semantics.
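Such an information gain filter can be reproduced in R, for example with the FSelector package; the package choice and the object names below are assumptions for illustration, as this thesis does not prescribe a specific implementation.

```r
library(FSelector)

# 'train_df' is assumed to be a data frame with one column per term/n-gram
# and a factor column 'idea' marking the positive cases.
weights <- information.gain(idea ~ ., data = train_df)

# Names of the twenty most discriminative terms by information gain
cutoff.k(weights, 20)
```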
Our method will allow organizations to filter ideas from old as well as new data sources. For example, an organization with a corporate Facebook page would be able to download all the messages posted on its Facebook page and use our model to filter out the posts containing ideas. As discussed earlier, crowdsourcing often requires a specific software platform designed to allow a crowd to solve problems. Such a platform relies on the crowd to do the filtering, and as we demonstrated, having such a platform might not be necessary. Our model will allow organizations to utilize the wisdom of the crowd on other platforms, such as a corporate Facebook page or a message board like Lugnet.
If, for example, Lego would like to implement such a model, Lego would first need to train the model. Next, Lego would need to download the text content to which they would like to apply the model and perform the text processing steps described in this thesis. Based on these two steps, one can apply the model to the new text data and filter out the messages the model classifies as ideas. As the model is not perfect, it would be necessary to have human judges sort the ideas detected by the model. The primary task of the judges is to assess whether the content of the detected ideas is useful, and for which product development team within Lego the ideas might be useful. Doing this manual classification would also improve the model and keep it up to date, because one would then be able to retrain the model each time an idea is confirmed or disconfirmed by a human judge.
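What this deployment step might look like in R, reusing the objects from the earlier pipeline sketch (fit, the training document-term matrix train_dtm, and a character vector new_messages); all names are assumptions and the retraining loop is only indicated in a comment:

```r
# Build a document-term matrix for the new messages, restricted to the training
# vocabulary (the same preprocessing as for the training corpus would be applied first).
new_corpus <- Corpus(VectorSource(new_messages))
new_dtm <- DocumentTermMatrix(new_corpus,
                              control = list(dictionary = Terms(train_dtm),
                                             weighting  = weightTfIdf))

# Messages the model flags as ideas go to the human judges; their verdicts can
# then be appended to the training set and the model retrained.
predicted <- predict(fit, as.matrix(new_dtm))
flagged   <- new_messages[predicted == "idea"]   # "idea" is the assumed positive label
```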
An organization utilizing crowdsourcing for new product development should consider the deviations in creativity due to seasonality and historical events. As we have shown, there are certain time-dependent events where people are more likely to discuss ideas in online communities. In a business context this means that if Lego would like to utilize crowdsourcing for producing new ideas, they should take these events into account. As an example, if Lego follows a yearly product launch cycle, they should consider that their crowdsourcing community is very likely to produce more ideas around the summer holidays and Christmas. Assuming this is the case, it makes sense to schedule future product launches with this in mind. To give a specific example, the development of next year's Lego Christmas toy collection (if such a collection exists) should start in February, as the crowdsourcing community would have peaked in its production of ideas at this point.
Detecting ideas by means of text mining and machine learning
As a final personal note, it is our impression that ideas can be detected in online communities, as shown in this thesis. However, one might consider reframing our classification task as "detecting the creative process in online communities", because n-grams like "you could", "one could", "we could" etc. are expressions of individuals who are currently in the early stages of the creative process. Framing the task as "detecting ideas" emphasises the final outcome rather than the creative process, and we do not believe the final outcome is realistic to detect by the means applied in this thesis. We do, however, believe that detecting messages which are part of a creative process is very realistic, as shown in this thesis. For future research this can help us discover where ideas are generated, as we consider it a reasonable assumption that if one can detect creative messages within a thread on e.g. Facebook, the thread is likely to contain one or several ideas.
Appendix A - Message view at www.lugnet.com
Source: “http://news.lugnet.com/dear-Lego/?n=10”
Comment: This document is shown through a regular Internet browser
Appendix B - Message in .eml format
Comment: This document is shown through a regular mail software
Appendix C - Message in .txt format
Comment: This document is shown through a text editor
Appendix D - Descriptive statistics of regression data
Variable   n     mean     sd       median   trimmed  mad      min     max      range    skew   kurtosis  se
ACTIVITY   120   3495.95  2373.25  3322.50  3315.27  2989.66  338.00  9369.00  9031.00  0.50   -0.76     216.65
IDEA       120   114.59   88.28    98.00    105.74   94.89    4.00    420.00   416.00   0.87   0.40      8.06
ERPM       120   0.03     0.01     0.03     0.03     0.01     0.01    0.11     0.10     2.52   11.93     0.00
LN.ERPM    120   -3.48    0.37     -3.50    -3.48    0.31     -4.74   -2.07    2.67     0.17   2.02      0.03