Magazine recommendations based on social media trends · 2014-01-20 · Magazine recommendations...
Magazine recommendations based on social media trends
Steffen Karlsson
Kongens Lyngby 2014
B.Eng-2014

Technical University of Denmark
Department of Applied Mathematics and Computer Science
Matematiktorvet, building 303B,
2800 Kongens Lyngby, Denmark
Phone +45 4525 [email protected] B.Eng-2014
Summary (English)
Issuu uses a recommendation engine to predict what a given reader will enjoy. It is based on collaborative filtering, such as the reading history of other similar users, and on content-based filtering, reflected in the document's topics. So far all of those parameters are completely isolated from any external (non-Issuu) sources, which causes the Matthew Effect. This project, done in collaboration with Issuu, is a first attempt to solve the problem by investigating how to extract trends from social media and incorporate them to improve Issuu's magazine recommendations.
Popular social media networks have been investigated and evaluated, resulting in the choice of Twitter as the data source. A framework for spotting trends in the data has been implemented. To map trends to Issuu, two approaches have been used: the Latent Dirichlet Allocation model and the Apache Solr search engine.
Summary (Danish)
Issuu uses a recommendation system to predict what a given reader will enjoy. It is based on collaborative filtering, such as the reading history of similar users. It is also based on content-based filtering, reflected in the document's theme, etc. So far all of these parameters are completely isolated from external (non-Issuu) sources. This project was carried out in collaboration with Issuu and is the first attempt to solve the problem. This is done by investigating how trends can be extracted from social media and integrated to improve Issuu's magazine recommendations.
Popular social media have been investigated and evaluated, which resulted in Twitter being chosen as the data source. A system for spotting trends in the data has been implemented. Two different methods have been used to integrate the trends into Issuu: the Latent Dirichlet Allocation model and the Apache Solr search engine.
Preface
This thesis was prepared at the Department of Applied Mathematics and Computer Science at the Technical University of Denmark (DTU) in fulfillment of the requirements for acquiring a B.Eng. in IT. The work was carried out in the period September 2013 to January 2014.
I would like to thank my supervisor Ole Winther from DTU, my external supervisor Andrius Butkus, and Issuu for spending time and resources on having me around.
Lyngby, 10-January-2014
Steffen Karlsson
Contents

Summary (English)
Summary (Danish)
Preface

1 Introduction
  1.1 Problem definition
  1.2 Social media
  1.3 What is a trend?
  1.4 Related work
  1.5 Methodology
  1.6 Expected results
  1.7 Outline

2 Mining Twitter
  2.1 Twitter API
  2.2 Tweet's location problem

3 Trending framework
  3.1 Raw data
  3.2 Normalizing data
  3.3 Detecting trends
  3.4 Recurring trends
  3.5 Trend score
  3.6 Aggregating trends

4 From trends to magazines
  4.1 LDA
  4.2 Using LDA
    4.2.1 Results using LDA
  4.3 Solr
  4.4 Using Solr
    4.4.1 Results using Solr

5 Conclusion
  5.1 Improvements of the trending framework
  5.2 Improvements of the LDA model
  5.3 LDA vs. Solr

A Dataset statistics
  A.1 Location
  A.2 Hashtag

B Example: #bostonstrong

C Implementation details
  C.1 Flask
  C.2 Peewee
  C.3 Database
  C.4 MySQL

Bibliography
List of Figures

1.1 Typical patterns for slow and fast trends
1.2 Project flowchart
2.1 Mining Twitter flowchart
2.2 Visualization of the problem with the location
2.3 Visualization of the solution to the location problem
3.1 Total tweets per hour
3.2 Raw tweet count for hashtags
3.3 Weighted tweet count per hour
3.4 Normalized hashtags
3.5 Example sizes of w and r
3.6 w - r, where r = 2 hours
3.7 w - r, where r = 24 hours
3.8 Displaying use of threshold in the trending framework
3.9 E/R Diagram v2
4.1 Plate notation of the LDA model [Ble09]
4.2 LDA topic simplex, with three topics
4.3 Representation of topic distribution using dummy data
4.4 #apple tag cloud
4.5 From trend to magazines flowchart
4.6 Topic distribution for #apple tweets
4.7 Subset of the similar #apple documents using LDA
4.8 Example of tokenizing and stemming
4.9 Subset of the similar #apple documents using Solr
5.1 Supported languages by Issuu
5.2 Translation module to improve the solution
5.3 Three simultaneously running trending frameworks
5.4 Top words in the topics
5.5 LDA per page solution
A.1 Top and bottom 10 of used locations
A.2 Top and bottom 10 of used hashtags
B.1 Total tweets per hour
B.2 Raw tweet count for hashtags
B.3 Fully processed data
B.4 #bostonstrong tag cloud
B.5 Subset of #bostonstrong LDA documents
B.6 Subset of #bostonstrong Solr documents
C.1 E/R Diagram
Chapter 1
Introduction
Issuu¹ is a leading online publishing platform with more than 15 million publications - a pool that keeps growing by more than 20 thousand new ones each day. The main challenge for the reader then becomes navigating and discovering interesting content among the vast number of documents. To solve the problem, Issuu uses a recommendation engine for predicting what a certain reader might enjoy.
Currently a whole range of parameters is part of Issuu's recommendation algorithm: the reader's location and language preferences (context), the reading history of other similar users (collaborative filtering [RIS+94], [SM95]), the document's topics (content-based filtering [Sal89]) and the document's overall popularity. There are also editorial and promoted documents. So far all of those parameters are completely isolated from any external (non-Issuu) sources.
¹ www.issuu.com
The main problem is that the same magazines constantly get recommended again and again. This reflects the shortcomings of collaborative filtering rather than the reading habits of Issuu users. Issuu does not allow readers to rate magazines, so the read time is used instead. Naturally, popular magazines gather their read time very quickly and are then hard to beat for newly uploaded ones. They get recommended more, and by that they only become stronger - a phenomenon known as the Matthew Effect [Jac88].
Incorporating local trends (what is happening around the reader) into the recommendations would address this problem and add a bit more freshness and serendipity.
1.1 Problem definition
How can trends be extracted from social media and incorporated to improve Issuu's magazine recommendations?
1.2 Social media
In this project, social media is the data source from which trends can be extracted. There are many social media platforms that could be used as the data source for this project. Their suitability was evaluated based on these parameters:
Data - Defines the format of the data, the amount of data that is available and how semantically rich it is. This is the most important parameter, since it concerns the quality of the data and will directly impact the ability to extract trends. Text is the preferred data format here. The more data, the better, since it adds stability to the resulting trends. Semantic richness is about how much meaning can be extracted from the data.
We should not expect any highly organized and semantically rich taxonomies, since Twitter is a crowd-driven social medium rather than an editorially curated and organized one. In social networks we normally see data being organized as folksonomies², where "multiple users tag particular content with a variety of terms from a variety of vocabularies, thus creating a greater amount of metadata for that content" [Wal05]. Semantic richness in folksonomies comes from multiple users tagging the data with the same labels, which shows that they agree on what it is about. A folksonomy can be narrow or broad. In narrow ones only the creator of the content is allowed to label it with tags, while in broad ones multiple users can label a piece of content. Broad folksonomies are more stable and informative, given that there are enough users to label things, and are the preferred kind in this project.
Real-time - Defines the time from something important happening in the world until it appears on the particular social media network. An API supporting real-time streaming of data is naturally preferable, but a small delay is also acceptable.
Accessibility - Defines whether there are any restrictions in the API that limit the accessibility of data.
The most popular social networks were evaluated based on these three parameters. The one that fitted best turned out to be Twitter³ (see Table 1.1). Facebook⁴ and Google+⁵ scored well on the data and real-time parameters but had to be ruled out due to their limited API access and strict privacy settings. "The largest study ever conducted on Facebook on privacy, showed that in June 2011 around 53% of the profiles were private, which was an increase of 17% over 15 months." [Sag12].
² A term coined by Thomas Vander Wal, combining the words folk and taxonomy.
³ www.twitter.com
⁴ www.facebook.com
⁵ plus.google.com
| Parameter | Positive | Negative |
|---|---|---|
| Data | Average of 58 million tweets each day⁶. Over 85% of topics are headline or persistent news in nature [KLPM10]. | Length of the tweet. Lack of reliability according to the location precision of the tweets. |
| Real-time | Pseudo real-time location based streaming service. | 2 hours behind. |
| Accessibility | API easy accessible and usable. | Unpaid plan limited by a 1% representative subset of data. |

Table 1.1: Evaluation of Twitter's suitability for the project.
LinkedIn⁷ was discarded because of the nature of its data, which is industry and career oriented. Instagram⁸, Pinterest⁹ and Flickr¹⁰ are all big and interesting, but the data they provide is mostly images and thus hard to interpret; their data is also not that close to trending news. The same goes for YouTube¹¹ and Vine¹².
It is worth mentioning that trends can be spotted on Issuu as well. One of the problems is that they have a huge delay, since Issuu users are not as active as users of other social networks. Also, on Issuu trends would have to be inferred from what people read instead of what they are posting or commenting.
⁷ www.linkedin.com
⁸ www.instagram.com
⁹ www.pinterest.com
¹⁰ www.flickr.com
¹¹ www.youtube.com
¹² vine.twitter.com
1.3 What is a trend?
A trend can be understood in many different ways depending on the context - stock market, fashion, music, news, etc. The dictionary defines a trend as:
"a general direction in which something is developing or changing" - Definition in the dictionary.
In this project trends will be considered a bit differently. Basically, Issuu is interested in knowing which topic or event is currently hot in which country (or other, even smaller area) and recommending magazines similar to it. On Twitter trends can be spotted by looking at the hashtags, so in this project trending hashtags and trends will be considered the same thing. A trend is taken to be a "hashtag-driven topic that is immediately popular at a particular time"¹³.
Trends vary in terms of how unexpected they are. Seasonal holidays like Christmas or Halloween are trends, but very expected ones. On the other hand, Schumacher's skiing accident is a very unexpected one. Both types are equally interesting and valuable for Issuu. Another parameter is how quickly the trend is rising. We can have slow or fast trends (see Figure 1.1); the priority is spotting trends that rise fast.
Trend and popularity are not the same thing. If something becomes popular all of a sudden, it is a trend. But if it keeps being popular, it is not a trend anymore.
1.4 Related work
Extracting trends from Twitter is nothing new. The two widely used approaches are parametric and non-parametric. The most popular one is the parametric approach, where a trending hashtag is detected by observing its deviation from some baseline [IHS06], [BNG11], [CDCS10], using a sliding window. It is the simplest of the approaches and still quite successful, based on the assumption that different trends will behave similarly to one another. It is known that this is not the case in the real world - there are many types of trends, with all kinds of patterns.

¹³ www.hashtags.org/platforms/twitter/what-do-twitter-trends-mean/

[Plot: count over a time period of ≈ 24 hours for a fast trend and a slow trend]
Figure 1.1: Typical patterns for slow and fast trends.
To address that problem, non-parametric methods have been used as well [Nik12]. In those, the parameters are not set in advance but are learned from the data instead. Many patterns were observed and grouped into the ones that became trends and the ones that did not. New hashtag patterns can then be compared to the observed ones using Euclidean distance, and the similarity can be used to determine whether a hashtag is trending or not.
The requirements for spotting trends at Issuu are not that strict: there is no need to capture all the trends from a certain day, just the most significant ones. This makes things simpler, and that is why the parametric model was chosen for this project. It is the first time that Issuu is doing a project like this, so the idea was to try the simpler things first to see if they work. If not, the heavier non-parametric models could be applied.
1.5 Methodology
Figure 1.2 illustrates the methodology of the project.
[Flowchart: Twitter → Data → Trending Framework → Trends → Issuu → Documents]
Figure 1.2: Project flowchart
It is important to note early how this trend data will be used by Issuu, because it sets requirements on other parts of the project. Issuu uses Latent Dirichlet Allocation (LDA) [BNJ03] to extract topics from its documents, using the Gensim implementation [ŘS10]. Using the Jensen-Shannon distance (JSD) it is possible to compare documents to one another based on their LDA topic distributions. This allows Issuu, for example, to find documents similar to the one that is being read.
If we can capture trends from social media and express them as text (a "virtual document"), we could calculate the LDA topic distribution for the trend (one text file per trend) and use JSD to find similar documents.
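The comparison of topic distributions can be illustrated with a minimal sketch. It assumes that each document and each trend is already represented as a topic distribution (a list of probabilities summing to 1) and that the distance is the square root of the Jensen-Shannon divergence with base-2 logarithms; whether Issuu uses exactly this variant is not stated here. The function names and the example data are hypothetical.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 add 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence of two topic distributions."""
    # The mixture m is positive wherever p or q is positive, so KL is well defined.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))

def most_similar(trend_topics, documents, top_n=3):
    """Rank documents (name -> topic distribution) by distance to the trend."""
    ranked = sorted(
        documents,
        key=lambda name: jensen_shannon_distance(trend_topics, documents[name]),
    )
    return ranked[:top_n]

# Hypothetical three-topic distributions, for illustration only.
documents = {
    "tech": [0.8, 0.1, 0.1],
    "sport": [0.1, 0.8, 0.1],
    "food": [0.1, 0.1, 0.8],
}
print(most_similar([0.7, 0.2, 0.1], documents, top_n=1))
```

With base-2 logarithms the distance lies between 0 (identical distributions) and 1 (distributions with no topic in common), which makes a fixed similarity cut-off easier to reason about.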
Issuu also uses the Apache Solr¹⁴ search engine, which takes text as input and can return similar documents as output. This is another approach, and it will be investigated whether it can be used as an alternative or complement to LDA.
With all that in mind, the plan is this:
• Access the Twitter API¹⁵ and retrieve tweets from a given country at a given time, storing them in a database.
• Calculate trends from the tweets. The output of this step is the list of trending hashtags per given time window.
• Find out how to feed those trends into both the LDA topic model and the Solr search engine.
• Get documents as the final result and evaluate.

¹⁴ lucene.apache.org/solr/
¹⁵ dev.twitter.com
1.6 Expected results
• Analysis of potential resources for mining data from social media networks, to be used at Issuu as basis for recommendations.
• Data mining algorithms (Python¹⁶) to retrieve all the necessary data.
• An algorithm for extracting trends from tweets.
• A method of feeding trends into the LDA model and Solr.
• Evaluation of the results and final recommendations on the end-to-end solution, for incorporating social media data into Issuu's recommendation engine.
1.7 Outline
Chapter 2 explains how to retrieve tweets from the Twitter API service; these are processed and analyzed in Chapter 3. The trends are fed into the LDA model and the Solr search engine, resulting in similar documents, in Chapter 4. The final recommendations on the end-to-end solution are evaluated in Chapter 5.
¹⁶ www.python.org
Chapter 2
Mining Twitter
This chapter is about retrieving tweets from Twitter and storing them for later trend extraction. The USA was chosen as the country for this project for several reasons. First of all, most Issuu readers are from the USA. Secondly, more than half of Twitter users are from the USA too [Bee12]. Finally, having tweets in English makes things simpler: Issuu's LDA model was trained on the English Wikipedia, so sticking to English tweets means that no translation is needed.
[Flowchart: All tweets in the world → Location filter [USA] → Tweets in USA → Database]
Figure 2.1: Mining Twitter flowchart
The data in Twitter are messages of up to 140 characters called tweets. They often contain some additional meta-data:

| Symbol | Description | Example |
|---|---|---|
| # | Groups tweets together by type or topic, known as a hashtag. | Wow, Mac OS X Mavericks is free and will be available for machines going back as far as 2007? #Apple #Keynote |
| @ | Used for referencing, mentioning or replying to another user. | @alastormspotter: iOS7 will release at around noon central time on Wednesday. |
| RT | Symbolizes a retweet (posting an existing tweet from another user). | RT @ThomasCDec: 50 days to #ElectionDay |

Table 2.1: Additional meta-data used in tweets.
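The meta-data in Table 2.1 can be pulled out of the tweet text with a few regular expressions. The following is a minimal sketch, not the implementation used in the project; the function name is hypothetical, and lower-casing the hashtags (so that #Apple and #apple count as the same tag) is an assumption.

```python
import re

HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")

def parse_tweet_text(text):
    """Extract hashtags, mentions and the retweet marker from a tweet text."""
    return {
        # Assumption: hashtags are compared case-insensitively.
        "hashtags": [tag.lower() for tag in HASHTAG_RE.findall(text)],
        "mentions": MENTION_RE.findall(text),
        "is_retweet": text.startswith("RT "),
    }

print(parse_tweet_text("RT @ThomasCDec: 50 days to #ElectionDay"))
```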
2.1 Twitter API
The Twitter API provides two different calls which may be suitable for this purpose:
GET search/tweets : Part of the ordinary API, i.e. with a rate limit of 450 requests per 15 minutes and continuation URLs, which means that each request returns a finite number of tweets before the next chunk has to be requested.
POST statuses/filter : Part of the streaming API, as mentioned in Table 1.1. On the unpaid plan it is limited to a 1% representative subset of the full dataset.
Twitter uses a three-step heuristic to determine whether a given tweet falls within the specified location, defined as a bounding box¹:
1. If the tweet is geo-location tagged, this location will be used for comparison with the bounding box.
2. A user on Twitter can specify a location in the account settings, which the API calls refer to as place; this will be used for comparison if the tweet is not geo-tagged.
3. If neither of the rules listed above matches, the tweet will be ignored by the streaming API.
The streaming API was chosen because it takes all three heuristics into account, whereas the search API only includes the second. Additionally, with the search API it is difficult to know how frequently to execute the call in order to stay up to date, given its rate limits.
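The three-step heuristic can be sketched in a few lines. This is only an illustration of the logic, not Twitter's implementation: the tweet is modelled as a simplified, hypothetical dictionary with an optional `coordinates` pair (longitude, latitude) and an optional `place_box`, and the place comparison is assumed to be a rectangle overlap test. The `USA_BOX` values are rough illustrative numbers, not the GeoNames coordinates.

```python
# Bounding boxes are (sw_lon, sw_lat, ne_lon, ne_lat) tuples.
USA_BOX = (-125.0, 24.0, -66.0, 50.0)  # rough, illustrative values only

def point_in_box(lon, lat, box):
    """True if the point lies inside the bounding box."""
    sw_lon, sw_lat, ne_lon, ne_lat = box
    return sw_lon <= lon <= ne_lon and sw_lat <= lat <= ne_lat

def boxes_overlap(a, b):
    """True if the two rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def matches_location(tweet, box):
    """Apply the three steps: geo tag first, then place, otherwise ignore."""
    if tweet.get("coordinates") is not None:      # step 1: geo-tagged tweet
        lon, lat = tweet["coordinates"]
        return point_in_box(lon, lat, box)
    if tweet.get("place_box") is not None:        # step 2: place of the tweet
        return boxes_overlap(tweet["place_box"], box)
    return False                                  # step 3: ignored

print(matches_location({"coordinates": (-74.0, 40.7)}, USA_BOX))  # New York
print(matches_location({"coordinates": (-79.4, 43.7)}, USA_BOX))  # Toronto
```

Both calls print True: a rectangle around a country necessarily covers parts of its neighbours as well, so a bounding-box match alone does not guarantee the right country.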
2.2 Tweet’s location problem
A couple of problems were spotted with the location accuracy:
1. Twitter's API supports streaming by location, but only with coordinates given as a SW and NE pair, which defines each country by a rectangle. Figure 2.2 shows the tweets streamed from the USA (tweets that are actually from the USA have been filtered out to provide a better overview).
2. Although the selected bounding box covers the USA and even more, tweets from Guatemala and Honduras are still present (see Figure 2.2).
¹ Two pairs of longitude and latitude coordinates: the south-west (SW) and north-east (NE) corners of a rectangle.
Figure 2.2: Visualization of the problem with the Twitter service, where each red dot represents a tweet. Duration is 1 hour and the number of tweets with a wrong location is 7,240.
Figure 2.3: Visualization of the solution to the Twitter service problem, where each red dot represents a tweet. Duration is 1 hour and the number of tweets is 112,851, which means an error rate of approximately 7%.
Applying the location filter to the streaming API means that the bounding box needs to be known. GeoNames² solves the problem by providing the coordinates needed for all countries.
The two problems spotted regarding the location accuracy turned out to have the same solution. Algorithm 1 checks whether the currently received tweet is from the desired country. The ones that are will be stored in the MySQL³ database for further analysis. Appendix C contains information about the database choice and implementation details, including an E/R diagram.
Another problem occurred: some of the tweets were missing the country code, which means that they could not be processed. To solve this, OpenStreetMap's reverse geocoding API⁴ was used, which can convert longitude and latitude value pairs to a country code.
Algorithm 1: Parse tweet from data
if coordinates in data then
    if place not in data then
        country_code = reverse geocode coordinates
    if tweet.country_code is chosen country_code then
        # Parse the rest of the tweet
        add tweet to database
else
    raise LocationNotAvailableException
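Algorithm 1 can be sketched in Python as follows. The sketch assumes the tweet is the decoded JSON object from the streaming API, with a GeoJSON `coordinates` field (longitude first) and an optional `place` carrying a `country_code`. The reverse geocoder and the database writer are passed in as callables; `reverse_geocode` and `store` are hypothetical names, not part of any library. Country codes are compared case-insensitively as a precaution, since different services may use different casing.

```python
class LocationNotAvailableException(Exception):
    """Raised when a tweet carries no coordinates at all."""

def parse_tweet(data, chosen_country_code, reverse_geocode, store):
    """Store the tweet if it comes from the chosen country; return True if stored."""
    coordinates = data.get("coordinates")
    if not coordinates:
        raise LocationNotAvailableException("tweet has no coordinates")

    place = data.get("place") or {}
    country_code = place.get("country_code")
    if not country_code:
        # Country code missing: fall back to reverse geocoding (lat, lon order
        # of the callable is an assumption of this sketch).
        lon, lat = coordinates["coordinates"]
        country_code = reverse_geocode(lat, lon)

    if country_code and country_code.upper() == chosen_country_code.upper():
        store(data)  # parse the rest of the tweet and add it to the database
        return True
    return False
```

Passing the geocoder and the storage function as arguments keeps the filtering logic testable without network or database access.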
For debugging purposes an interactive tweet map, a graphical and interactive way of visualizing tweets, has been created (used in Figure 2.2 and Figure 2.3). It is a JavaScript/HTML5 based website hosted locally in Python with the Flask⁵ module; implementation details are available in Section C.1.
² www.geonames.org - Licensed under a Creative Commons attribution license, which gives you free access to: Share - to copy, distribute and transmit the work, and Remix - to adapt the work and to make commercial use of it.
³ www.mysql.com
⁴ wiki.openstreetmap.org/wiki/Nominatim/
⁵ flask.pocoo.org
Chapter 3
Trending framework
The previous chapter described how the tweets were collected, how their location accuracy was ensured and how they were stored in the MySQL database. This chapter focuses on how to turn those tweets into trends. "Fast" trends were chosen for this project, because they would have the most impact compared to "slow" trends. Eventually most trends will appear on Issuu, but with a huge delay, since Issuu users are not as active as users of other social networks. The challenge is therefore to reduce this delay.
To illustrate the idea, a time period of three days was chosen, knowing in advance that it contained several trends, in order to test whether the algorithm can find them.
On October 22nd Apple held its annual event, where it presented its updated product line (new iPads, MacBooks and of course the new OS X Mavericks). This event was chosen as one of the examples to start with.
3.1 Raw data
At first, we will take a look at the raw data from the database. In addition, a subset of 3 consecutive days is shown, which will be used as the example to describe the trending framework:

| Type | Full dataset | Example |
|---|---|---|
| Duration (hours) | 1462 | 72 |
| Tweets | 127,930,378 | 4,103,273 |
| Hashtags | 25,502,269 | 770,453 |
| Unique hashtags | 3,180,466 | 206,850 |
| Avg. tweets per day | 2,099,858 | 1,367,757 |
| Avg. length per tweet (char) | 56 | 54 |
| Avg. words per tweet | 9.5 | 9.1 |

Table 3.1: Facts about the dataset collected.
More statistics about the dataset are presented in Appendix A.
The plot of total tweets per hour in Figure 3.1, where the x-axis represents the 3 days (72 hours) and 0, 24, 48 and 72 are midnight (this also applies to the other plots in this chapter), clearly shows that the frequency of tweets follows the same day/night rhythm as humans, as expected.
[Plot: tweets per hour over the 72-hour example period]
Figure 3.1: Total tweets per hour
Hashtags are used to categorize or label a tweet with one word or phrase and can be used to spot the trends in the tweets. The full text of the tweets could also have been used; this option was tested and found to be generally too vague.
Figure 3.2 shows the total number of tweets for the chosen hashtags. As described before, Apple is one of them. The other two are the TV show "Pretty Little Liars" (#pll), which was shown the same day, and the hashtag "jobs", which companies use to announce job openings on Twitter.
These three hashtags represent different kinds of trends: one-time events, weekly recurring and daily recurring, which will be described later in this chapter.
[Plot: tweets per hour for #pll, #apple and #jobs over the 72-hour example period]
Figure 3.2: Raw tweet count for hashtags
Due to the quite large fluctuation in the total number of tweets during a day, a weight function has been created with the purpose of reducing the importance of tweets that are tweeted during the night. It is expressed as a sigmoid function¹ (another mathematical function, such as the hyperbolic tangent, could also have been used):
$$ w_t = \frac{1}{1 + \exp(-(m \times (|\text{tweets} \in t| - X)))} \tag{3.1} $$
where t defines a time period and m is the slope of the curve, which can be interpreted as how "expensive" it is to have a tweet count below the preferred amount X (see Figure 3.3).
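Equation 3.1 translates directly into code. The sketch below is illustrative; the function name is hypothetical, the values used for m and X in the usage lines are made up rather than the ones tuned in the project, and the guard against overflow is an addition of this sketch.

```python
import math

def tweet_weight(tweet_count, preferred_count, slope):
    """Sigmoid weight w_t of Equation 3.1 for a period with tweet_count tweets."""
    exponent = -(slope * (tweet_count - preferred_count))
    if exponent > 700:
        # math.exp would overflow; the weight is effectively zero here.
        return 0.0
    return 1.0 / (1.0 + math.exp(exponent))

# Hypothetical parameters: X = 20,000 tweets per hour, m = 0.0005.
for count in (5000, 20000, 35000):
    print(count, round(tweet_weight(count, 20000, 0.0005), 4))
```

Hours with fewer tweets than X get a weight below 0.5 and hours with more tweets a weight above 0.5; a larger m makes the transition between the two sharper.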
¹ en.wikipedia.org/wiki/Sigmoid_function/

[Plot: weighted tweet count (0 to 1.0) over the 72-hour example period]
Figure 3.3: Weighted tweet count per hour
3.2 Normalizing data
To get reasonable results from data which varies in volume (in this case, the total number of tweets per hour), it is highly recommended to normalize the data. In this project every hashtag is normalized by the total number of tweets in the time period t, which results in a normalized value for each hashtag:
$$ f_t = \frac{|\text{tweets} \ni \text{the hashtag}|}{|\text{tweets} \in t|} \tag{3.2} $$
Figure 3.4 shows the result of applying Equation 3.2 to the data in Figure 3.2. The hashtags seem to follow the same pattern, although there are some differences during the night.
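As a small illustration of Equation 3.2, the normalization can be applied hour by hour to two aligned lists of counts. The function name and the numbers are made up for the example; treating an hour without any tweets as frequency 0 is an assumption of this sketch.

```python
def normalize(hashtag_counts, total_counts):
    """Per-hour normalized frequency f_t of a hashtag (Equation 3.2)."""
    return [
        hashtag / total if total else 0.0  # assumption: empty hours give 0
        for hashtag, total in zip(hashtag_counts, total_counts)
    ]

# Made-up counts for three consecutive hours.
print(normalize([1800, 0, 50], [30000, 0, 1000]))
```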
[Plot: normalized frequency of #pll, #apple and #jobs over the 72-hour example period]
Figure 3.4: Normalized hashtags
3.3 Detecting trends
To detect a trend, it is important to know how the hashtag behaved in the previous time window (reference window r), before the current time window w. The sizes of w and r are tunable parameters of the framework.
Figure 3.5 shows example sizes for the reference window and the current window, which are 2 hours and 1 hour respectively.
[Plot: tweet count and trend score over time, with the reference window r (2 hours) and the current window w (1 hour) marked]
Figure 3.5: Example sizes of w and r.
To determine whether the hashtag is a trend, the normalized frequency of the reference window ($f_{t\_ref}$) is subtracted from that of the current window, to find out whether the interest has increased:
$$ f_{t\_ref} = \frac{\sum^{|r|} |\text{the hashtag} \in \text{tweets}|}{\sum^{|r|} |\text{tweets} \in t|} \tag{3.3} $$
where r is a list of reference windows. The outcome of this step can be seen in Figure 3.6. It clearly shows a large impact on the "jobs" hashtag, whose influence has dropped.
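A minimal sketch of this step, assuming hourly counts stored in two aligned lists, a current window w of one hour and a reference window of the r hours immediately before it. The function names and the example numbers are hypothetical.

```python
def reference_frequency(hashtag_counts, total_counts, t, r):
    """f_t_ref of Equation 3.3 over the r hours immediately before hour t."""
    start = max(0, t - r)
    hashtag_sum = sum(hashtag_counts[start:t])
    total_sum = sum(total_counts[start:t])
    return hashtag_sum / total_sum if total_sum else 0.0

def interest_change(hashtag_counts, total_counts, t, r):
    """f_t - f_t_ref: positive when interest rose compared to the reference."""
    current = hashtag_counts[t] / total_counts[t] if total_counts[t] else 0.0
    return current - reference_frequency(hashtag_counts, total_counts, t, r)

# Made-up counts: the hashtag jumps from 1% to 10% of all tweets in hour 2.
print(interest_change([10, 10, 100], [1000, 1000, 1000], t=2, r=2))
```

Summing the counts before dividing, as Equation 3.3 does, means that busy reference hours weigh more than quiet ones, which differs slightly from averaging the per-hour frequencies.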
[Plot: difference between current and reference window for #pll, #apple and #jobs over the 72-hour example period]
Figure 3.6: w - r, where r = 2 hours.
3.4 Recurring trends
The sizes of w (current window) and r (reference window) create problems with interfering recurring trends. Hashtags like "jobs" turn out as a trend each day, despite following the same pattern each day (see Figure 3.2).
These recurring trends can be daily, weekly or yearly, defined as:
Day - Examples of daily recurring trends are hashtags such as "jobs". This hashtag recurs each day, but not necessarily at the same time, and tests show that the amount tends to be a bit lower during the weekend.
Week - TV shows are a great example of trends which recur each week on the same day and time for as long as they are shown. Another type of weekly recurring trend is the natural difference between working weekdays and weekends.
Year - New Year's Eve, Christmas and Halloween are all examples of yearly recurring trends.
In this project the focus is on getting rid of the daily recurring trends. The issue is solved by subtracting the maximal value of the hashtag during the day before the time period t:
$$ \max(f_i), \quad i \in \{t - 24; t\} \tag{3.4} $$
This makes the framework sensitive to outliers. But in this project outliers are in fact what is being looked for: trends. Subtracting the maximum does not mean that a hashtag cannot be trending two days in a row; the number of tweets containing it just needs to rise.
Handling of the weekly and yearly recurring trends was not implemented, but the principle is the same.
3.5 Trend score
Combining Equations 3.1, 3.2, 3.3 and 3.4 gives the complete Equation 3.5 for calculating the trend_score of a hashtag at a given time t:
$$trend\_score = (f_t - f_{t\_ref} - \max(f_i)) \times w_t, \quad i \in \{t-24;\, t\} \qquad (3.5)$$
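Equation 3.5 can be read as a small function. The sketch below is a minimal illustration, not the project's implementation: the names are hypothetical, the hourly weight w_t and the reference fraction are taken as precomputed inputs, and the window i ∈ {t − 24; t} is assumed to cover the 24 hours before t, excluding t itself.

```python
from typing import Sequence


def trend_score(fractions: Sequence[float],
                weights: Sequence[float],
                f_ref: float,
                t: int,
                day: int = 24) -> float:
    """Equation 3.5 for one hashtag at hour t.

    fractions[i] is f_i, the share of tweets in hour i that carry the hashtag,
    weights[i] is the hourly weight w_i, and f_ref is the reference fraction.
    """
    if t < day:
        raise ValueError("need a full day of history before t")
    # Assumption: the maximum is taken over the `day` hours before t.
    daily_max = max(fractions[t - day:t])
    return (fractions[t] - f_ref - daily_max) * weights[t]
```

A hashtag whose share merely repeats the peak of the previous day therefore scores at or below zero, while a share clearly above both the reference window and the previous peak scores positive.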
Figure 3.7 shows the final result of the trending framework after applying Equation 3.5 to the data:
Figure 3.7: w - r, where r = 24 hours.
where the x-axis represents time (as in the other plots) and the y-axis represents the trend_score.
In this plot the importance of the hashtag "jobs" is reduced to the point where it is insignificant (a trend score below 0), while the other two turn out to be trending, which is exactly what we want.
Sometimes the framework produces more trends than Issuu needs. Because of this a threshold has been set as a limit: a hashtag is accepted as a trend if and only if its trend_score is above the threshold. The threshold is a tunable parameter.
Figure 3.8: Trend scores (weighted tweet count) for all trends on the 22nd of October. The red line displays the chosen threshold.
Figure 3.8 is a visual representation of the trending scores of the Appleevent day, the 22nd of October. The red line represents the threshold.
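The threshold rule can be sketched as a simple filter. The function name and the score dictionary are hypothetical and serve only as an illustration; the rule itself (accept a hashtag only if its trend_score is above the threshold) is the one described above.

```python
from typing import Dict, List


def accepted_trends(scores: Dict[str, float], threshold: float) -> List[str]:
    """Hashtags whose trend_score is strictly above the threshold, best first."""
    trending = [tag for tag, score in scores.items() if score > threshold]
    return sorted(trending, key=lambda tag: scores[tag], reverse=True)
```

Raising the threshold yields fewer trends, which is how the number of trends passed on to Issuu can be tuned.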
3.6 Aggregating trends
At Issuu it is not very likely that users come back each hour, but more likely once a day, so it would be unnecessary to recommend new documents each hour. A solution that aggregates trends over longer, tunable periods would be preferable. The most efficient and extensible solution is to extend the existing database (explained in Section C.3), making it able to store trends and references to the corresponding tweets.
Two new tables, trend and tweet_trend_relation, were added to the database, containing the trend and the time when it was trending (see Figure 3.9).
Figure 3.9: E/R Diagram v2
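How such tables allow aggregation over a tunable period can be sketched as follows. Only the two table names come from the text; the columns are hypothetical, and an in-memory SQLite database stands in for the project's database purely to keep the example self-contained.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical columns; only the table names are taken from the text.
SCHEMA = """
CREATE TABLE trend (
    id         INTEGER PRIMARY KEY,
    hashtag    TEXT NOT NULL,
    trended_at TEXT NOT NULL      -- hour in which the hashtag was trending
);
CREATE TABLE tweet_trend_relation (
    tweet_id   INTEGER NOT NULL,
    trend_id   INTEGER NOT NULL REFERENCES trend(id)
);
"""


def aggregate_trends(conn: sqlite3.Connection,
                     start: str,
                     end: str) -> List[Tuple[str, int]]:
    """Hashtags that trended in [start, end), with their number of trending hours."""
    query = (
        "SELECT hashtag, COUNT(*) AS hours FROM trend "
        "WHERE trended_at >= ? AND trended_at < ? "
        "GROUP BY hashtag ORDER BY hours DESC, hashtag"
    )
    return conn.execute(query, (start, end)).fetchall()
```

Because the period is just a pair of timestamps, the same query serves hourly, daily or weekly aggregation.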
Chapter 4
From trends to magazines
This chapter is about mapping the computed trends to Issuu, so that they can be presented to the users as similar magazines/documents. Two different approaches are investigated to solve this problem: Latent Dirichlet Allocation (LDA) and the Apache Solr search engine.
4.1 LDA
Latent Dirichlet Allocation (LDA) is a generative probabilistic model that makes it possible to automatically discover the topics in a document. A topic is defined by a probability distribution over the words in a fixed vocabulary, which means that each topic contains a probability for each word.
LDA can be expressed as a graphical model, known as the plate notation (Figure 4.1).
Figure 4.1: Plate notation of the LDA model [Ble09]
where,
Variable  Definition
D         The number of documents.
N         Total number of words in all documents.
W         The observed word n for the document d.
Z         Assigns the topics for the n’th word in the d’th document.
α         K dimensions per-document topic distribution vector, where K is the number of topics.
β         Y dimensions per-topic word distribution vector, where Y defines the number of words in the corpus.
θ         Topic proportions for the d’th document.
Table 4.1: Definition of LDA model parameters
Figure 4.2 is an example of a visual representation of an LDA space with three topics. A given document x has a probability of belonging to each topic, and these probabilities sum to 1. Each corner of the simplex corresponds to probability 1 for the given topic.
Figure 4.2: LDA topic simplex, with three dummy topics.
The topic distribution for a document can be visualized using a bar plot. This shows which topics are present in the document and thereby its underlying hidden (latent) structure:
Figure 4.3: Representation of a topic distribution using dummy data, where the x-axis represents the x topics and the y-axis represents the probability of belonging to the given topic.
At Issuu, the LDA model is trained on 4.5 million English Wikipedia (www.wikipedia.org) articles. LDA makes the assumption that all words in the same article are somehow related. Every article is unique in the sense that it has a unique distribution of words. This could be interpreted as a unique topic for each article, resulting in 4.5 million topics. This would be useless, since the goal is to make a model that finds similarities among documents, instead of declaring them all different. One of the main steps in LDA is dimensionality reduction, where the number of topics is reduced (in Issuu’s case to 150 topics), forcing similar topics to “merge” and reveal deeper underlying patterns.
4.2 Using LDA
All the tweets containing the trending hashtag are used as the data source for Issuu’s LDA model, instead of only the hashtag itself. A hashtag alone does not provide enough information and context to give a stable result from Issuu’s LDA model. The model is context-dependent and would not be able to differentiate between the fruit and the electronics company based on the hashtag #apple without any context.
To give an idea of the richness of the context behind the tweets from a single hashtag like #apple (see Figure 4.4), two tag clouds were generated. A tag cloud is a visual representation of text that emphasizes the most frequently used words by either color or size.
Figure 4.4: Left: Tag cloud for all words in the tweets containing #apple. Right: Same tag cloud after removing #apple, #free and #mavericks, to get a deeper understanding.
The flowchart (Figure 4.5) contains four steps, visualized as arrows and denoted with numbers. It shows the overall structure of how to turn trends from Twitter into similar magazines/documents using LDA.
[Flowchart: all tweets in the world, location filter (USA), tweets in USA, database, trend related tweets, Issuu’s LDA, topic distribution, similar documents]
Figure 4.5: From trend to magazines flowchart
The tweets corresponding to the trending hashtag are fed into the LDA model (step 1), which produces the topic distribution (step 2). Figure 4.6 shows the topic distribution for the #apple tweets, where it is easy to see that the software/electronics topic dominates, as expected.
[Bar plot of the 150 topic probabilities; the largest value is about 0.45 and the plot labels the dominant topic "Software".]
Figure 4.6: Topic distribution for #apple tweets
Using this LDA topic distribution in combination with the Jensen-Shannon divergence (Equation 4.1) (step 3), it is possible to find similar magazines to recommend from Issuu (step 4):
$$JSD(P \| Q) = \frac{1}{2} D(P \| M) + \frac{1}{2} D(Q \| M), \quad \text{where } M = \frac{1}{2}(P + Q) \qquad (4.1)$$
where P and Q are two probability distributions, which in Issuu’s case are two LDA topic distributions, and D is the Kullback-Leibler divergence.
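The Jensen-Shannon divergence and its use for ranking can be sketched in plain Python. This is a minimal illustration and not Issuu's implementation: the function names and the dictionary of document distributions are hypothetical, and base-2 logarithms are an assumption made so that the divergence lies between 0 (identical distributions) and 1.

```python
import math
from typing import Dict, List, Sequence, Tuple


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence D(p || q) in bits; 0 * log(0) counts as 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence between two topic distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def most_similar(trend: Sequence[float],
                 documents: Dict[str, Sequence[float]],
                 n: int = 10) -> List[Tuple[str, float]]:
    """The n documents whose topic distribution is closest to the trend's."""
    scored = [(doc_id, jsd(trend, dist)) for doc_id, dist in documents.items()]
    return sorted(scored, key=lambda pair: pair[1])[:n]
```

Since M is positive wherever P or Q is, the divergence stays finite even when the two distributions share no topics, which a plain Kullback-Leibler divergence would not guarantee.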
4.2.1 Results using LDA
Figure 4.7 is a subset of the magazines found to be similar, using LDA, to the tweets containing the hashtag #apple.
Figure 4.7: Subset of the similar #apple documents using LDA. NB: The documents are blurred due to copyright issues and the terms of service/privacy policy on Issuu; this applies to all figures that show magazine covers.
The resulting magazines range from learning material such as "... For Dummies" to magazines like "Computer Magazine" and "macworld". The popularity and the upload date on Issuu differ from magazine to magazine. These parameters could also be used to weight the documents.
4.3 Solr
Lucene is a Java-based high-performance text search engine library. A document in Lucene terms is not a document as we know it, but merely a collection of fields that describe the document. For any given document, these fields could be information like the title and the number of pages. Lucene uses a text analyzer, which tokenizes the data from a field into a series of words. After that a process called stemming is performed, which reduces each word to its stem/base. See Figure 4.8 for an example.
[Diagram: the text "Magazine recommendations based on social media trends" is tokenized into Magazine, recommendations, based, on, social, media, trends; stemming then yields recommend, base and trend, leaving Magazine, on, social and media unchanged.]
Figure 4.8: Example of tokenizing and stemming
Lucene uses term frequency-inverse document frequency (tf-idf) as a part of the scoring model. Here tf is the term frequency, a measure of how often a term appears in a document, while idf is the inverse document frequency, which decreases the more documents in the collection the term appears in, so that rare terms weigh more than common ones.
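The idea behind tf-idf can be illustrated with the textbook formula. This sketch is not Lucene's scoring function, which adds further factors that are not reproduced here; the function name and the representation of documents as token lists are assumptions made for the example.

```python
import math
from typing import Sequence


def tf_idf(term: str,
           document: Sequence[str],
           collection: Sequence[Sequence[str]]) -> float:
    """Textbook tf-idf of a term in one tokenized document of a collection."""
    if not document:
        return 0.0
    # Term frequency: share of the document's tokens that equal the term.
    tf = document.count(term) / len(document)
    containing = sum(1 for doc in collection if term in doc)
    if containing == 0:
        return 0.0
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log(len(collection) / containing)
    return tf * idf
```

A word that appears in every document gets an idf of zero, so it contributes nothing to the score no matter how often it occurs.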
Solr is an open source enterprise search server, used widely by services like Netflix (www.netflix.com). It is a web application built around Lucene, adding useful functionality such as geospatial search, replication and a web administration interface for configuration.
4.4 Using Solr
In this project an integration with Issuu’s Solr server has been created; the search engine is used through HTTP requests and JSON (json.org) responses. JSON is a human-readable open standard text format, which is mostly used to transfer data between servers and web applications, as in this project. An example of a request could be:
<base url>?q=apple+mavericks+free+new&
wt=json&debug=true&start=0&rows=50&
where,
Parameter  Description
base url   Address to access the Solr search engine.
debug      If true, the response will contain additional information, including scores and reasons for each document.
q          Main text to be queried in the request.
rows       Maximum number of results - used to paginate.
start      Used to define the current position, in combination with rows to paginate.
wt         The format of the response, like json.
Table 4.2: Explanation of parameters to Solr.
The q parameter is constructed using the x most frequent words in the tweets corresponding to the trending hashtag, where x is tunable.
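Constructing such a request from the tweets of a trending hashtag can be sketched as follows. The function name, the simple word splitting and the default values are assumptions made for illustration; a real integration would likely also filter stop words and links before counting.

```python
import re
from collections import Counter
from typing import Iterable
from urllib.parse import urlencode


def build_solr_url(base_url: str,
                   tweets: Iterable[str],
                   x: int = 4,
                   rows: int = 50,
                   start: int = 0) -> str:
    """Build a Solr request whose q holds the x most frequent words in the tweets."""
    words = Counter()
    for tweet in tweets:
        # Naive tokenization: lowercase runs of letters and digits, '#' dropped.
        words.update(re.findall(r"[a-z0-9]+", tweet.lower()))
    top_words = [word for word, _ in words.most_common(x)]
    params = {
        "q": " ".join(top_words),   # urlencode turns the spaces into '+'
        "wt": "json",
        "debug": "true",
        "start": start,
        "rows": rows,
    }
    return base_url + "?" + urlencode(params)
```

The base URL is left as an argument, since the address of Issuu's Solr server is not part of this text; paging through the results only requires increasing start in steps of rows.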
4.4.1 Results using Solr
The similar magazines produced by Solr (Figure 4.9) are more diverse than the similar documents from the LDA model (Figure 4.7). They range from Apple magazines and learning material to magazines about the surf spot "Mavericks" in California and the NBA (National Basketball Association) team Dallas Mavericks.
Figure 4.9: Subset of the similar #apple documents using Solr
Appendix B contains a complete example using the hashtag #bostonstrong, displaying the complete process from tweets to similar documents using both LDA and Solr.
Chapter 5
Conclusion
A prototype of an end-to-end solution, with the purpose of spotting location-based trends from a social media network and mapping them to Issuu, has been developed. Twitter was selected as the data source because it suited the requirements best, as described in Section 1.2.
Both of the results (LDA: Figure 4.7 and Solr: Figure 4.9) suggest that improvements could be made. Four improvements for the trending framework and three for the LDA model were identified:
• Trending Framework:
1. Support the existing 28 languages supported by Issuu. (Magazines written in other languages do not have an LDA topic distribution, because only those languages are incorporated into the translation framework and can therefore be translated to English.)
2. Capture "Slow" trends.
3. Non-parametric model.
4. Recurring weekly and yearly trends.
• LDA:
1. Limited by Issuu’s LDA model.
2. Wikipedia lacks certain topics.
3. Big magazines result in many topics.
5.1 Improvements of the trending framework
The USA was chosen as the location in this prototype, because most Issuu readers are from the USA and more than half of Twitter users are from that country too. In the future the goal is to support the existing 28 languages supported by Issuu (Figure 5.1).
[Figure content: English, Spanish, German, French, Portuguese, Russian, Arabic, Italian, Dutch, Turkish, Farsi, Polish, Indonesian, Swedish, Norwegian, Catalan, Czech, Hebrew, Danish, Finnish, Romanian, Hungarian, Croatian, Icelandic. Map legend: Supported / Not supported.]
Figure 5.1: List of languages supported by Issuu, including coloredworld map of countries speaking those languages.
A solution is to extend the existing end-to-end solution with Issuu’s translation framework (Figure 5.2), which will be able to translate all non-English tweets to English (1st improvement of the trending framework).
A tunable trending framework has been developed, which is capable of spotting "fast" trends on Twitter and reducing the importance of daily recurring trends. The hashtags #apple and #pll were found to be trending on the 22nd of October, among a vast number of unimportant recurring trends. It was the day Apple presented their updated product line, including the new OSX Mavericks (Apple Special Event: http://www.apple.com/apple-events/october-2013/), and the day the Halloween episode of the popular TV show "Pretty Little Liars (pll)" was shown.
[Flowchart: MySQL database, trends, "English?" decision (Yes / No), translation, Issuu’s LDA, topic distribution]
Figure 5.2: Translation module to improve the solution.
Hashtags like #happyhalloween, described in Appendix B along with #bostonstrong, were possible to spot during the 31st of October, but the ability to spot "slow" trends (described in Section 1.3) would need to be improved (2nd improvement of the trending framework). A possible solution is to run multiple instances of the framework simultaneously, with various sizes of the current window (w) and the reference window (r) (Figure 5.3).
[Diagram: tweets from the MySQL database feed three trending frameworks with window sizes 2h (fast), 6h (slow) and 12h (slow), each producing trends.]
Figure 5.3: Three simultaneously running trending frameworks.
Creating a new trending framework built on a non-parametric model (3rd improvement of the trending framework) [Nik12] would make the system more robust and faster at spotting trends. It would be more robust because parameters like the threshold are not defined up front, but learned from training data, which is then used to decide whether a new dataset is a trend or not.
A last improvement (4th) would be to implement the weekly and yearly recurring trends (as described in Section 3.4), which follow basically the same principle as the daily recurring trends.
5.2 Improvements of the LDA model
This project is limited by Issuu’s LDA model (1st improvement of the LDA model). The similar magazines found using the Apple tweets (Figure 4.7) include magazines about Microsoft too, because Apple is part of the overall software topic, which also includes Google, Microsoft etc. (see Figure 5.4).
[Figure content: top words of four topics about technology, computers and software (#98, #70, #1, #66), followed by the top words in the #apple tweets.]
undo software windows user server computer google microsoft domain web linux os program version browser files mac system open programming file computing download free apple …
user link added blacklist coibot reported resolves accounts users additions reporting involved report records wikipedia mentioned whitelist monitor interest monitorlist domain redlist org adding conflict …
data code internet local network service mobile users access digital computer services using ip available address web phone technology system mail online networks application via …
sat search admins poly cycle graph problem algorithm node logic rp step np arrow graphs edge problems interwiki optimization path arrows tree algorithms xwiki fpc …
Top words in #apple tweets: apple free mavericks osx new os today x mac event keynote pro available watching ipad macbook love oh iwork want black go looks good cook …
Figure 5.4: Top words in the topics.
Wikipedia is a great resource to use as a text corpus for training the LDA model, because it is free, broad enough to cover multiple topics, and very clean and focused in the sense that each article is about one topic. For example, a Wikipedia article about Italian food is very unlikely to write about technology or cars. The main disadvantage of using Wikipedia at Issuu is that Wikipedia does not cover all possible themes evenly (2nd improvement of the LDA model): it is overloaded on business and technology, while lacking in entertainment topics. That is one of the reasons why certain topics in Issuu’s LDA model are combined (for example American football and baseball), making it impossible for us to distinguish between them. This will be addressed when Issuu launches their new LDA model with more topics.
Magazines are often broad and big and therefore include many topics. This is a problem, because a single topic distribution is computed from the whole magazine. To address this problem (3rd improvement of the LDA model), a solution would be to compute LDA per page instead (Figure 5.5), which makes it possible to recommend single pages within a magazine.
[Diagram: a magazine is split into pages, each page gets its own topic distribution, and JSD against the topic distribution of the trend gives the similar pages.]
Figure 5.5: LDA per page solution.
5.3 LDA vs. Solr
The two approaches, LDA (Section 4.1) and Solr (Section 4.3), differ in the results they provide. One could argue that the quality of the LDA approach is better than that of Solr. Once both systems are launched live, A/B testing (en.wikipedia.org/wiki/A/B_testing) could be used to see which one fits Issuu better.
Appendix A
Dataset statistics
A.1 Location
Location            Count
Top 10:
Los Angeles         2,051,577
Texas               1,678,012
Georgia             1,539,281
New York            1,503,027
Manhattan           1,435,282
Chicago             1,280,852
Florida             1,273,262
Philadelphia        1,178,112
Ohio                1,116,043
South Carolina      1,078,361
Bottom 10:
Oberon              1
Macy's              1
Woodberry           1
Juntura             1
Conconully          1
German Valley       1
Unionville Center   1
Cedarbend           1
Funkley             1
Alicia              1
Figure A.1: Top and bottom 10 of used locations
A.2 Hashtag
Hashtag                  Count
Top 10:
job                      472,043
jobs                     412,111
tweetmyjobs              230,950
oomf                     133,661
wcw                      84,900
pdx                      76,820
veteranjob               68,357
mcm                      61,772
coupon                   56,949
nursing                  55,968
Analyzed in this project:
happyhalloween           13,274
bostonstrong             9,913
apple                    7,388
pll                      6,070
Bottom 10:
wrongconversation        1
billycorgan              1
justwantcheesecake       1
stalkerwife              1
travelquestions          1
uneedheadandshoulders    1
hdbros                   1
thoughtweweregonnamove   1
quietdownpeople          1
xodb                     1
Figure A.2: Top and bottom 10 of used hashtags, including the four hashtags analyzed in this project: happyhalloween, bostonstrong, apple and pll.
Appendix B
Example: #bostonstrong
This is a full example of how the trending framework works (part 1) and of the transformation from trends to magazines/documents on Issuu (part 2). Short facts about this subset of the dataset: 5,465,644 tweets containing 1,195,695 hashtags, of which 280,855 are unique.
The fluctuation of the total amount of tweets per hour follows the same pattern as described in Section 3.1, as expected:
total tweets hour Wx constants count bostonstrong count happyhalloween count jobs fraction bostonstrong fraction happyhalloween fraction jobs ref fraction bostonstrong ref fraction happyhalloween ref fraction jobs A bostonstrong A happyhalloween A jobs FINAL bostonstrong FINAL happyhalloween FINAL jobs
157,676 80000 65 8 15 0.000412237753 0.000050736954 0.000095131789
155,341 0.00009 178 10 15 0.001145866191 0.000064374505 0.000096561758
130,396 9 9 20 0.000069020522 0.000069020522 0.000153378938
90,009 5 8 44 0.000055550001 0.000088880001 0.000488840005
54,489 1 3 34 0.000018352328 0.000055056984 0.000623979152
29,956 0 2 44 0.000000000000 0.000066764588 0.001468820937
15,455 4 0 34 0.000258815917 0.000000000000 0.002199935296
10,737 1 1 46 0.000093135885 0.000093135885 0.004284250722
20,177 7 0 25 0.000346929672 0.000000000000 0.001239034544
34,251 5 2 98 0.000145981139 0.000058392456 0.002861230329
43,133 6 2 94 0.000139104630 0.000046368210 0.002179305868
58,313 2 3 94 0.000034297669 0.000051446504 0.001611990465
75,845 4 6 296 0.000052739139 0.000079108709 0.003902696288
85,045 4 5 1015 0.000047033923 0.000058792404 0.011934858016
89,980 4 8 691 0.000044454323 0.000088908646 0.007679484330
90,166 10 17 818 0.000110906550 0.000188541135 0.009072155802
90,364 2 6 807 0.000022132708 0.000066398123 0.008930547563
92,980 2 8 656 0.000021510002 0.000086040009 0.007055280706
94,524 3 3 289 0.000031737971 0.000031737971 0.003057424569
99,682 5 6 222 0.000050159507 0.000060191409 0.002227082121
108,332 7 5 251 0.000064616180 0.000046154414 0.002316951593
128,469 2 6 175 0.000015567958 0.000046703874 0.001362196328
149,266 4 14 148 0.000026797797 0.000093792290 0.000991518497
154,043 7 9 39 0.000045441857 0.000058425245 0.000253176061
154,824 0.998811839382 5 9 18 0.000032294735 0.000058130522 0.000116261045 0.000036266646 0.000075830259 0.000616532975 -0.000003971911 -0.000017699737 -0.000500271930 -0.001148471910 -0.000205995824 -0.012420355014
145,399 0.997229380851 1 10 26 0.000006877626 0.000068776264 0.000178818286 0.000038851674 0.000058277511 0.000184545452 -0.000031974048 0.000010498753 -0.000005727165 -0.001174576892 -0.000177549095 -0.011907502368
120,202 0.973869650226 1 6 9 0.000008319329 0.000049915975 0.000074873962 0.000019985144 0.000063286291 0.000146557725 -0.000011665815 -0.000013370316 -0.000071683763 -0.001127285290 -0.000196635434 -0.011692806643
89,740 0.706117162445 5 7 13 0.000055716514 0.000078003120 0.000144862937 0.000007530092 0.000060240737 0.000131776612 0.000048186422 0.000017762383 0.000013086325 -0.000775090524 -0.000120589808 -0.008418167598
47,567 0.051223735645 2 4 10 0.000042045956 0.000084091912 0.000210229781 0.000028579322 0.000061921864 0.000104790847 0.000013466634 0.000022170048 0.000105438934 -0.000058005736 -0.000008522149 -0.000605947036
26,504 0.008044895961 1 1 11 0.000037730154 0.000037730154 0.000415031693 0.000050980649 0.000080112449 0.000167507847 -0.000013250495 -0.000042382295 0.000247523846 -0.000009324973 -0.000001857755 -0.000094023387
14,263 0.002687829072 0 0 28 0.000000000000 0.000000000000 0.001963121363 0.000040501681 0.000067502801 0.000283511766 -0.000040501681 -0.000067502801 0.001679609597 -0.000003188754 -0.000000688202 -0.000027564355
10,032 0.001838215698 0 2 21 0.000000000000 0.000199362041 0.002093301435 0.000024529644 0.000024529644 0.000956656119 -0.000024529644 0.000174832397 0.001136645316 -0.000002151440 -0.000000025200 -0.000019849444
18,514 0.003935633580 2 0 15 0.000108026358 0.000000000000 0.000810197688 0.000000000000 0.000082321465 0.002016875900 0.000108026358 -0.000082321465 -0.001206678212 -0.000004084557 -0.000001066016 -0.000051720271
33,579 0.015099336268 8 4 71 0.000238244141 0.000119122070 0.002114416749 0.000070062355 0.000070062355 0.001261122399 0.000168181785 0.000049059715 0.000853294350 -0.000014762386 -0.000002106077 -0.000167324256
44,987 0.041045201687 12 3 99 0.000266743726 0.000066685931 0.002200635739 0.000191964371 0.000076785749 0.001650893594 0.000074779355 -0.000010099817 0.000549742145 -0.000043962975 -0.000008153258 -0.000467304377
56,699 0.109379979688 11 8 81 0.000194006949 0.000141095963 0.001428596624 0.000254563043 0.000089097065 0.002163785862 -0.000060556094 0.000051998898 -0.000735189237 -0.000131958445 -0.000014934987 -0.001385849511
70,860 0.305212030948 16 11 601 0.000225797347 0.000155235676 0.008481512842 0.000226186496 0.000108176150 0.001770155184 -0.000000389149 0.000047059526 0.006711357659 -0.000349850920 -0.000043181889 -0.001594275153
78,774 0.472442953055 15 7 978 0.000190418158 0.000088861807 0.012415263920 0.000211666758 0.000148950682 0.005346545520 -0.000021248600 -0.000060088874 0.007068718399 -0.000551395158 -0.000117463496 -0.002298973371
81,468 0.532982036922 9 10 1049 0.000110472824 0.000122747582 0.012876221338 0.000207172167 0.000120293516 0.010552414558 -0.000096699343 0.000002454066 0.002323806780 -0.000662265109 -0.000099181065 -0.005122517665
80,945 0.521249692402 9 11 966 0.000111186608 0.000135894743 0.011934029279 0.000149773468 0.000106089540 0.012649617454 -0.000038586859 0.000029805204 -0.000715588174 -0.000617395788 -0.000082741055 -0.006594041186
82,375 0.553234966201 10 21 868 0.000121396055 0.000254931715 0.010537177542 0.000110828567 0.000129299994 0.012406642325 0.000010567488 0.000125631720 -0.001869464784 -0.000628086940 -0.000034803688 -0.007637034058
84,356 0.596773691157 13 11 949 0.000154108777 0.000130399734 0.011249940727 0.000116336027 0.000195934362 0.011229488121 0.000037772750 -0.000065534628 0.000020452606 -0.000661281013 -0.000151625731 -0.007110203695
86,172 0.635406055845 7 20 789 0.000081232883 0.000232093952 0.009156106392 0.000137946753 0.000191925917 0.010897793452 -0.000056713869 0.000040168035 -0.001741687060 -0.000764126653 -0.000094277166 -0.008690159564
89,045 0.692971862621 19 11 254 0.000213375260 0.000123533045 0.002852490314 0.000117282792 0.000181788328 0.010191874648 0.000096092467 -0.000058255283 -0.007339384334 -0.000727463653 -0.000171022974 -0.013356507622
99,441 0.851913697642 30 22 178 0.000301686427 0.000221236713 0.001790006134 0.000148387428 0.000176923472 0.005952618753 0.000153298999 0.000044313241 -0.004162612618 -0.000845581587 -0.000122869719 -0.013713655731
114,784 0.958135861164 81 35 175 0.000705673265 0.000304920546 0.001524602732 0.000259966257 0.000175079316 0.002291947413 0.000445707007 0.000129841230 -0.000767344680 -0.000670847623 -0.000056242484 -0.012170435920
132,495 0.991203360961 129 26 146 0.000973621646 0.000196233820 0.001101928375 0.000518146808 0.000266075388 0.001647800210 0.000455474838 -0.000069841568 -0.000545871835 -0.000684318230 -0.000256109804 -0.012370941376
152,910 0.998588796229 355 55 35 0.002321627101 0.000359688706 0.000228892813 0.000849243163 0.000246684919 0.001298128834 0.001472383938 0.000113003787 -0.001069236021 0.000326056964 -0.000075430750 -0.012985742611
153,378 0.998646922963 208 46 61 0.001356126694 0.000299912634 0.000397710232 0.001695835742 0.000283807221 0.000634186507 -0.000339709049 0.000016105413 -0.000236476275 -0.002657735157 -0.000343118398 -0.013094955123
153,392 0.998648624464 2163 41 36 0.014101126526 0.000267289037 0.000234692813 0.001838139268 0.000329755002 0.000313430497 0.012262987258 -0.000062465965 -0.000078737684 0.009927925646 -0.000421584181 -0.012937452007
119,355 0.971858093513 397 87 48 0.003326211721 0.000728917934 0.000402161619 0.007728917430 0.000283600091 0.000316197803 -0.004402705709 0.000445317843 0.000085963816 -0.006535097264 0.000083219370 -0.012430315292
82,169 0.548648112120 116 89 48 0.001411724616 0.001083133542 0.000584161910 0.009385987747 0.000469299387 0.000307977723 -0.007974263131 0.000613834155 0.000276184187 -0.005648820738 0.000139436421 -0.006912986596
49,355 0.059633623157 38 44 36 0.000769932124 0.000891500355 0.000729409381 0.002545602509 0.000873345110 0.000476370060 -0.001775670384 0.000018155244 0.000253039321 -0.000244336694 -0.000020366878 -0.000752766079
27,905 0.009116147803 17 46 67 0.000609209819 0.001648450099 0.002401003404 0.001170888963 0.001011222286 0.000638666707 -0.000561679144 0.000637227812 0.001762336697 -0.000026284646 0.000002530088 -0.000101315815
15,213 0.002927046635 3 40 38 0.000197199763 0.002629330178 0.002497863669 0.000711881957 0.001164897748 0.001333160756 -0.000514682194 0.001464432430 0.001164702913 -0.000008302010 0.000003233636 -0.000034280161
10,524 0.001921280998 11 43 59 0.001045229951 0.004085898898 0.005606233371 0.000463843406 0.001994526648 0.002435177884 0.000581386544 0.002091372250 0.003171055487 -0.000003343491 0.000003327051 -0.000018646351
19,994 0.004493856737 33 98 43 0.001650495149 0.004901470441 0.002150645194 0.000543963943 0.003224929090 0.003768893033 0.001106531206 0.001676541351 -0.001618247840 -0.000005460467 0.000005917747 -0.000065136068
37,977 0.022268325137 58 242 134 0.001527240172 0.006372277958 0.003528451431 0.001441772069 0.004620224130 0.003342289796 0.000085468103 0.001752053828 0.000186161635 -0.000049795516 0.000031005639 -0.000282586375
47,841 0.052435557917 57 398 248 0.001191446667 0.008319224096 0.005183838130 0.001569750392 0.005865001466 0.003053250763 -0.000378303725 0.002454222630 0.002130587367 -0.000141572379 0.000109828055 -0.000563453312
60,822 0.151097373647 48 519 164 0.000789188123 0.008533096577 0.002696392753 0.001340045212 0.007457642919 0.004451280617 -0.000550857089 0.001075453658 -0.001754887864 -0.000434024817 0.000108150204 -0.002210722174
73,817 0.364364626367 50 537 333 0.000677350746 0.007274747009 0.004511155967 0.000966290274 0.008438935056 0.003791538978 -0.000288939528 -0.001164188047 0.000719616988 -0.000951198134 -0.000555246784 -0.004429436602
79,683 0.492867983759 30 575 615 0.000376491849 0.007216093772 0.007718082904 0.000727872310 0.007843195508 0.003691352431 -0.000351380461 -0.000627101736 0.004026730472 -0.001317439848 -0.000486357416 -0.004361630721
82,518 0.556413771330 29 570 698 0.000351438474 0.006907583800 0.008458760513 0.000521172638 0.007244299674 0.006175895765 -0.000169734164 -0.000336715874 0.002282864747 -0.001386227717 -0.000387489099 -0.005894289492
82,125 0.547667296341 21 576 952 0.000255707763 0.007013698630 0.011592085236 0.000363746216 0.007059142669 0.008094894606 -0.000108038454 -0.000045444039 0.003497190630 -0.001330648365 -0.000221877955 -0.005136588390
80,785 0.517655156915 13 484 826 0.000160920963 0.005991211240 0.010224670421 0.000303687372 0.006960514568 0.010021683278 -0.000142766409 -0.000969303328 0.000202987143 -0.001275706009 -0.000687959580 -0.006560365036
83,266 0.572960435379 11 474 808 0.000132106742 0.005692599620 0.009703840703 0.000208704192 0.006506660119 0.010914001596 -0.000076597450 -0.000814060499 -0.001210160893 -0.001374087783 -0.000672511855 -0.008070939696
82,325 0.552122454395 10 450 573 0.000121469784 0.005466140298 0.006960218646 0.000146295969 0.005839647427 0.009960317218 -0.000024826185 -0.000373507129 -0.003000098573 -0.001295529547 -0.000404813884 -0.008765672716
85,047 0.611644482390 9 494 236 0.000105823839 0.005808552918 0.002774936212 0.000126818487 0.005580013407 0.008339825232 -0.000020994648 0.000228539511 -0.005564889020 -0.001432851667 -0.000080216681 -0.011279403400
88,557 0.683549014676 18 656 151 0.000203258918 0.007407658344 0.001705116479 0.000113519585 0.005640130966 0.004833544440 0.000089739333 0.001767527379 -0.003128427961 -0.001525604685 0.000962326738 -0.010939962259
90,514 0.720362412173 18 643 90 0.000198864264 0.007103873434 0.000994321320 0.000155526370 0.006624271330 0.002229211308 0.000043337894 0.000479602104 -0.001234889988 -0.001641193909 0.000086381105 -0.010165114194
101,491 0.873712464645 15 657 87 0.000147796356 0.006473480407 0.000857218867 0.000201037577 0.007254105913 0.001345834892 -0.000053241221 -0.000780625506 -0.000488616026 -0.002074952055 -0.000996306741 -0.011677024993
115,342 0.960104553736 8 690 52 0.000069358950 0.005982209429 0.000450833174 0.000171870524 0.006770657014 0.000921850993 -0.000102511574 -0.000788447585 -0.000471017819 -0.002327426581 -0.001102330881 -0.012814745095
129,015 0.988006802870 14 618 23 0.000108514514 0.004790140681 0.000178273844 0.000106072415 0.006212154054 0.000641046335 0.000002442099 -0.001422013373 -0.000462772491 -0.013929596125 -0.009835716354 -0.011910281442
117,151 0.965894303881 9 552 25 0.000076823928 0.004711867590 0.000213399800 0.000090032207 0.005352823942 0.000306927978 -0.000013208279 -0.000640956352 -0.000093528178 -0.013632955591 -0.008861165468 -0.011287067434
93,636 0.773335144678 3 435 28 0.000032038959 0.004645649109 0.000299030287 0.000093432887 0.004752890326 0.000194990372 -0.000061393927 -0.000107241217 0.000104039915 -0.010952374803 -0.006681876878 -0.008884109190
71,741 0.322280762354 1 300 160 0.000013939031 0.004181709204 0.002230244909 0.000056929507 0.004682451954 0.000251438656 -0.000042990476 -0.000500742750 0.001978806253 -0.004558376810 -0.002911432625 -0.003098174879
50,220 0.064151868077 1 201 28 0.000019912386 0.004002389486 0.000557546794 0.000024187160 0.004444390695 0.001136796532 -0.000004274775 -0.000442001209 -0.000579249737 -0.000904887843 -0.000575769289 -0.000780813876
32,794 0.014083885541 0 112 48 0.000000000000 0.003415258889 0.001463682381 0.000016398685 0.004107870549 0.001541476374 -0.000016398685 -0.000692611660 -0.000077793993 -0.000198829609 -0.000129933819 -0.000164357243
20,598 0.004743713550 2 96 58 0.000097096806 0.004660646665 0.002815807360 0.000012046161 0.003770448358 0.000915508228 0.000085050645 0.000890198307 0.001900299132 -0.000066488249 -0.000036255720 -0.000045975057
14,780 0.002815488608 0 49 37 0.000000000000 0.003315290934 0.002503382950 0.000037458795 0.003895714714 0.001985316152 -0.000037458795 -0.000580423780 0.000518066798 -0.000039807026 -0.000025659013 -0.000031178773
20,730 0.004800132821 0 19 159 0.000000000000 0.000916546068 0.007670043415 0.000056532308 0.004098592346 0.002685284640 -0.000056532308 -0.003182046277 0.004984758775 -0.000067958643 -0.000056234242 -0.000031716045
33,437 0.014910453545 5 21 89 0.000149534946 0.000628046775 0.002661722044 0.000000000000 0.001914953534 0.005519571952 0.000149534946 -0.001286906760 -0.002857849907 -0.000208024558 -0.000146420704 -0.000215455087
43,474 0.036008064557 3 25 135 0.000069006763 0.000575056356 0.003105304320 0.000092307124 0.000738456994 0.004578433363 -0.000023300362 -0.000163400639 -0.001473129044 -0.000508593275 -0.000313144033 -0.000470453079
57,497 0.116575160226 4 35 138 0.000069568847 0.000608727412 0.002400125224 0.000104016331 0.000598093901 0.002912457256 -0.000034447483 0.000010633511 -0.000512332032 -0.001647856805 -0.000993507497 -0.001411074382
72,142 0.330213409302 3 27 326 0.000041584652 0.000374261872 0.004518865571 0.000069326836 0.000594230026 0.002703746620 -0.000027742184 -0.000219968154 0.001815118950 -0.004665541906 -0.002890379347 -0.003228485370
80,110 0.502474979786 4 39 680 0.000049931344 0.000486830608 0.008488328548 0.000053996097 0.000478251144 0.003579169849 -0.000004064752 0.000008579464 0.004909158700 -0.007087505702 -0.004283356564 -0.003358003376
84,234 0.594128736584 6 45 711 0.000071230145 0.000534226084 0.008440772135 0.000045976408 0.000433491842 0.006607466569 0.000025253737 0.000100734242 0.001833305566 -0.008362880516 -0.005009908780 -0.005797971436
86,018 0.632189190487 3 39 832 0.000034876421 0.000453393476 0.009672394150 0.000060847977 0.000511123010 0.008463953658 -0.000025971556 -0.000057729534 0.001208440492 -0.008930998700 -0.005431027405 -0.006564427965
84,815 0.606673333129 0 28 862 0.000000000000 0.000330130284 0.010163296587 0.000052862815 0.000493386274 0.009063035970 -0.000052862815 -0.000163255991 0.001100260616 -0.008586847890 -0.005275845198 -0.006365110212
85,221 0.615357720536 3 34 912 0.000035202591 0.000398962697 0.010701587637 0.000017561010 0.000392195887 0.009916116909 0.000017641581 0.000006766810 0.000785470727 -0.008666381193 -0.005246742850 -0.006649933671
85,339 0.617868296609 5 14 650 0.000058589859 0.000164051606 0.007616681705 0.000017643323 0.000364628667 0.010433084759 0.000040946537 -0.000200577061 -0.002816403054 -0.008687339460 -0.005396260054 -0.008902548116
84,941 0.609375994694 17 18 353 0.000200138920 0.000211911798 0.004155825809 0.000046904315 0.000281425891 0.009158067542 0.000153234605 -0.000069514094 -0.005002241734 -0.008499510513 -0.005242224434 -0.010112184503
86,959 0.651652295784 8 22 273 0.000091997378 0.000252992790 0.003139410527 0.000129198966 0.000187925769 0.005890298332 -0.000037201588 0.000065067020 -0.002750887805 -0.009213273974 -0.005518210901 -0.009346631311
91,068 0.730295040770 5 19 177 0.000054904028 0.000208635305 0.001943602583 0.000145433392 0.000232693426 0.003641652123 -0.000090529364 -0.000024058121 -0.001698049541 -0.010364095916 -0.006249247639 -0.009705719518
96,702 0.818048283021 7 18 166 0.000072387334 0.000186138860 0.001716613927 0.000073022631 0.000230302145 0.002527706471 -0.000000635297 -0.000044163286 -0.000811092544 -0.011535922047 -0.007016612704 -0.010146398287
105,291 0.906885813868 10 25 67 0.000094974879 0.000237437198 0.000636331690 0.000063907973 0.000197049582 0.001826702881 0.000031066907 0.000040387616 -0.001190371191 -0.012759937469 -0.007701917278 -0.011592228400
[Plot: tweets per hour over hours 1 to 72]
Figure B.1: Total tweets per hour
The three hashtags used in this example, #bostonstrong, #happyhalloween and #jobs, differ from the ones used to explain the trending framework in chapter 3. The hashtag #bostonstrong, used by fans of the baseball team Boston Red Sox, marks an unexpected event: the Red Sox became the champions of the World Series¹. In contrast, #happyhalloween is a hashtag used to celebrate Halloween, a yearly recurring event.
[Plot: tweet count per hour over hours 1 to 72 for #bostonstrong, #happyhalloween and #jobs]
Figure B.2: Raw tweet count for hashtags
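The intermediate columns of the data behind these figures can be reproduced from the raw counts. The following is a minimal Python sketch, inferred from the tabulated values and not taken from the thesis implementation. The logistic form of the Wx column, the use of 80000 and 0.00009 as its constants, and the pooled two-hour reference window are assumptions that happen to match the listed rows. All function names are hypothetical. The FINAL column (equation 3.5) is not reproduced here.

```python
import math

# Assumed constants, read from the "constants" column of the data table.
MIDPOINT = 80000      # total tweets per hour at which Wx equals 0.5
STEEPNESS = 0.00009   # slope of the assumed logistic curve


def volume_weight(total_tweets):
    """Wx column: assumed logistic function of the hour's total tweet volume."""
    return 1.0 / (1.0 + math.exp(-STEEPNESS * (total_tweets - MIDPOINT)))


def fraction(count, total_tweets):
    """Share of all tweets in one hour that carry the hashtag."""
    return count / total_tweets


def ref_fraction(prev_counts, prev_totals):
    """Pooled share of the hashtag over the preceding reference hours."""
    return sum(prev_counts) / sum(prev_totals)


def deviation(count, total_tweets, prev_counts, prev_totals):
    """Column A: current fraction minus the reference fraction."""
    return fraction(count, total_tweets) - ref_fraction(prev_counts, prev_totals)


# Three consecutive hours for #bostonstrong: (total tweets, hashtag count).
hours = [(154824, 5), (145399, 1), (120202, 1)]
total, count = hours[2]
prev_totals = [t for t, _ in hours[:2]]
prev_counts = [c for _, c in hours[:2]]

print(volume_weight(total))
print(fraction(count, total))
print(deviation(count, total, prev_counts, prev_totals))
```

Normalising by the hour's total volume keeps a hashtag from looking like a trend merely because overall Twitter traffic rises during the day. The deviation from the reference window highlights sudden changes such as the #bostonstrong spike.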
Applying the knowledge gained and the combined trend_score equation (3.5) yields the expected trend, #bostonstrong:
total tweets hour Wx constants count bostonstrong count happyhalloween count jobs fraction bostonstrong fraction happyhalloween fraction jobs ref fraction bostonstrong ref fraction happyhalloween ref fraction jobs A bostonstrong A happyhalloween A jobs FINAL bostonstrong FINAL happyhalloween FINAL jobs
157,676 80000 65 8 15 0.000412237753 0.000050736954 0.000095131789
155,341 0.00009 178 10 15 0.001145866191 0.000064374505 0.000096561758
130,396 9 9 20 0.000069020522 0.000069020522 0.000153378938
90,009 5 8 44 0.000055550001 0.000088880001 0.000488840005
54,489 1 3 34 0.000018352328 0.000055056984 0.000623979152
29,956 0 2 44 0.000000000000 0.000066764588 0.001468820937
15,455 4 0 34 0.000258815917 0.000000000000 0.002199935296
10,737 1 1 46 0.000093135885 0.000093135885 0.004284250722
20,177 7 0 25 0.000346929672 0.000000000000 0.001239034544
34,251 5 2 98 0.000145981139 0.000058392456 0.002861230329
43,133 6 2 94 0.000139104630 0.000046368210 0.002179305868
58,313 2 3 94 0.000034297669 0.000051446504 0.001611990465
75,845 4 6 296 0.000052739139 0.000079108709 0.003902696288
85,045 4 5 1015 0.000047033923 0.000058792404 0.011934858016
89,980 4 8 691 0.000044454323 0.000088908646 0.007679484330
90,166 10 17 818 0.000110906550 0.000188541135 0.009072155802
90,364 2 6 807 0.000022132708 0.000066398123 0.008930547563
92,980 2 8 656 0.000021510002 0.000086040009 0.007055280706
94,524 3 3 289 0.000031737971 0.000031737971 0.003057424569
99,682 5 6 222 0.000050159507 0.000060191409 0.002227082121
108,332 7 5 251 0.000064616180 0.000046154414 0.002316951593
128,469 2 6 175 0.000015567958 0.000046703874 0.001362196328
149,266 4 14 148 0.000026797797 0.000093792290 0.000991518497
154,043 7 9 39 0.000045441857 0.000058425245 0.000253176061
154,824 0.998811839382 5 9 18 0.000032294735 0.000058130522 0.000116261045 0.000036266646 0.000075830259 0.000616532975 -0.000003971911 -0.000017699737 -0.000500271930 -0.001148471910 -0.000205995824 -0.012420355014
145,399 0.997229380851 1 10 26 0.000006877626 0.000068776264 0.000178818286 0.000038851674 0.000058277511 0.000184545452 -0.000031974048 0.000010498753 -0.000005727165 -0.001174576892 -0.000177549095 -0.011907502368
120,202 0.973869650226 1 6 9 0.000008319329 0.000049915975 0.000074873962 0.000019985144 0.000063286291 0.000146557725 -0.000011665815 -0.000013370316 -0.000071683763 -0.001127285290 -0.000196635434 -0.011692806643
89,740 0.706117162445 5 7 13 0.000055716514 0.000078003120 0.000144862937 0.000007530092 0.000060240737 0.000131776612 0.000048186422 0.000017762383 0.000013086325 -0.000775090524 -0.000120589808 -0.008418167598
47,567 0.051223735645 2 4 10 0.000042045956 0.000084091912 0.000210229781 0.000028579322 0.000061921864 0.000104790847 0.000013466634 0.000022170048 0.000105438934 -0.000058005736 -0.000008522149 -0.000605947036
26,504 0.008044895961 1 1 11 0.000037730154 0.000037730154 0.000415031693 0.000050980649 0.000080112449 0.000167507847 -0.000013250495 -0.000042382295 0.000247523846 -0.000009324973 -0.000001857755 -0.000094023387
14,263 0.002687829072 0 0 28 0.000000000000 0.000000000000 0.001963121363 0.000040501681 0.000067502801 0.000283511766 -0.000040501681 -0.000067502801 0.001679609597 -0.000003188754 -0.000000688202 -0.000027564355
10,032 0.001838215698 0 2 21 0.000000000000 0.000199362041 0.002093301435 0.000024529644 0.000024529644 0.000956656119 -0.000024529644 0.000174832397 0.001136645316 -0.000002151440 -0.000000025200 -0.000019849444
18,514 0.003935633580 2 0 15 0.000108026358 0.000000000000 0.000810197688 0.000000000000 0.000082321465 0.002016875900 0.000108026358 -0.000082321465 -0.001206678212 -0.000004084557 -0.000001066016 -0.000051720271
33,579 0.015099336268 8 4 71 0.000238244141 0.000119122070 0.002114416749 0.000070062355 0.000070062355 0.001261122399 0.000168181785 0.000049059715 0.000853294350 -0.000014762386 -0.000002106077 -0.000167324256
44,987 0.041045201687 12 3 99 0.000266743726 0.000066685931 0.002200635739 0.000191964371 0.000076785749 0.001650893594 0.000074779355 -0.000010099817 0.000549742145 -0.000043962975 -0.000008153258 -0.000467304377
56,699 0.109379979688 11 8 81 0.000194006949 0.000141095963 0.001428596624 0.000254563043 0.000089097065 0.002163785862 -0.000060556094 0.000051998898 -0.000735189237 -0.000131958445 -0.000014934987 -0.001385849511
70,860 0.305212030948 16 11 601 0.000225797347 0.000155235676 0.008481512842 0.000226186496 0.000108176150 0.001770155184 -0.000000389149 0.000047059526 0.006711357659 -0.000349850920 -0.000043181889 -0.001594275153
78,774 0.472442953055 15 7 978 0.000190418158 0.000088861807 0.012415263920 0.000211666758 0.000148950682 0.005346545520 -0.000021248600 -0.000060088874 0.007068718399 -0.000551395158 -0.000117463496 -0.002298973371
81,468 0.532982036922 9 10 1049 0.000110472824 0.000122747582 0.012876221338 0.000207172167 0.000120293516 0.010552414558 -0.000096699343 0.000002454066 0.002323806780 -0.000662265109 -0.000099181065 -0.005122517665
80,945 0.521249692402 9 11 966 0.000111186608 0.000135894743 0.011934029279 0.000149773468 0.000106089540 0.012649617454 -0.000038586859 0.000029805204 -0.000715588174 -0.000617395788 -0.000082741055 -0.006594041186
82,375 0.553234966201 10 21 868 0.000121396055 0.000254931715 0.010537177542 0.000110828567 0.000129299994 0.012406642325 0.000010567488 0.000125631720 -0.001869464784 -0.000628086940 -0.000034803688 -0.007637034058
84,356 0.596773691157 13 11 949 0.000154108777 0.000130399734 0.011249940727 0.000116336027 0.000195934362 0.011229488121 0.000037772750 -0.000065534628 0.000020452606 -0.000661281013 -0.000151625731 -0.007110203695
86,172 0.635406055845 7 20 789 0.000081232883 0.000232093952 0.009156106392 0.000137946753 0.000191925917 0.010897793452 -0.000056713869 0.000040168035 -0.001741687060 -0.000764126653 -0.000094277166 -0.008690159564
89,045 0.692971862621 19 11 254 0.000213375260 0.000123533045 0.002852490314 0.000117282792 0.000181788328 0.010191874648 0.000096092467 -0.000058255283 -0.007339384334 -0.000727463653 -0.000171022974 -0.013356507622
99,441 0.851913697642 30 22 178 0.000301686427 0.000221236713 0.001790006134 0.000148387428 0.000176923472 0.005952618753 0.000153298999 0.000044313241 -0.004162612618 -0.000845581587 -0.000122869719 -0.013713655731
114,784 0.958135861164 81 35 175 0.000705673265 0.000304920546 0.001524602732 0.000259966257 0.000175079316 0.002291947413 0.000445707007 0.000129841230 -0.000767344680 -0.000670847623 -0.000056242484 -0.012170435920
132,495 0.991203360961 129 26 146 0.000973621646 0.000196233820 0.001101928375 0.000518146808 0.000266075388 0.001647800210 0.000455474838 -0.000069841568 -0.000545871835 -0.000684318230 -0.000256109804 -0.012370941376
152,910 0.998588796229 355 55 35 0.002321627101 0.000359688706 0.000228892813 0.000849243163 0.000246684919 0.001298128834 0.001472383938 0.000113003787 -0.001069236021 0.000326056964 -0.000075430750 -0.012985742611
153,378 0.998646922963 208 46 61 0.001356126694 0.000299912634 0.000397710232 0.001695835742 0.000283807221 0.000634186507 -0.000339709049 0.000016105413 -0.000236476275 -0.002657735157 -0.000343118398 -0.013094955123
153,392 0.998648624464 2163 41 36 0.014101126526 0.000267289037 0.000234692813 0.001838139268 0.000329755002 0.000313430497 0.012262987258 -0.000062465965 -0.000078737684 0.009927925646 -0.000421584181 -0.012937452007
119,355 0.971858093513 397 87 48 0.003326211721 0.000728917934 0.000402161619 0.007728917430 0.000283600091 0.000316197803 -0.004402705709 0.000445317843 0.000085963816 -0.006535097264 0.000083219370 -0.012430315292
82,169 0.548648112120 116 89 48 0.001411724616 0.001083133542 0.000584161910 0.009385987747 0.000469299387 0.000307977723 -0.007974263131 0.000613834155 0.000276184187 -0.005648820738 0.000139436421 -0.006912986596
49,355 0.059633623157 38 44 36 0.000769932124 0.000891500355 0.000729409381 0.002545602509 0.000873345110 0.000476370060 -0.001775670384 0.000018155244 0.000253039321 -0.000244336694 -0.000020366878 -0.000752766079
27,905 0.009116147803 17 46 67 0.000609209819 0.001648450099 0.002401003404 0.001170888963 0.001011222286 0.000638666707 -0.000561679144 0.000637227812 0.001762336697 -0.000026284646 0.000002530088 -0.000101315815
15,213 0.002927046635 3 40 38 0.000197199763 0.002629330178 0.002497863669 0.000711881957 0.001164897748 0.001333160756 -0.000514682194 0.001464432430 0.001164702913 -0.000008302010 0.000003233636 -0.000034280161
10,524 0.001921280998 11 43 59 0.001045229951 0.004085898898 0.005606233371 0.000463843406 0.001994526648 0.002435177884 0.000581386544 0.002091372250 0.003171055487 -0.000003343491 0.000003327051 -0.000018646351
19,994 0.004493856737 33 98 43 0.001650495149 0.004901470441 0.002150645194 0.000543963943 0.003224929090 0.003768893033 0.001106531206 0.001676541351 -0.001618247840 -0.000005460467 0.000005917747 -0.000065136068
37,977 0.022268325137 58 242 134 0.001527240172 0.006372277958 0.003528451431 0.001441772069 0.004620224130 0.003342289796 0.000085468103 0.001752053828 0.000186161635 -0.000049795516 0.000031005639 -0.000282586375
47,841 0.052435557917 57 398 248 0.001191446667 0.008319224096 0.005183838130 0.001569750392 0.005865001466 0.003053250763 -0.000378303725 0.002454222630 0.002130587367 -0.000141572379 0.000109828055 -0.000563453312
60,822 0.151097373647 48 519 164 0.000789188123 0.008533096577 0.002696392753 0.001340045212 0.007457642919 0.004451280617 -0.000550857089 0.001075453658 -0.001754887864 -0.000434024817 0.000108150204 -0.002210722174
73,817 0.364364626367 50 537 333 0.000677350746 0.007274747009 0.004511155967 0.000966290274 0.008438935056 0.003791538978 -0.000288939528 -0.001164188047 0.000719616988 -0.000951198134 -0.000555246784 -0.004429436602
79,683 0.492867983759 30 575 615 0.000376491849 0.007216093772 0.007718082904 0.000727872310 0.007843195508 0.003691352431 -0.000351380461 -0.000627101736 0.004026730472 -0.001317439848 -0.000486357416 -0.004361630721
82,518 0.556413771330 29 570 698 0.000351438474 0.006907583800 0.008458760513 0.000521172638 0.007244299674 0.006175895765 -0.000169734164 -0.000336715874 0.002282864747 -0.001386227717 -0.000387489099 -0.005894289492
82,125 0.547667296341 21 576 952 0.000255707763 0.007013698630 0.011592085236 0.000363746216 0.007059142669 0.008094894606 -0.000108038454 -0.000045444039 0.003497190630 -0.001330648365 -0.000221877955 -0.005136588390
80,785 0.517655156915 13 484 826 0.000160920963 0.005991211240 0.010224670421 0.000303687372 0.006960514568 0.010021683278 -0.000142766409 -0.000969303328 0.000202987143 -0.001275706009 -0.000687959580 -0.006560365036
83,266 0.572960435379 11 474 808 0.000132106742 0.005692599620 0.009703840703 0.000208704192 0.006506660119 0.010914001596 -0.000076597450 -0.000814060499 -0.001210160893 -0.001374087783 -0.000672511855 -0.008070939696
82,325 0.552122454395 10 450 573 0.000121469784 0.005466140298 0.006960218646 0.000146295969 0.005839647427 0.009960317218 -0.000024826185 -0.000373507129 -0.003000098573 -0.001295529547 -0.000404813884 -0.008765672716
85,047 0.611644482390 9 494 236 0.000105823839 0.005808552918 0.002774936212 0.000126818487 0.005580013407 0.008339825232 -0.000020994648 0.000228539511 -0.005564889020 -0.001432851667 -0.000080216681 -0.011279403400
88,557 0.683549014676 18 656 151 0.000203258918 0.007407658344 0.001705116479 0.000113519585 0.005640130966 0.004833544440 0.000089739333 0.001767527379 -0.003128427961 -0.001525604685 0.000962326738 -0.010939962259
90,514 0.720362412173 18 643 90 0.000198864264 0.007103873434 0.000994321320 0.000155526370 0.006624271330 0.002229211308 0.000043337894 0.000479602104 -0.001234889988 -0.001641193909 0.000086381105 -0.010165114194
101,491 0.873712464645 15 657 87 0.000147796356 0.006473480407 0.000857218867 0.000201037577 0.007254105913 0.001345834892 -0.000053241221 -0.000780625506 -0.000488616026 -0.002074952055 -0.000996306741 -0.011677024993
115,342 0.960104553736 8 690 52 0.000069358950 0.005982209429 0.000450833174 0.000171870524 0.006770657014 0.000921850993 -0.000102511574 -0.000788447585 -0.000471017819 -0.002327426581 -0.001102330881 -0.012814745095
129,015 0.988006802870 14 618 23 0.000108514514 0.004790140681 0.000178273844 0.000106072415 0.006212154054 0.000641046335 0.000002442099 -0.001422013373 -0.000462772491 -0.013929596125 -0.009835716354 -0.011910281442
117,151 0.965894303881 9 552 25 0.000076823928 0.004711867590 0.000213399800 0.000090032207 0.005352823942 0.000306927978 -0.000013208279 -0.000640956352 -0.000093528178 -0.013632955591 -0.008861165468 -0.011287067434
93,636 0.773335144678 3 435 28 0.000032038959 0.004645649109 0.000299030287 0.000093432887 0.004752890326 0.000194990372 -0.000061393927 -0.000107241217 0.000104039915 -0.010952374803 -0.006681876878 -0.008884109190
71,741 0.322280762354 1 300 160 0.000013939031 0.004181709204 0.002230244909 0.000056929507 0.004682451954 0.000251438656 -0.000042990476 -0.000500742750 0.001978806253 -0.004558376810 -0.002911432625 -0.003098174879
50,220 0.064151868077 1 201 28 0.000019912386 0.004002389486 0.000557546794 0.000024187160 0.004444390695 0.001136796532 -0.000004274775 -0.000442001209 -0.000579249737 -0.000904887843 -0.000575769289 -0.000780813876
32,794 0.014083885541 0 112 48 0.000000000000 0.003415258889 0.001463682381 0.000016398685 0.004107870549 0.001541476374 -0.000016398685 -0.000692611660 -0.000077793993 -0.000198829609 -0.000129933819 -0.000164357243
20,598 0.004743713550 2 96 58 0.000097096806 0.004660646665 0.002815807360 0.000012046161 0.003770448358 0.000915508228 0.000085050645 0.000890198307 0.001900299132 -0.000066488249 -0.000036255720 -0.000045975057
14,780 0.002815488608 0 49 37 0.000000000000 0.003315290934 0.002503382950 0.000037458795 0.003895714714 0.001985316152 -0.000037458795 -0.000580423780 0.000518066798 -0.000039807026 -0.000025659013 -0.000031178773
20,730 0.004800132821 0 19 159 0.000000000000 0.000916546068 0.007670043415 0.000056532308 0.004098592346 0.002685284640 -0.000056532308 -0.003182046277 0.004984758775 -0.000067958643 -0.000056234242 -0.000031716045
33,437 0.014910453545 5 21 89 0.000149534946 0.000628046775 0.002661722044 0.000000000000 0.001914953534 0.005519571952 0.000149534946 -0.001286906760 -0.002857849907 -0.000208024558 -0.000146420704 -0.000215455087
43,474 0.036008064557 3 25 135 0.000069006763 0.000575056356 0.003105304320 0.000092307124 0.000738456994 0.004578433363 -0.000023300362 -0.000163400639 -0.001473129044 -0.000508593275 -0.000313144033 -0.000470453079
57,497 0.116575160226 4 35 138 0.000069568847 0.000608727412 0.002400125224 0.000104016331 0.000598093901 0.002912457256 -0.000034447483 0.000010633511 -0.000512332032 -0.001647856805 -0.000993507497 -0.001411074382
72,142 0.330213409302 3 27 326 0.000041584652 0.000374261872 0.004518865571 0.000069326836 0.000594230026 0.002703746620 -0.000027742184 -0.000219968154 0.001815118950 -0.004665541906 -0.002890379347 -0.003228485370
80,110 0.502474979786 4 39 680 0.000049931344 0.000486830608 0.008488328548 0.000053996097 0.000478251144 0.003579169849 -0.000004064752 0.000008579464 0.004909158700 -0.007087505702 -0.004283356564 -0.003358003376
84,234 0.594128736584 6 45 711 0.000071230145 0.000534226084 0.008440772135 0.000045976408 0.000433491842 0.006607466569 0.000025253737 0.000100734242 0.001833305566 -0.008362880516 -0.005009908780 -0.005797971436
86,018 0.632189190487 3 39 832 0.000034876421 0.000453393476 0.009672394150 0.000060847977 0.000511123010 0.008463953658 -0.000025971556 -0.000057729534 0.001208440492 -0.008930998700 -0.005431027405 -0.006564427965
84,815 0.606673333129 0 28 862 0.000000000000 0.000330130284 0.010163296587 0.000052862815 0.000493386274 0.009063035970 -0.000052862815 -0.000163255991 0.001100260616 -0.008586847890 -0.005275845198 -0.006365110212
85,221 0.615357720536 3 34 912 0.000035202591 0.000398962697 0.010701587637 0.000017561010 0.000392195887 0.009916116909 0.000017641581 0.000006766810 0.000785470727 -0.008666381193 -0.005246742850 -0.006649933671
85,339 0.617868296609 5 14 650 0.000058589859 0.000164051606 0.007616681705 0.000017643323 0.000364628667 0.010433084759 0.000040946537 -0.000200577061 -0.002816403054 -0.008687339460 -0.005396260054 -0.008902548116
84,941 0.609375994694 17 18 353 0.000200138920 0.000211911798 0.004155825809 0.000046904315 0.000281425891 0.009158067542 0.000153234605 -0.000069514094 -0.005002241734 -0.008499510513 -0.005242224434 -0.010112184503
86,959 0.651652295784 8 22 273 0.000091997378 0.000252992790 0.003139410527 0.000129198966 0.000187925769 0.005890298332 -0.000037201588 0.000065067020 -0.002750887805 -0.009213273974 -0.005518210901 -0.009346631311
91,068 0.730295040770 5 19 177 0.000054904028 0.000208635305 0.001943602583 0.000145433392 0.000232693426 0.003641652123 -0.000090529364 -0.000024058121 -0.001698049541 -0.010364095916 -0.006249247639 -0.009705719518
96,702 0.818048283021 7 18 166 0.000072387334 0.000186138860 0.001716613927 0.000073022631 0.000230302145 0.002527706471 -0.000000635297 -0.000044163286 -0.000811092544 -0.011535922047 -0.007016612704 -0.010146398287
105,291 0.906885813868 10 25 67 0.000094974879 0.000237437198 0.000636331690 0.000063907973 0.000197049582 0.001826702881 0.000031066907 0.000040387616 -0.001190371191 -0.012759937469 -0.007701917278 -0.011592228400
[Figure: panels showing trend score, tweet count and tweets per hour for #bostonstrong, #happyhalloween and #jobs, plotted over hours 1 to 72]
Figure B.3: Fully processed data
As can be seen in figure B.3, #happyhalloween is actually spotted as a trend, even though the settings in use are those for spotting fast trends, which are the focus at the moment, as described in chapter 3. Nevertheless, #happyhalloween satisfies the requirements for being considered a slow trend, as described in section 1.3.
1 en.wikipedia.org/wiki/World_Series
Two tag clouds of the top 100 words in all the tweets containing the hashtag #bostonstrong have been created to give an overall perspective of the semantic richness of the data.
Figure B.4: Top: Tag cloud for all words in the tweets containing #bostonstrong. Bottom: Tag cloud for the rest of the words, after removing #bostonstrong, to get a deeper understanding.
The documents similar to those tweets, computed using LDA and Solr, can be seen in figure B.5 and figure B.6. Although not all documents are about baseball or the Boston Red Sox, the overall impression of the documents is positive.
The documents computed using Solr are clearly different from those computed using LDA; magazines/documents about the city of Boston are among the selected ones too.
Figure B.5: Subset of the similar #bostonstrong documents using LDA
Figure B.6: Subset of the similar #bostonstrong documents using Solr
Appendix C
Implementation details
C.1 Flask
The existing Python code is used as the data provider for the website, in order to reuse as much of the code as possible. It could probably have been optimized by writing it all in JavaScript, but this is just a debug tool. A web application framework supporting execution of Python code is therefore preferable.
Flask is a lightweight Python web application micro-framework, which means that the core is kept simple but extensible. Flask can serve the application locally or make it reachable from other machines on the network, and it allows a custom port number (the default is 5000). In addition, it provides the decorator app.route(), which acts as a URL trigger: as soon as a requested URL matches the one defined in the route decorator, the attached function is executed.
The interactive tweet-map needs three different app.route() handlers:
/ : The website's index page, which loads the HTML that displays the dialog shown in Figure ?? and discussed in the associated section.
/get_tweets/. . . : Retrieves the tweets from the database that match the settings specified in the initial dialog of the website. This URL needs three parameters: country_code, timestamp and duration. After the tweets have been received, they are plotted on the map.
If no tweets are available for the given time period and country, an error message is displayed and a new time period can be selected.
/related_documents : When the tweets are loaded and displayed on the map, the sidebar appears, where it is possible to see the documents related to the tweets on the map. This URL is triggered by the sidebar and calculates all the documents, using the trending framework in combination with LDA and Solr, which will be discussed later in the report.
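The three routes above can be sketched as follows. This is a minimal illustration and not the project's actual code: the URL layout of /get_tweets/ (three path segments), the in-memory TWEETS list standing in for the database, the duration being measured in seconds, and the empty result of /related_documents are all assumptions made for the example.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical in-memory stand-in for the tweet database.
TWEETS = [
    {"country_code": "US", "timestamp": 1383200000,
     "text": "#happyhalloween", "latitude": 42.36, "longitude": -71.06},
]


@app.route("/")
def index():
    # The real index page loads the HTML with the settings dialog.
    return "<html><body>tweet-map settings dialog</body></html>"


@app.route("/get_tweets/<country_code>/<int:timestamp>/<int:duration>")
def get_tweets(country_code, timestamp, duration):
    # Select tweets from the country inside [timestamp, timestamp + duration).
    end = timestamp + duration
    hits = [t for t in TWEETS
            if t["country_code"] == country_code
            and timestamp <= t["timestamp"] < end]
    if not hits:
        # Lets the page show an error so a new time period can be chosen.
        return jsonify(error="No tweets for this period and country"), 404
    return jsonify(tweets=hits)


@app.route("/related_documents")
def related_documents():
    # Placeholder: the real handler would call the trending framework
    # together with LDA and Solr.
    return jsonify(documents=[])
```

Calling app.run(port=5000) would start Flask's local development server on the default port.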
C.2 Peewee
Peewee¹, an ORM (object-relational mapping) module for Python that supports the chosen database, was selected.
ORM: A programming technique for converting objects between incompatible type systems in object-oriented programming languages. For database use, it creates a "virtual object database" which synchronizes the state of the objects in the programming language with the database tables. ORM modules/libraries usually also provide a simple abstraction on top of the SQL language.
¹ peewee.readthedocs.org: open-source under the MIT (Massachusetts Institute of Technology) license, which means that all code is free to use and free to change/modify however best fits the implementation.
Besides providing an extended abstraction of the SQL language on top of the database connection, Peewee also provides a Python script which makes it possible to inspect a database on a running MySQL server and auto-generate the corresponding Python classes, including tables, foreign keys etc.
C.3 Database
There are many different databases to choose from, and each has its own advantages. What was preferred in this project is a regular relational database, which is easy and fast to set up using predefined scripts. The database is not the main focus of this project, and is mainly used as a place to store the data, instead of keeping it in memory.
C.4 MySQL
Figure C.1: E/R Diagram
By examining the JSON result returned from the Twitter service and investigating the different analysis models and frameworks for achieving the goal of the project (described in section 1.1), a solution for the table(s) of the database was found.
It turned out that all the information needed could fit into a single table, named tweet (see figure C.1).
The desired information from the tweet consists of attributes such as the text, time stamp and longitude/latitude, which make it possible to analyze the tweets over time and by location.
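The following sketch shows how a single tweet table with these attributes could support analysis over time and by location. It is an assumption-laden illustration: the column names and types are guesses, not the schema in figure C.1; time stamps are assumed to be Unix seconds; and SQLite from the Python standard library replaces MySQL so that the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
conn.execute("""
    CREATE TABLE tweet (
        id        INTEGER PRIMARY KEY,
        text      TEXT NOT NULL,
        timestamp INTEGER NOT NULL,  -- Unix time in seconds (assumed)
        latitude  REAL,
        longitude REAL
    )
""")

# Made-up example rows.
rows = [
    ("Go Sox! #bostonstrong", 1383177600, 42.35, -71.10),
    ("#bostonstrong parade", 1383179000, 42.36, -71.06),
    ("#happyhalloween", 1383181300, 42.36, -71.05),
    ("#jobs in Copenhagen", 1383177700, 55.68, 12.57),
]
conn.executemany(
    "INSERT INTO tweet (text, timestamp, latitude, longitude) "
    "VALUES (?, ?, ?, ?)",
    rows,
)


def tweets_per_hour(conn, lat_min, lat_max, lon_min, lon_max):
    """Count tweets per hour bucket inside a latitude/longitude box."""
    query = """
        SELECT timestamp / 3600 AS hour_bucket, COUNT(*)
        FROM tweet
        WHERE latitude BETWEEN ? AND ?
          AND longitude BETWEEN ? AND ?
        GROUP BY hour_bucket
        ORDER BY hour_bucket
    """
    return conn.execute(query, (lat_min, lat_max, lon_min, lon_max)).fetchall()


# Rough bounding box around Boston (illustrative values).
boston = tweets_per_hour(conn, 42.2, 42.5, -71.2, -70.9)
```

Each returned pair is an hour bucket (hours since the Unix epoch) and the number of tweets from the bounding box in that hour.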
Bibliography
[Bee12] Beevolve. An exhaustive study of twitter users across the world. http://www.beevolve.com/twitter-statistics/, October 2012. Online; last read 6. January 2014.
[Ble09] David M. Blei. Topic models. http://videolectures.net/mlss09uk_blei_tm/, November 2009. Online Video Lecture; last viewed 26. December 2013.
[BNG11] H. Becker, M. Naaman, and L. Gravano. Beyond trending topics: Real-world event identification on twitter. In Fifth International AAAI Conference on Weblogs and Social Media, 2011.
[BNJ03] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003.
[CDCS10] Mario Cataldi, Luigi Di Caro, and Claudio Schifanella. Emerging topic detection on twitter based on temporal and social terms evaluation. In Proceedings of the Tenth International Workshop on Multimedia Data Mining, MDMKDD ’10, pages 4:1–4:10, New York, NY, USA, 2010. ACM.
[IHS06] Alexander Ihler, Jon Hutchins, and Padhraic Smyth. Adaptive event detection with time-varying poisson processes. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, pages 207–216, New York, NY, USA, 2006. ACM.
[Jac88] R. Jackson. The matthew effect in science. International Journal of Dermatology, 27(1):16–16, 1988.
[KLPM10] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web, WWW ’10, pages 591–600, New York, NY, USA, 2010. ACM.
[Nik12] Stanislav Nikolov. Trend or no trend: A novel nonparametric method for classifying time series. Master’s thesis, Massachusetts Institute of Technology, Massachusetts, USA, September 2012.
[RIS+94] Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl. Grouplens: An open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, CSCW ’94, pages 175–186, New York, NY, USA, 1994. ACM.
[ŘS10] Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.
[Sag12] Jeff Saginor. Study finds facebook users more private than ever. http://www.digitaltrends.com/web/study-finds-facebook-users-more-private-than-ever/, February 2012. Online; last read 23. August 2013.
[Sal89] Gerard Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1989.
[SM95] Upendra Shardanand and Pattie Maes. Social information filtering: Algorithms for automating “word of mouth”. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’95, pages 210–217, New York, NY, USA, 1995. ACM Press/Addison-Wesley Publishing Co.
[Wal05] Thomas Vander Wal. Explaining and showing broad and narrow folksonomies, 2005.