Learning-to-Rank for Real-Time High-Precision Hashtag Recommendation for Streaming News

Bichen Shi (bichen.shi@insight-centre.org)
Georgiana Ifrim (georgiana.ifrim@insight-centre.org)
Neil Hurley (neil.hurley@insight-centre.org)

Insight Centre for Data Analytics, University College Dublin, Ireland

ABSTRACT

We address the problem of real-time recommendation of streaming Twitter hashtags to an incoming stream of news articles. The technical challenge can be framed as large-scale topic classification where the set of topics (i.e., hashtags) is huge and highly dynamic. Our main applications come from digital journalism, e.g., promoting original content to Twitter communities and social indexing of news to enable better retrieval and story tracking. In contrast to the state-of-the-art that focuses on topic modelling approaches, we propose a learning-to-rank approach for modelling hashtag relevance. This enables us to deal with the dynamic nature of the problem, since a relevance model is stable over time, while a topic model needs to be continuously retrained. We present the data collection and processing pipeline, as well as our methodology for achieving low-latency, high-precision recommendations. Our empirical results show that our method outperforms the state-of-the-art, delivering more than 80% precision. Our techniques are implemented in a real-time system that is currently under user trial with a big news organisation.

Keywords

learning-to-rank; dynamic topics; social indexing; news; hashtag recommendation

1. INTRODUCTION

Social media platforms such as Twitter have taken a central role in the consumption, production and dissemination of news [35]. Twitter has about 240 million active users and receives more than 500 million tweets a day, a quarter of which are tagged with hashtags [31]. Hashtags are keyword-based tags describing the content of a tweet; for example, #taiwan, #transasia and #ge235 were used for tweets describing a recent plane crash in Taiwan. They tend to appear spontaneously around breaking news or developing news stories, and are a way for news followers to connect to a particular story and community, to get updates in real-time (e.g., #parisattacks). News organisations use hashtags to target Twitter communities in order to promote original content and engage readers. Journalists sometimes introduce new hashtags, but the Twitter crowd is the one that most often creates and selects a few of the possibly many competing hashtags, thus echoing the current social discourse (e.g., #migrant, #refugee, #refugeeswelcome).

However, an automatic approach to real-time, high-precision hashtag recommendation for news is currently missing, and both journalists and news readers have to invest significant effort to manually search for relevant hashtags. Most existing approaches use topic modelling [20, 15], considering hashtags as topics and mapping news articles to topics using content similarity, regardless of whether users actively engage with those hashtags. As the relevant hashtags change quickly (some die off and new ones emerge), and the news and Twitter environments are highly dynamic, such approaches need continuous retraining to adapt to new content. In addition, since most existing approaches are trained on static collections of noisy tweets, they do not achieve high enough precision for practical use (e.g., precision of 38% [6]), and cannot deliver recommendations in real-time.

In this paper, we model the problem of recommending highly specific, actively used hashtags to a stream of news articles as an Information Retrieval (IR) learning-to-rank (L2R) problem. In our framework, a news article plays the role of the query in classic IR, and the hashtags (represented by tweets using those tags) play the role of documents. For incoming articles, we split the recommendation process into two steps: (1) a pre-ranking step based on automatic query formulation, to connect each article to the hashtag stream and retrieve candidate hashtags, and (2) a pre-trained L2R approach to score the relevance of the candidate article-hashtag pairs. From our analysis, the model of what makes a hashtag relevant to an article does not change over time. This means that we can use a small amount of human-annotated data to train the L2R model once, and at test time compute features to describe and score the relevance of new article-hashtag pairs. Thus, we can recommend new hashtags that were not previously seen in the training set, since the relevance model is not hashtag specific, but encodes general characteristics (learned from the human-labeled training data) of relevant versus irrelevant pairs. Classic IR approaches typically assume a static document collection and dynamic user queries, while in our case both documents and tags come as streams, and the set of relevant tags for a given document changes over time. For example, tweets discussing the news story "Plane crashes in southern France" are first tagged with "#planecrash #france". Over time, the hashtags for this story become more specific, such as "#germanwings #4u9525 #a320", driven by the usage of Twitter users. To address the dynamic relevance of article-hashtag pairs, we investigate a set of low-cost, time-aware features.

We work with real-world data collected from existing RSS news feeds, which we connect to relevant Twitter streams, and build a live demo system to test our framework[1]. To give a feel for the data scale: over a time period of 12 months, for about 1,000 news articles processed each day, we averaged more than 1 million tweets and 26,000 hashtags per day (used in at least 10 tweets). Many of these hashtags may only be relevant for 24h or 48h, e.g., #illridewithyou, #icantbreathe, #worldcancerday, while longer stories running over weeks or months have a more stable set of hashtags, e.g., #ebola, #ebolavaccine, #grexit.

Contributions. We summarise our main contributions as follows:

1. We formulate real-time hashtag recommendation for streaming news as a L2R problem, and show that the L2R framework is more effective than topic modelling for our dynamic problem setting.

2. We investigate time-aware features for high-precision, low-latency hashtag relevance ranking.

3. We conduct a real-life study of the impact of our recommendations on news engagement and discuss applications to social indexing of news (a form of real-time crowdsourced tagging of news).

2. RELATED WORK

Hashtag Recommendation for Tweets. Prior work focusing on hashtag recommendation for tweets relies on topic modelling on static datasets. The work of [7, 22] builds Naïve Bayes, KNN or SVM classifiers for hashtags, where a hashtag is seen as a category and the tweets tagged with that hashtag as labeled data for that category. Hashtag recommendation for tweets can be adapted to recommendation for news, by treating the news headline as a rich tweet. As we show in our experiments, this approach is overwhelmed by the data scale, sparsity and noise characteristics of tweets.

Most other approaches focus on topic modelling with PLSA and LDA [20, 15, 12, 6, 16]. For example, [6] fits an LDA model to a set of tweets in order to recommend hashtags. They combine the LDA model with a translation model to address the vocabulary gap between tweets and hashtags. LDA-type approaches face serious challenges regarding both scalability and accuracy of recommendation: either hashtags that are too general are recommended, e.g., #news, #life, or ones that are not used at all by Twitter users, since the focus is on recommending hashtags solely driven by the topic of tweets [6]. In addition, these models need to be constantly retrained to adapt to newly emerging hashtags, which makes the process more time consuming.

Hashtag Recommendation for News. There is little prior work focusing specifically on hashtag recommendation for news. The approach in [33] relies on a manual user query to retrieve related articles, which are then clustered to create a topic profile. Similarly, a hashtag profile is created from tweets collected from a set of manually selected accounts. This approach then recommends hashtags whose profile is similar to a topic/cluster profile, without regard to user engagement with the hashtag, since the experiments are done on a static collection. The work in [27] proposes an approach that updates hashtag recommendations once daily, while the emphasis in our work is on real-time recommendation.

[1] Insight4news: http://insight4news.ucd.ie

Real-time Tag Recommendation. Also related to our work are recent approaches to real-time tag recommendation for streaming scientific documents and webpages [29, 28]. In that work, the set of tags is assumed to be static and fairly small, which facilitates a lot of pre-processing steps. In our scenario, both articles and tags continuously stream into the system, and the set of hashtags is very large and dynamic (i.e., hashtags have a variable relevance lifecycle), which makes the problem more challenging.

Learning to Rank. In classic IR learning-to-rank (L2R) approaches, a ranked list of documents is returned for a user query. The set of documents is typically assumed to be static, which allows for clever indexing. The set of queries is dynamic, and a pre-processing step is used to produce an initial document ranking, followed by a re-ranking step using machine learning [17, 26]. Depending on the input representation and loss function, learning-to-rank algorithms can be categorised [19] as pointwise [18, 23], pairwise [1, 30, 10] and listwise [34, 25, 2]. Although listwise/pairwise approaches are commonly used, they are not suitable for our problem setting, due to the nature of our data and efficiency constraints. On average, there are about 1-5 relevant hashtags for each article (as identified by our labelling study involving journalists), so the key concern is to quickly identify the relevant few, in every time slot. From an efficiency point of view, the computational complexity of listwise and pairwise approaches is usually high [2, 11], making them less suitable for a real-time setting. Pointwise approaches were shown to be efficient and well suited for binary relevance labels [18, 26], therefore we take this approach in our work. We show how to model our problem in a L2R framework for dynamic settings. We compare our method to existing topic modelling approaches: Naïve Bayes [7], Support Vector Machines [36] and Latent Dirichlet Allocation [6] (see Section 4.5).

3. LEARNING-TO-RANK FOR REAL-TIME HASHTAG RECOMMENDATION

In this section we discuss the proposed L2R framework and the methodology for computing time-aware features for the relevance model.

3.1 Learning-to-Rank Approach

In an IR setting, a system maintains a collection of documents D. Given a query q, the system retrieves a subset of documents d ∈ D_q from the collection, ranks the documents by a global ranking model f(q, d), and returns the top ranked documents. The f(q, d) model is constructed automatically using supervised machine learning techniques with labelled ranking data [13].

In our hashtag recommendation setting, the query q is extracted from an individual article a ∈ A, where A is a stream of news. The document collection is a stream of hashtags H, extracted from a stream of tweets T. For the reasons stated in Section 2, we take a pointwise L2R approach by transforming the ranking problem into a classification problem [13, 18]: First, a subset of hashtags H_a is retrieved for article a through a hashtag-sharding method explained in Section 3.2. Then, for each article-hashtag pair (a, h), h ∈ H_a, we create a feature vector x, with label y ∈ {0, 1} (y = 1 if the hashtag is relevant for the article). Given m training examples M = {(x_i, y_i)}, i = 1, 2, ..., m, we construct a global classifier f(x) = y to predict y for any feature vector x of an arbitrary article-hashtag pair.

Table 1: Example article text used for extracting keyphrases.

Headline: Vladimir Putin in good health, insists Kremlin
Subheadline: Spokesman says Russian president's handshake is strong enough to 'break hands'
First Sentence: Kremlin spokesman Dmitry Peskov said on Thursday that president Vladimir Putin is in good health, but could not say when he would next appear in public.

To address the dynamic aspect of our problem (i.e., hashtags and articles come as streams, and the relevance of hashtags to articles is time-dependent since the hashtag representation changes with the arrival of new tweets), we extract time-aware features x_t, as shown in Section 3.3. We write f(x_t) = y_t to denote that the feature vector depends on time, while the classification function f does not. We employ two sliding time windows to transform the dynamic environment into a static one at the current time point t_n. The global time window γ = [t_n − 24h, t_n] corresponds to the past 24h from the current time t_n, while the local time window λ = [t_a − 4h, t_n], with t_a the publishing time of article a, is an article-dependent time window, where t_a ≤ t_n. The local time window restricts the computation of features to a local (in time) tweet subset. The choice of window parameters is justified empirically and by the application domain, e.g., in the news life-cycle most news items either get updated (and become new articles) or are ignored after 24h [3]. We explain how we use these time windows in Sections 3.2 and 3.3.
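To make the two windows concrete, here is a minimal sketch in Python (hypothetical helper names; not the authors' implementation):

```python
from datetime import datetime, timedelta

def global_window(t_n: datetime):
    """Global window gamma = [t_n - 24h, t_n]."""
    return (t_n - timedelta(hours=24), t_n)

def local_window(t_a: datetime, t_n: datetime):
    """Local window lambda = [t_a - 4h, t_n], t_a = article publishing time."""
    assert t_a <= t_n, "the article must already be published"
    return (t_a - timedelta(hours=4), t_n)

def in_window(ts: datetime, window) -> bool:
    """True if a tweet timestamp falls inside the given window."""
    start, end = window
    return start <= ts <= end
```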

3.2 Sharding the Tag Stream

Similar to query sharding in classic IR, we identify a set of tweets associated with each article, which we call the article's tweet-bag, T_a. All hashtags contained in T_a form the article's tag shard, H_a. Starting at some initial time t_0, at each time-step t_n the system carries out the following actions, where the interval between time-steps is 5 minutes:

1. Read RSS feeds, download articles, and extract keyphrases from each article (query formulation).

2. Pool the keyphrases for all articles published within the global time window γ, and retrieve a corresponding stream of tweets T.

3. If a retrieved tweet from T contains at least one keyphrase of an article a, append it to T_a.

4. From each tweet-bag T_a, extract the hashtags and assign them to the article shard H_a.

5. Compute the feature vector x_t of each article-hashtag pair (a, h). Feed x_t to the relevance classifier to get the hashtag recommendation f(x_t).

Table 2: Example process for extracting article-keyphrases.

Original keywords | Paired keywords   | Ranked keyphrases
dmitry peskov     | dmitry peskov     | putin vladimir
vladimir putin    | putin vladimir    | dmitry peskov
russian           | russian spokesman | kremlin russian
spokesman         | kremlin russian   | health kremlin
kremlin           | health russian    | russian spokesman
health            | kremlin spokesman | health russian
                  | health kremlin    | kremlin spokesman
                  | health spokesman  | health spokesman

Sharding the tag stream enables the retrieval of tags likely to be relevant to the article, as well as quick computation of feature vectors for the article-hashtag pairs.
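A minimal sketch of the sharding itself (steps 3 and 4), assuming the keyphrases are already extracted (step 1) and a tweet stream already retrieved (step 2); the Article and Tweet containers here are hypothetical, not the authors' data model:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Article:
    id: str
    keyphrases: list  # top-5 term pairs from query formulation (step 1)

@dataclass
class Tweet:
    text: str
    hashtags: list

def shard_tag_stream(articles, tweets):
    """Build each article's tweet-bag T_a (step 3) and tag shard H_a (step 4)."""
    tweet_bags = defaultdict(list)
    tag_shards = defaultdict(set)
    for tw in tweets:
        text = tw.text.lower()
        for a in articles:
            # a tweet joins T_a if it contains at least one keyphrase of article a
            if any(kp in text for kp in a.keyphrases):
                tweet_bags[a.id].append(tw)
                tag_shards[a.id].update(tw.hashtags)
    return tweet_bags, tag_shards
```

The resulting pairs (a, h), h ∈ H_a, are then scored by the relevance classifier in step 5.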

We investigate several methods for keyphrase extraction and show the impact of 3 such methods in our experiments (Section 4.2). The goal is to extract article-keyphrases that maximize the number of retrieved tweets (a form of tweet Recall) and the content similarity of the retrieved tweets to that article (a form of tweet Precision). The procedure for extracting keyphrases is as follows. Since news articles are written in an inverted pyramid form (i.e., the article focus is presented at the beginning), we focus on the pseudo-article formed of the headline, subheadline and first sentence of each article, and tokenize and POS-tag that text. In the best performing method, only nouns are selected, giving priority to proper nouns over common nouns, as a light form of entity detection. Single keywords are then paired, and long proper nouns are broken down into term-pairs. This step is important for avoiding the retrieval of noisy tweets. Finally, these pairs are ranked by the average tf-idf of the individual terms, with term-frequency computed from the article body and inverse-document-frequency computed from the article collection within γ. The top-5 pairs are used as the keyphrases of the article. Tables 1 and 2 show an example article text and the procedure for extracting keyphrases.

The tf-idf ranking extracts keyphrases that reflect the main article focus. For example, if several city names are extracted (New York, London, Paris) but the article's main focus is on "London business", then "London business" will be ranked before "Paris business". Limiting to the top-5 pairs achieves a good trade-off between scalability and the quality of the retrieved tweet set.
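The following sketch illustrates the best-performing variant (POS-tag + tf-idf). It uses NLTK for tokenisation and tagging (an assumption; the paper does not name a tool) and simplifies a few details, e.g., it does not break long proper-noun phrases into term pairs and caps the candidate nouns at 10:

```python
import math
from itertools import combinations
import nltk  # assumes 'punkt' and 'averaged_perceptron_tagger' are downloaded

def extract_keyphrases(pseudo_article, article_body, collection, top_k=5):
    """POS-tag + tf-idf keyphrase extraction (a sketch)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(pseudo_article))
    # light entity detection: prefer proper nouns (NNP*) over common nouns (NN*)
    proper = [w.lower() for w, t in tagged if t.startswith('NNP')]
    common = [w.lower() for w, t in tagged
              if t.startswith('NN') and not t.startswith('NNP')]
    nouns = list(dict.fromkeys(proper + common))  # dedupe, keep priority order

    body = article_body.lower().split()
    freqs = {t: body.count(t) for t in set(body)}
    top = max(freqs.values())

    def tfidf(w):
        tf = 0.4 + 0.6 * freqs.get(w, 0) / top                    # Equation 1
        df = sum(1 for doc in collection if w in doc.lower())
        return tf * (math.log(len(collection) / df) if df else 0.0)  # Equation 2

    # pair single keywords into 2-gram phrases, rank by average tf-idf
    pairs = combinations(nouns[:10], 2)
    ranked = sorted(pairs, key=lambda p: -(tfidf(p[0]) + tfidf(p[1])) / 2)
    return [' '.join(p) for p in ranked[:top_k]]
```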

3.3 Time-Aware Features

Given an article a and the corresponding article-shard H_a, we form article-hashtag pairs (a, h), h ∈ H_a, and for each pair create a feature vector x_{a,h}. Since the relevance of a hashtag to an article is time-dependent, we extract time-aware features to describe the article-hashtag pair, and write the classification function as f(x_{a,h,t}) = y_t.

We build on prior feature engineering work on Twitter and news data [4, 24] to investigate useful features and adapt them to our local and global time windows λ and γ. One important aspect of feature engineering for L2R is that features need to be comparable across queries, because we aim to learn a single ranking function for all queries using the same set of features. Additionally, all features have to be normalized at query level to deal with different candidate set sizes and the variance between queries.

We identify five classes of features: Local, Global, Trending, Headline and User. Four of them reflect properties of the hashtag, while the fifth reflects social network characteristics of users. Considering the real-time and high-precision requirements of our approach, we only use low-cost features.

Bag-of-words Representation. A tf-idf bag-of-words representation is formed from the text in each pseudo-article (headline, subheadline, first sentence) as a vector a:

a = tf(w, a) × idf(w, A)

where tf(w, a) is the term frequency of the term w within the whole article, defined as in [21]:

tf(w, a) = 0.4 + (1 − 0.4) · freq(w, a) / max{freq(w′, a) : w′ ∈ a}    (1)

The inverse document frequency is computed from the article collection A, gathered in the time window γ:

idf(w, A) = log( |A| / |{a ∈ A : w ∈ a}| )    (2)

Similarly, given any tweet-bag T′ ⊆ T, we form a bag-of-words representation as a vector h(T′), whose components are the term frequencies of all terms occurring in the tweets in T′: tf(w, T′).

Local similarity LS_{a,h,λ}: Compares the article text to a local hashtag tweet-bag via the cosine similarity, as shown in Equation 3. Let T_{a,h,λ} be the subset of tweets in T_a that mention h within time window λ; ‖·‖ denotes the L2 norm.

LS_{a,h,λ} = (a · h(T_{a,h,λ})) / (‖a‖ ‖h(T_{a,h,λ})‖)    (3)

The local similarity is an important content feature that indicates how relevant a hashtag is to an article.

Local hashtag frequency LF_{a,h,λ}: Captures the local popularity of usage for a given hashtag in the article tweet-bag T_a within time window λ.

LF_{a,h,λ} = (|T_{a,h,λ}| − min{|T_{a,h′,λ}|}) / (max{|T_{a,h′,λ}|} − min{|T_{a,h′,λ}|})    (4)

LF′_{a,h,λ} = (log |T_{a,h,λ}| − min{log |T_{a,h′,λ}|}) / (max{log |T_{a,h′,λ}|} − min{log |T_{a,h′,λ}|})    (5)

where h′ ∈ H_a. We include both the absolute size of the tweet-bag and the log of its size as separate features, normalised using min/max feature scaling, as shown in Equations 4 and 5. The local frequency feature compares all hashtags from the same set H_a, and indicates whether a hashtag is dominating the topic.

Global similarity GS_{a,h,γ}: Distinguishes between general and topic-specific hashtags. It builds on similar equations as the local similarity, but now the article bag-of-words representation is compared with the whole hashtag tweet-bag T_h within the global window γ:

GS_{a,h,γ} = (a · h(T_{h,γ})) / (‖a‖ ‖h(T_{h,γ})‖)    (6)

General hashtags like #news may seem relevant to an article when looking at only the article tweet-bag T_a, but if we consider all tweets in T_h, #news is irrelevant since it is used with all news stories. A topic-specific hashtag should maintain a high global similarity score to the article.

Global hashtag frequency GF_{h,γ}: Captures the global popularity of usage for a given hashtag. Let |T_{h,γ}| denote the number of tweets in T_h within the global time window γ. GF_{h,γ} is computed as in Equations 4-5, after replacing |T_{a,h,λ}| by |T_{h,γ}|. A globally popular hashtag usually indicates breaking news, with more news articles published on that topic, which increases the probability of such a hashtag being relevant to an article.

Trending hashtag TR_{a,h,t_n}: Captures a significant increase in local hashtag frequency and aims to identify article-wise trending hashtags. In order to separate emerging topic-specific hashtags (e.g., #charliehebdo, #jesuischarlie for a recent terrorist attack in France) from hashtags with a high general usage rate (e.g., #news, #breaking), being able to identify trending hashtags early on is very important. Therefore, it is not enough to only capture the current hashtag frequency; we also need to check how quickly it is increasing.

Given the time window W_n = t_n − t_{n−1}, and |T_{a,h,W_n}|, the number of tweets mentioning h in tweet stream T_a within window W_n:

TR_{a,h,t_n} = (|T_{a,h,W_n}| − |T_{a,h,W_{n−1}}|) / |T_{a,h,W_{n−1}}|    (7)

Expected gain EG_{a,h,W_n}: Captures the potential of h in the near future (a few minutes later), and is expected to boost trending hashtags while punishing fading ones. Based on the trending feature TR_{a,h,t_n}, we compute the expected number of tweets in T_a mentioning h in the next time window W_{n+1}, denoted E(|T_{a,h,W_{n+1}}|):

EG_{a,h,W_n} = E(|T_{a,h,W_{n+1}}|) = (1 + TR_{a,h,t_n}) · |T_{a,h,W_n}|    (8)

We create two features, the absolute expected gain and the log of this value, with min/max scaling as in Equations 4-5.
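Both features in code (per-window counts assumed precomputed; the zero-denominator guard is our assumption, since the paper does not specify it):

```python
def trending(count_now, count_prev):
    """Equation 7: relative growth of |T_{a,h,W_n}| over |T_{a,h,W_{n-1}}|."""
    if count_prev == 0:
        return float(count_now)  # assumed handling when h was absent in W_{n-1}
    return (count_now - count_prev) / count_prev

def expected_gain(count_now, count_prev):
    """Equation 8: expected tweet count for the next window W_{n+1}."""
    return (1.0 + trending(count_now, count_prev)) * count_now
```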

Hashtag in headline HE_{a,h}: After observing user behaviour and trending hashtags over time, one can notice that many hashtags literally reflect their topic. They are a variation of the name of the people/place/event being discussed. It could be an acronym (e.g., #cwc2015 for the Cricket World Cup 2015), or concatenated names (e.g., #sydneysiege for the Sydney hostage attack). Although this is not always the case (e.g., #carrythemhome for England Rugby, #icantbreathe for Eric Garner's death), being able to use such information may help the classifier. We define HE_{a,h} as a binary feature equal to 1 if the hashtag appears in the pseudo-article (headline, subheadline, first sentence) after removing the space between terms.
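A sketch of the matching step; stripping all non-alphanumeric characters is our assumption about how "removing space between terms" is implemented:

```python
import re

def hashtag_in_headline(hashtag: str, pseudo_article: str) -> int:
    """HE_{a,h}: 1 if the hashtag occurs in the space-stripped pseudo-article."""
    squashed = re.sub(r'[^a-z0-9]', '', pseudo_article.lower())
    return int(hashtag.lstrip('#').lower() in squashed)

# e.g. '#sydneysiege' matches a headline containing 'Sydney siege'
assert hashtag_in_headline('#sydneysiege', 'Sydney siege: police storm cafe') == 1
```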

Unique user ratio UR_{a,h,λ}: The ratio of unique Twitter users using h in T_a within time window λ, to the number of tweets. The function User(T) returns the set of users in tweet stream T.

UR_{a,h,λ} = |User(T_{a,h,λ})| / |T_{a,h,λ}|    (9)

Noise filtering is extremely important for Twitter hashtag recommendation, because there are many spam users and twitter-bots posting spam tweets with self-created hashtags. The unique user ratio can help the classifier separate genuinely popular hashtags from spammy hashtags, something that local/global frequency cannot achieve.

User credibility UC_{a,h,λ}: The quality of a hashtag depends on the users using it. A commonly used Twitter user credibility indicator is the number of followers. Users with more followers are usually celebrities, domain experts and experienced users that work hard to attract followers. Therefore, we define user credibility via the maximum, the average and the median of the follower counts of the users tagging h in the article tweet-bag T_a within λ:

MaxF_{a,h,λ} = max{Follower(u) : u ∈ User(T_{a,h,λ})}    (10)

UC_max is the min/max scaled MaxF_{a,h,λ}; we compute UC_avg and UC_median using a similar approach.
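Both user features in one sketch; the tweet records with user_id and followers fields are assumptions about the data model, not the authors' schema:

```python
from statistics import mean, median

def unique_user_ratio(tweet_bag):
    """Equation 9: distinct users over tweet count in T_{a,h,lambda}."""
    if not tweet_bag:
        return 0.0
    users = {t['user_id'] for t in tweet_bag}
    return len(users) / len(tweet_bag)

def user_credibility(tweet_bag):
    """Max/avg/median follower counts over the distinct users tagging h
    (Equation 10 and its variants), before min/max scaling across candidates."""
    if not tweet_bag:
        return 0, 0.0, 0
    followers = {t['user_id']: t['followers'] for t in tweet_bag}
    counts = list(followers.values())
    return max(counts), mean(counts), median(counts)
```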

Cost of features. As shown by the previous equations, most of our features are based on the local article tweet-bag T_a, which is fairly small; hence they are cheap to compute. The experiment in Section 4.5 shows that the execution time for feature computation, the most time-consuming step in our approach, grows linearly with the number of tweets for the entire article collection.

4. EVALUATION

In this section we discuss our methodology for gathering labeled data and show extensive experiments analysing our techniques in comparison to the state-of-the-art (SOTA).

4.1 Gathering labeled data

As discussed in the previous sections, we model hashtag recommendation as a L2R problem via a relevance classification approach. Here we describe the process of gathering labeled data for the classifier. We define three classes of relevance for each article-hashtag pair:

• A hashtag is specifically on the topic of the news article. For example, for articles describing the recent Germanwings plane crash, #germanwings, #4u9525 and #a320 fall in this category.

• A hashtag is generally on the topic of the news article. For example, for the same story, #barcelona and #france fall in this category.

• Irrelevant hashtags, including off-topic and spammy hashtags. For example, #news, #bbc and #breaking fall in this category.

One way to gather cheap labels is to use the tweets that contain both article URLs and hashtags, and consider those hashtags as relevant labels. However, in our initial experiments, we found such data too scarce and too noisy. Most tweets with hashtags, although relevant to the article, do not contain the article URL. Additionally, a quarter of the tweets with article URLs are tagged with #news or #breaking, which are too general, while many others are tagged with spam hashtags or a mixture of relevant and irrelevant hashtags. It is therefore difficult to directly use this data for training, so instead we decided to collect high-quality labels by involving manual annotators in a real-time labelling exercise.

In order to gather relevance labels quickly, we have implemented our methods in a system that can be accessed via a Web interface[2]. We continuously track the RSS news feeds of 7 news organizations: Reuters, BBC, Irish Times, Irish Independent, Irish Examiner, RTE, and The Journal, publishing around 900-1,000 articles each day. Using the methods described above, we extract article-keyphrases and retrieve tweets using the Twitter Streaming API, updating the tweet-bag of each article every 5 mins over a 24h period.

[2] The Insight4News system for gathering labeled data: insight4news.ucd.ie/

Table 3: Details on the labeled data pairs.

Total | Positive    | Negative    | Collection Period
1238  | 348 (28.1%) | 890 (71.9%) | 04/12/2014 - 08/01/2015

Articles Involved | Hashtags Involved
217               | 725

[Figure 1: The system interface used to gather feedback on hashtag relevance with respect to an article.]

The users can see the hashtags retrieved via simple baselines for each article, and can provide feedback for each hashtag. The baselines use the simple frequency of usage of a hashtag, or the local cosine similarity between the article and hashtag profiles for h ∈ H_a over local time window λ. The interface, shown in Figure 1, enables users to quickly provide feedback while browsing the news presented.

One interesting aspect of labeling in this dynamic context is that for the same article-hashtag pair the label may change depending on the time of labeling; in particular, more specific hashtags emerge as users engage with news stories on Twitter. To simplify the labeling procedure, users were instructed to decide only whether a hashtag is relevant (specifically or generally on the topic) or irrelevant to the article, at the time they are labeling it. We exposed this Web interface with the above instructions to a group of researchers and journalists over 1 month, allowing us to gather around 1,200 labeled examples[3]. We use this data as ground truth for evaluating various features and approaches. Details on the labeled data distribution are given in Table 3. Note that articles are only paired with subsets of hashtags, rather than all hashtags (i.e., the labeled pairs are a subset of the full cross-product of articles and hashtags).

4.2 Experiment 1: Keyphrase Extraction

In this section we evaluate three different approaches for extracting keyphrases, with the goal of maximizing a form of precision and recall on the retrieved tweet set for the article. The methods compared are:

1. Tf-idf unigrams: Select all unigrams (single words) in the pseudo-article (headline, subheadline, first sentence). Compute the tf-idf of single words using the full article. Pair words to form 2-gram phrases. Rank pairs by the average of the tf-idf scores of the individual terms; take the top 5 pairs as the article's keyphrases.

2. POS-tag: Apply part-of-speech tagging to the pseudo-article. Take the first 5 nouns/phrases, giving priority to entities (noun phrases, proper nouns), then frequent nouns, then all other nouns. Pair the single nouns into 2-gram phrases, and break long noun phrases into 2-grams. Take the first 5 pairs as the article's keyphrases (alphabetical order).

[3] Labeled data available from https://sites.google.com/site/bichenshi/


Table 4: Example article and extracted keyphrases using three approaches.

Headline: Putin re-emerges in public after rumours over 10-day absence
Subheadline: Russian president jokes to media that life 'would be boring if there was no gossip'
First Sentence: Vladimir Putin has reappeared in public after a mysterious 10-day absence that sparked frenzied speculation about the whereabouts of the Russian president, his health and mental wellbeing, and even his grip on power.

Tf-idf: president whereabouts, mysterious whereabouts, wellbeing whereabouts, frenzied whereabouts, absence whereabouts
POS-tag: gossip president, life power, media president, absence president, health media
POS-tag + Tf-idf: president russian, absence russian, putin vladimir, power russian, grip russian

Table 5: Similarity versus number of tweets retrieved via keyphrase extraction using 3 methods.

                 Tf-idf   POS-tag   POS-tag + Tf-idf
Avg Cosine       0.1374   0.1321    0.1712
Avg No. Tweets   753.60   1092.13   1870.58

3. POS-tag + Tf-idf (our approach): Same process as POS-tag but, for selecting the final subset, rank pairs by the average tf-idf score of the individual terms and select the top 5 pairs as the article's keyphrases.

Experiment Setup. We collect 300 news articles and extract keyphrases using the above three approaches. We track these article-keyphrases for 24h via the Twitter Streaming API, and for each article we gather 3 tweet-bags corresponding to the three sets of article-keyphrases. Table 4 shows example keyphrases for a given article under the different selection strategies.

For each of the 3 approaches, we have 300 articles, and each article has one tweet-bag. To estimate the precision of each approach, we compute the cosine similarity between the article and its tweet-bag tf-idf profile, then average over the 300 articles. This gives us an indicator of the focus of the tweet-bag. To estimate the recall, we average the sizes of the tweet-bags (number of tweets per article) over all articles.

Evaluation. As shown in Table 5, combining POS-tagging (for light entity detection) and tf-idf ranking (to focus on the right terms) retrieves twice as many tweets as the other two approaches, and the tweets are also more similar to the article content. Note that this article-keyphrase extraction step is focused on retrieving relevant content from a very noisy and fast-paced social media stream such as Twitter, rather than being a generic article query formulation method. Although still noisy, this step is followed by a precision-oriented ranking based on a learning approach.

4.3 Experiment 2: Feature Evaluation

Experiment Setup. We present a thorough analysis of the influence of different features on learning a hashtag relevance classifier. The evaluation is done via 10-fold cross-validation on the full labeled set (1.2k examples). Since the time effect is encoded in the feature vector, randomising samples in the cross-validation step is not an issue. We show further evidence for this statement in Section 4.4. For the relevance classifier we use an ensemble approach: Random Forest. Our choice is based on previous studies showing that Random Forests are robust to noise and very competitive regarding accuracy [9].

We test different combinations of the five types of features (Local, Global, Trending, Headline, and User; 14 features in total), and compare the classification performance.
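The evaluation protocol expressed in scikit-learn terms (a sketch; the paper does not report Random Forest hyperparameters, so the settings below are placeholders):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

def evaluate_features(X, y):
    """10-fold CV of a Random Forest on labeled (article, hashtag) pairs.
    X: (n_samples, 14) feature matrix; y: binary relevance labels.
    n_estimators/random_state are placeholders, not the authors' settings."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_validate(clf, X, y, cv=10,
                            scoring=['precision', 'recall', 'roc_auc'])
    return {m: s.mean() for m, s in scores.items() if m.startswith('test_')}
```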

[Figure 2: Boxplot distributions of 9 features (Local Frequency, Local Similarity, Global Frequency, Global Similarity, Tag In Headline, Unique User Ratio, Avg Follower, Max Follower, Expected Gain) in relevant (TRUE) and irrelevant (FALSE) labelled data.]

Figure 2 shows the distribution in the labelled data of the 9 most important features. We use three standard machine learning metrics to evaluate classification quality: Precision, Recall and AUC [32]. The first two measure the classification quality after a threshold is imposed on the classification score. We show Precision and Recall on the positive class (coined PP and PR) as well as the weighted average Precision and weighted average Recall over both classes (coined WP and WR). The latter averages the Precision/Recall for the 2 classes, weighting the performance on each class by class size [32]. AUC measures ranking quality and does not depend on a classification threshold. In practice we are more concerned with Precision and AUC, since for our application domain it is more important to have high Precision (recommend a few specific hashtags in the top ranks) than high Recall (retrieve all relevant hashtags).

Evaluation. Table 6 shows the results of using different combinations of features. Basic refers to using the Local and Global article-hashtag content and popularity features. Norm1 and Norm2 refer to min/max scaling of the original versus the log of the feature values (as described in Equations 4 and 5). All(Norm1&2) is the approach that includes all 14 features used in this work. Most related work in this area has focused on the type of features included in approach Basic. We observe that the three other categories of features (Trending, Headline and User), and the two normalization approaches, increase the Precision/Recall by 5% and the AUC by 9%.

Table 6: Evaluating features of the relevance classifier for hashtag recommendation.

                   PP      PR      WP      WR      AUC
Basic(Norm1)       81.5%   63.2%   85.3%   85.6%   86.7%
Basic(Norm2)       82.5%   63.8%   85.7%   86.0%   85.8%
Basic(Norm1&2)     83.8%   63.8%   86.1%   86.3%   87.0%
Basic+Trending     83.8%   66.7%   86.8%   87.0%   90.3%
Basic+Headline     82.2%   73.0%   87.7%   88.0%   92.5%
Basic+User         84.3%   64.7%   86.5%   86.7%   89.8%
All, no User       84.0%   74.1%   88.6%   88.8%   94.1%
All, no Headline   85.1%   64.1%   86.6%   86.8%   91.3%
All, no Trending   84.7%   75.0%   89.0%   89.2%   94.3%
All(Norm1)         87.2%   76.1%   90.0%   90.1%   94.9%
All(Norm2)         88.5%   75.0%   90.1%   90.2%   94.9%
All(Norm1&2)       87.5%   76.4%   90.2%   90.3%   95.0%

4.4 Experiment 3: Size of Training Data and Time Effect

In this experiment we analyse the influence of the number of training examples, as well as the time effect, on the classification quality. We carry out two experiments. The first studies the effect of the recency and size of the training data on the quality of recommendation (variable training set, fixed test). The second checks whether the quality of recommendation remains stable over time (5 months) given that we do not retrain the classifier (fixed training set, variable test).

4.4.1 Training Size versus Recency

Experiment Setup. We order the 1.2k labeled examples by time, from the oldest to the most recent. We use the most recent 400 examples as a hold-out test set, gradually add examples to the training set in batches of size 50, and train a Random Forest classifier. We compare two strategies for selecting training data: Backward and Random. The Backward approach starts with the most recent 50 examples and adds 50 at a time going back in time to older examples, until it reaches 800 training examples. The Random strategy selects a random sample of the given size from the set of 800 training examples.

Evaluation. Figure 3 shows the Precision of the classifier tested on the hold-out test set when increasing the number of training examples, with the two sampling strategies. The plots for Recall and AUC behave similarly and are not shown here. We note that both strategies behave similarly, with Precision increasing quickly with the number of labeled examples. The Random strategy delivers less stable Precision at smaller sample sizes. This is due to the variation in the positive/negative ratio of examples in those labeled training sets. Nevertheless, both methods achieve similar Precision at about 700 labeled examples, suggesting that the sampling strategy is not important once enough training data is available.

[Figure 3: Precision for different training-data sizes with two sampling strategies (Backward, Random), tested on the hold-out set.]

4.4.2 Recommendation Quality over Time

Experiment Setup. We use the entire 1.2k labeled examples, collected in December 2014, to train a Random Forest classifier. For each month from March to July 2015, we randomly pick one day and use the articles from that day as testing data.

[Figure 4: Precision@1 and article coverage for the 5 test days from March to July 2015 (relevance score threshold at 0.5). Precision@1 per month: 0.872, 0.901, 0.865, 0.875, 0.899; article coverage: 72%, 77.0%, 67.3%, 61.0%, 65.0%.]

We ask a group of researchers to evaluate the top-one recommendation (ranked by the classification score) for each article in the 5 test days. The threshold on the classification score is set to 0.5, which means an article gets a hashtag recommendation only if at least one hashtag has a predicted relevance score above the 0.5 threshold.

Evaluation. We measure the average Precision@1 for each test day, based on the evaluation results of the annotators. In practice, we are also interested in the percentage of articles that get a recommendation (article coverage), which varies with the selected threshold and is also influenced by the Twitter activity and the topics of the news articles on that day. Figure 4 shows that the Precision@1 for all 5 days is around 0.87 and the percentage of articles covered is in the range of 60%-80%. The results suggest that even though the classifier is trained on December's data, the quality of recommendation remains stable when tested on data from half a year later; thus collecting new labeled data to retrain the classifier is not necessary.

4.5 Experiment 4: Comparison to SOTA

In this experiment, we compare our approach (named Hashtagger) to three SOTA hashtag recommendation techniques based on topic modelling: Naïve Bayes, LibLinear, and LDA. We study the precision as well as the scalability of these approaches. The main difference between our method and existing methods lies in how the modelling is done, and in what training data and features are used. Regarding modelling, the SOTA approaches model each hashtag as a topic, as in classic topic classification, while we model hashtag relevance with a L2R approach. Regarding training data, the SOTA approaches are trained on the most recent tweets with hashtags, and need to be retrained constantly to adapt to new content. As our ranking classifier only needs to be trained once, it uses the 1.2k labeled examples for the entire run. Regarding features, most methods rely on text similarity (between article and hashtag representation) and the frequency of usage of the hashtag. We compare our approach to prior techniques using the same tweets and set of features (local text similarity and frequency) to assess the impact of the modelling approach. Additionally, we also show results for our method using the full set of features proposed, to assess the impact of modelling plus features.

1. Naïve Bayes[4] [7]: A hashtag is seen as a category, and tweets mentioning that hashtag are used as labeled data to train a Naïve Bayes classifier via multi-class classification.

2. LibShortText [36]: A library for short-text classification and analysis that builds upon the SOTA library LibLinear [8], which supports millions of instances and features. LibShortText implements a multi-class Gaussian-kernel SVM. Similar to the Naïve Bayes approach, each hashtag is considered as a category and tweets mentioning a hashtag are used as labeled data.

3. LDA[5] [6]: Topic modelling with Latent Dirichlet Allocation, representing each tweet as a mixture of topics. Trained on a collection of tweets, LDA returns a set of scored topics that each tweet belongs to; each topic is typically represented as a group of ranked words. We use the highest scored topic to recommend hashtags.

4.5.1 Precision

As an evaluation metric we use Precision@1, recommending the maximum-score prediction of each method.

Experiment Setup. The three SOTA approaches are designed to work best in a static environment, where the set of tweets and hashtags is static and analysed in an offline batch mode. To adapt them to a real-time environment, we retrain these methods in a sliding-window style: Given a time t_0, all methods are trained on all tweets (with hashtags) falling in a 4h window ahead of t_0, then they recommend hashtags for articles that are posted up to 2h after t_0. At the next time point t_1 = t_0 + 2h, we discard the previous models, retrain all models with new tweets that are 4h ahead of t_1, and use these models for another 2h. Because Naïve Bayes' running time is very short (as shown in Section 4.5.2), we test Naïve Bayes under two settings: retrain every 2h and every 10min.

We randomly pick a starting time point t_0 (0:00, April 14th, 2015, UTC), then run the experiment for 24h, involving 270 articles and 313k tweets that have at least one hashtag (about 26.1k tweets per 4h time window). Each pseudo-article (headline, subheadline and first sentence) is considered as a rich tweet, and each method recommends one hashtag to each article. For LDA, the number of topics is set to 50 per time window and the number of iterations is 100, and we use the top-ranked term in the top-ranked topic as a recommended hashtag [20, 6, 16]. In order for all methods to work from the same data, we test Hashtagger on feature vectors computed over the tweets in tweet-bag T_a that are published up to 4h ahead of the article publishing time t_a. Also, we test two versions of Hashtagger: Hashtagger(2) uses only two local features (LF_{a,h,λ} and LS_{a,h,λ}), while Hashtagger(All) uses all 14 features.

[4] http://scikit-learn.org/...MultinomialNB
[5] https://pypi.python.org/pypi/lda

Evaluation. We asked a group of annotators to evaluate the 6 × 270 = 1620 article-hashtag pairs as relevant/irrelevant, and averaged their results. As each method gives one recommendation per article, accompanied by a prediction score, the Precision@1 and the number of articles that get a recommendation (the article coverage rate) are both functions of the threshold on the prediction score. A higher threshold value results in better recommendation quality, but naturally reduces the article coverage rate. Since the predicted scores of different methods are not directly comparable, we compare the Precision@1 of the six methods under different article coverage rates. For each method, we set the threshold to each unique predicted value in increasing order, and record the article coverage rate and the Precision@1 at that threshold, as shown in Figure 5.
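This sweep is straightforward to express in code (a sketch; the per-article top scores and relevance judgments are assumed given):

```python
import numpy as np

def precision_coverage_curve(top_scores, top_relevant):
    """One (coverage, Precision@1) point per unique prediction score:
    top_scores[i] is the best score for article i, top_relevant[i] whether
    that top hashtag was judged relevant."""
    s = np.asarray(top_scores, dtype=float)
    r = np.asarray(top_relevant, dtype=bool)
    curve = []
    for th in np.unique(s):  # unique values, in increasing order
        covered = s >= th
        curve.append((covered.mean(), r[covered].mean()))
    return curve
```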

When the article coverage rate is 100% (i.e., we record one recommendation for each article regardless of how low the prediction score is), the Precision@1 for Hashtagger(All), Hashtagger(2), Naïve Bayes(2h), Naïve Bayes(10min), LibShortText, and LDA is 0.618, 0.533, 0.374, 0.396, 0.447 and 0.385, respectively. The results for the SOTA methods are in agreement with published studies [20, 15, 12, 6, 16]. Naïve Bayes(10min) has higher Precision@1 than Naïve Bayes(2h), showing that frequent retraining can reduce the content gap between training and test sets. Regardless of the article coverage rate, Hashtagger(2), which uses only basic similarity and frequency features, consistently outperforms the SOTA methods, showing the positive impact of our modelling. Hashtagger(All) has the highest Precision@1 score, suggesting that both modelling and feature engineering are important. For a fixed threshold of 0.5, Hashtagger(All) has a Precision@1 of 0.89.

Table 7 shows recommended hashtags and prediction scores of the five approaches. Hashtags in bold are labelled as relevant by all our annotators. We note that Hashtagger gives more reliable recommendations, including recommending specific hashtags (e.g., #wiveng for West Indies vs England), while the other approaches provide more general, even irrelevant hashtags. One advantage of Hashtagger is that, unlike the other SOTA approaches, it can predict the relevance of unseen articles and hashtags that have not appeared in the training set. Due to our choice of modelling, it is also possible for us to gather a small amount of high-quality manual labels for the classifier, which is much cleaner than the tweets with hashtags used by the SOTA approaches. Gathering manual labels for the other SOTA approaches is not feasible because they need to be retrained with new labels very often. Hence, Hashtagger achieves higher Precision@1 than the SOTA approaches.

4.5.2 Scalability

Experiment Setup. To further examine the scalability of the four approaches, we compare their execution time while increasing the number of tweets used for training/testing. For Naïve Bayes, LibShortText and LDA, we take samples of tweets of different sizes, from 10k to 150k, as training data, and record their model fitting time. We repeatedly run Hashtagger over randomly selected article collections with total tweet-bag sizes ranging from 10k to 150k.


Table 7: Examples of recommended hashtags by the compared methods.

Article Headline                                           | Hashtagger(All) | Hashtagger(2) | Naïve Bayes   | LibShortText  | LDA
Nokia in deal talks with Alcatel-Lucent                    | #nokia 0.83     | #news 0.93    | #news 0.90    | #follow 0.09  | #home 0.1
Ian Bell ton gives England the upper hand in Antigua       | #wiveng 0.52    | #wiveng 0.65  | #lfc 0.72     | #iran 0.13    | #news 0.11
Syria-bound son of British councillor deported from Turkey | #syria 0.97     | #syria 0.88   | #yemen 0.68   | #news 0.77    | #wallstreet 0.04
Seventeen killed in attack on Somalia education ministry   | #somalia 0.99   | #somalia 0.94 | #somalia 0.97 | #somalia 0.23 | #somalia 0.1

[Figure 5: Precision@1 versus article coverage for the six methods compared: Hashtagger(All), Hashtagger(2), NB(2h), NB(10min), LibShortText and LDA. At threshold 0.5, Hashtagger(All) reaches a Precision@1 of 0.89.]

[Figure 6: Running time of the four methods (Hashtagger(All), Naïve Bayes, LibShortText, LDA) for different sizes of the tweet set used for training/testing.]


Evaluation. The execution time shown in Figure 6 for each classifier matches known results: Naïve Bayes, known to be very efficient with linear training/testing complexity [21], takes around 100s to train on 150k tweets. The RBF SVM (LibShortText), with complexity O(n^3) [5], takes 120s to train on only 10k tweets. The training speed of LDA lies between Naïve Bayes and the SVM, taking 100s to train on 30k tweets (with 50 topics). Hashtagger has a linear testing time similar to Naïve Bayes' training time, processing 150k tweets in 100s. Nevertheless, Hashtagger delivers much higher recommendation precision.

5. APPLICATIONS

In this section we study two applications of real-time hashtag recommendation for news. The first uses Twitter as a publishing platform for online news and measures the effect of attaching hashtags to headlines as a way of reaching wider Twitter communities, which in turn is hypothesised to lead to more engagement with those news items (e.g., more URL clicks). The second application looks at the benefits of indexing news using recommended crowdsourced tags (which we call social indexing), for better news retrieval and story tracking.

5.1 Online News Publishing

We study the impact of our hashtag recommendations by automatically tweeting news headlines as follows. As soon as a headline is retrieved from an RSS feed and receives a hashtag recommendation from our system, it falls into one of three groups, decided by a random variable. The first group is tweeted as is (headline + URL), the second is tweeted by appending #news to each headline (headline + #news + URL), and the remaining group is tweeted with the top hashtag recommended by Hashtagger. We then use impact metrics provided by Twitter Analytics[6] to compare the 3 groups of headlines. The goal is to assess whether tweeting the news headlines with our recommended hashtags leads to higher engagement with those news items, as compared to not using any hashtags, or using a generic hashtag such as #news. The hypothesis is that by attaching good hashtags to the news headlines, those news items reach wider and possibly more engaged audiences.

We automatically tweet from a Twitter account named @insight4news3, which we use for researching the effect of publishing hashtagged news on Twitter. This account was created in April 2015 and at the time of writing has issued 99k tweets and has 438 followers. We run the process described above over 3 months, and draw a sample of 15k tweeted news headlines, split into the 3 groups (5k per group). We collect the total impressions, engagements and URL clicks as provided by Twitter Analytics. The original data is available here[7]. Figure 7 shows these metrics for the 3 groups. In order to avoid spurious results, we remove the outliers for each group and metric (the top 5% quantile).

We observe that tweets with no hashtag and with #news attract a similar total amount of impressions (85k), engagements (400/600) and URL clicks (300), with the #news group only slightly better than the no-hashtag group, showing that a generic hashtag does not draw more audience to tweets. Tweets with our recommended hashtags generate more traffic, with 150k impressions, 1.3k engagements and 750 URL clicks. In addition, the engagement rate (the number of engagements over impressions) is also increased compared to the no-hashtag group: 0.86% versus 0.47%, suggesting that our approach helps tweets reach a wider audience, and leads to increased user engagement with the news articles.

[Figure 7: Twitter Analytics metrics to measure the impact of hashtagging on news engagement: sums of impressions, engagements and URL clicks for the No Hashtag, #News and Hashtagger groups.]

5.2 Social Indexing of News

The classic approach to indexing documents is to use keywords extracted from those documents. For example, for the news headline "Greek crisis: Euro zone rules out talks until after referendum", the corresponding article would hypothetically be indexed by the keywords "greek, crisis, euro, referendum".

[6] https://gnip.com/docs/Simply-Measured-Complete-Guide-to-Twitter-Analytics.pdf
[7] https://drive.google.com/file/d/0B3N3pPOTCaegdFRtbzBGbkVXMnc


When issuing a query such as "greece crisis euro", articles indexed by these keywords are retrieved from the article collection. Although the accuracy of keyword search has continuously improved, the main weaknesses remain: (1) missing articles that do not contain these exact keywords; (2) returning too many irrelevant results.

An accurate method for associating articles and Twitter hashtags allows us to index articles using both keywords and hashtags. For example, the above article could be indexed by "greek, crisis, euro, referendum, #greece, #grexit, #greferendum, #tsipras, #eurogroup". This also means that we can now formulate queries that mix keywords and hashtags, such as greece #grexit. We call this social indexing; its benefits are three-fold (see the sketch after this list):

1. Takes advantage of crowdsourced content as a form of real-time, continuous tagging of news.

2. Hashtags are not necessarily topical, and they have the advantage of grouping together articles belonging to the same story (e.g., racial conflicts in the US: #ericgarner, #blacklivesmatter, #icantbreathe).

3. Hashtags allow the query to focus on diverse aspects of a story (e.g., the Greek economic crisis: #grexit, #greferendum, #tsipras, #merkel, #ecb, #imf, #finland).
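As an illustration of such mixed queries, here is a minimal sketch of a social index (a simplified inverted index, not our production implementation) that treats keywords and recommended hashtags as terms of the same vocabulary:

from collections import defaultdict

class SocialIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of article ids

    def add(self, article_id: str, keywords: list[str], hashtags: list[str]):
        """Index an article under its keywords and its recommended hashtags."""
        for term in keywords + hashtags:  # hashtags keep their leading '#'
            self.postings[term.lower()].add(article_id)

    def search(self, query: str) -> set[str]:
        """AND-semantics over whitespace-separated keyword/hashtag terms."""
        terms = query.lower().split()
        sets = [self.postings[t] for t in terms]
        return set.intersection(*sets) if sets else set()

index = SocialIndex()
index.add("a1", ["greek", "crisis", "euro", "referendum"],
          ["#greece", "#grexit", "#greferendum", "#tsipras", "#eurogroup"])
print(index.search("greek #grexit"))  # {'a1'}

Because hashtags and keywords share one posting structure, a query can narrow by story hashtag (#grexit) while still matching on article content terms.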

We discuss the above points in the context of story tracking. Many news organisations offer story-pages on their website, i.e., curated collections of news articles that allow the reader to get an overview of, and updates on, particular issues, e.g., referendums, elections, budgets. The Irish Times has dedicated story-pages for issues of relevance to Irish society, e.g., the introduction of a tax on water8 (Twitter hashtag #irishwater), the inquiry into the banking collapse9 of 2008 (Twitter hashtag #bankinginquiry), and the recent marriage equality referendum10 (Twitter hashtag #marref). Similarly, the BBC and The Guardian also publish story-pages, e.g., the BBC story-page on the "Greek debt crisis"11 and the Guardian page on Liberia12. Preparing these story-pages currently relies on prior agreement among journalists to manually tag all articles relevant to a pre-agreed set of stories with the same set of tags.

8 http://www.irishtimes.com/news/water-charges
9 http://www.irishtimes.com/news/banking-inquiry

10 http://www.irishtimes.com/news/politics/marriage-referendum

11 http://www.bbc.com/news/world-europe-33225461
12 http://www.theguardian.com/world/ebola

Once a decision is taken to create a story-page, its articles are continuously retrieved from the news archive via the manual tag set. The problem with this approach is that it relies on foresight about which stories are worth covering and which tag is the right one to use for those story-articles. By building on our hashtag recommendation approach, we let the Twitter crowd do the tagging in real-time (via Hashtagger), potentially capturing novel emerging concepts. The assumption is that most stories that merit story-pages generate plenty of quality discussion and focused hashtags on Twitter, a hypothesis currently supported by our experiments. For example, #migrant covers the unfolding migrant/refugee crisis, retrieving 90 articles13 with this recommended hashtag in the time period August 27 to October 15, 2015. The #refugee tag retrieves 149 articles over the same time period, indicating a potential change of discourse around this issue. We intend to further study the use of social indexing for story tracking and retrieval.
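Under social indexing, building such a story-page reduces to filtering the archive by a recommended hashtag within a date window, as in the #migrant example above. A minimal sketch, where the Article record and in-memory archive are illustrative assumptions:

from dataclasses import dataclass
from datetime import date

@dataclass
class Article:
    title: str
    published: date
    hashtags: list[str]  # hashtags recommended for this article

def story_page(articles: list[Article], tag: str,
               start: date, end: date) -> list[Article]:
    """Articles carrying `tag` and published within [start, end]."""
    return [a for a in articles
            if tag in a.hashtags and start <= a.published <= end]

# e.g. len(story_page(archive, "#migrant", date(2015, 8, 27), date(2015, 10, 15)))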

6. CONCLUSION
We present Hashtagger, an approach to real-time, high-precision hashtag recommendation for streaming news. Our method relies on a learning-to-rank model tailored to a dynamic setting where news and tags are streaming and have variable life-cycles. We systematically compare our approach to the state-of-the-art and show that our method delivers much higher precision than existing methods. This is due to our choice of modelling approach (relevance ranking versus topic modelling) and the set of time-aware features we investigate. Hashtagger is designed to work in real-time, real-world application settings. We employ our recommendations in a real-life study using Twitter as an online news publishing platform, and show that accurate hashtagging drives higher news engagement. We also discuss the implications of building on hashtag recommendation for social indexing of news. In future work we intend to analyse the impact of hashtag recommendation on automatic story detection and tracking.

7. ACKNOWLEDGMENTS
This work was funded by Science Foundation Ireland (SFI) under grant number 12/RC/2289.

13 http://insight4news.ucd.ie/insight4news/hashtag/%23migrant


8. REFERENCES
[1] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, pages 89–96. ACM, 2005.

[2] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129–136. ACM, 2007.

[3] C. Castillo, M. El-Haddad, J. Pfeffer, and M. Stempeck. Characterizing the life cycle of online news stories using social media reactions. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, pages 211–223. ACM, 2014.

[4] J. Cheng, L. Adamic, P. A. Dow, J. M. Kleinberg, and J. Leskovec. Can cascades be predicted? In Proceedings of the 23rd International Conference on World Wide Web, pages 925–936. ACM, 2014.

[5] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2:265–292, 2002.

[6] Z. Ding, X. Qiu, Q. Zhang, and X. Huang. Learning topical translation model for microblog hashtag suggestion. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pages 2078–2084. AAAI Press, 2013.

[7] R. Dovgopol and M. Nohelty. Twitter hash tag recommendation. arXiv preprint arXiv:1502.00094, 2015.

[8] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

[9] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15(1):3133–3181, Jan. 2014.

[10] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. The Journal of Machine Learning Research, 4:933–969, 2003.

[11] J. Fürnkranz and E. Hüllermeier. Pairwise preference learning and ranking. In Machine Learning: ECML 2003, pages 145–156. Springer, 2003.

[12] F. Godin, V. Slavkovikj, W. De Neve, B. Schrauwen, and R. Van de Walle. Using topic models for Twitter hashtag recommendation. In Proceedings of the 22nd International Conference on World Wide Web Companion, pages 593–596. International World Wide Web Conferences Steering Committee, 2013.

[13] H. Li. A short introduction to learning to rank. IEICE Transactions on Information and Systems, 94(10):1854–1862, 2011.

[14] J. Harding. Future of news. BBC, 2015.

[15] T.-A. Hoang-Vu, A. Bessa, L. Barbosa, and J. Freire. Bridging vocabularies to link tweets and news.

[16] Z. Ding, Q. Zhang, and X. Huang. Automatic hashtag recommendation for microblogs using topic-specific translation model. In 24th International Conference on Computational Linguistics, page 265. Citeseer, 2012.

[17] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142. ACM, 2002.

[18] P. Li, Q. Wu, and C. J. Burges. McRank: Learning to rank using multiple classification and gradient boosting. In Advances in Neural Information Processing Systems, pages 897–904, 2007.

[19] T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.

[20] Z. Ma, A. Sun, Q. Yuan, and G. Cong. Tagging your tweets: A probabilistic modeling of hashtag annotation in Twitter. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, pages 999–1008. ACM, 2014.

[21] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge, 2008.

[22] A. Mazzia and J. Juett. Suggesting hashtags on Twitter. EECS 545m, Machine Learning, Computer Science and Engineering, University of Michigan, 2009.

[23] R. Nallapati. Discriminative models for information retrieval. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 64–71. ACM, 2004.

[24] N. Naveed, T. Gottron, J. Kunegis, and A. C. Alhadi. Bad news travel fast: A content-based analysis of interestingness on Twitter. In Proceedings of the 3rd International Web Science Conference, page 8. ACM, 2011.

[25] C. J. C. Burges, R. Ragno, and Q. V. Le. Learning to rank with nonsmooth cost functions. In Advances in Neural Information Processing Systems, 19:193–200, 2007.

[26] D. Sculley. Combined regression and ranking. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 979–988. ACM, 2010.

[27] B. Shi, G. Ifrim, and N. Hurley. Be in the know: Connecting news articles to relevant Twitter conversations. arXiv preprint arXiv:1405.3117, 2014.

[28] X. Si and M. Sun. Tag-LDA for scalable real-time tag recommendation. Journal of Computational Information Systems, 6(1):23–31, 2009.

[29] Y. Song, Z. Zhuang, H. Li, Q. Zhao, J. Li, W.-C. Lee, and C. L. Giles. Real-time automatic tag recommendation. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 515–522. ACM, 2008.

[30] M.-F. Tsai, T.-Y. Liu, T. Qin, H.-H. Chen, and W.-Y. Ma. FRank: A ranking method with fidelity loss. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 383–390. ACM, 2007.

[31] Twitter. Twitter.


[32] I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Elsevier, 2011.

[33] F. Xiao, T. Noro, and T. Tokuda. News-topic oriented hashtag recommendation in Twitter based on characteristic co-occurrence word detection. In Web Engineering, pages 16–30. Springer, 2012.

[34] J. Xu and H. Li. AdaRank: A boosting algorithm for information retrieval. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 391–398. ACM, 2007.

[35] S.-H. Yang, A. Kolcz, A. Schlaikjer, and P. Gupta. Large-scale high-precision topic modeling on Twitter. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1907–1916. ACM, 2014.

[36] H. Yu, C. Ho, Y. Juan, and C. Lin. LibShortText: A library for short-text classification and analysis. Technical report, http://www.csie.ntu.edu.tw/~cjlin/papers/libshorttext.pdf, 2013.
