Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim


Transcript of Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Page 1: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Hashtagger+: Real-time Social Tagging of Streaming News

Georgiana Ifrim (joint work with Bichen Shi, Gevorg Poghosyan, Neil Hurley), Insight Centre for Data Analytics, University College Dublin, Ireland

1

Page 2: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

The Umbrella Revolution: Sit-in street protests in Hong Kong, 2014

2

“The ants have megaphones now” (C. Anderson)

Page 3: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

[Figure: timeline (Sep 21 to Oct 07, 2014) of hashtag activity for #OccupyCentral, #UmbrellaRevolution and #HongKong, annotated with headlines:]

Hong Kong students begin pro-democracy class boycott

Thousands at Hong Kong protest as Occupy Central is launched

Hong Kong protests: Thousands defy calls to go home

Hong Kong students vow stronger protests if leader stays

Hong Kong protests: Formal talks agreed as protests shrink

3

Page 4: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Insight Centre for Data Analytics

Motivation: News articles – Hashtag – Twitter conversation

#IndyRef (Referendum on Scottish Independence)

BBC: Scottish independence: Yes vote 'means big Scots EU boost'

BBC: Could Scotland compete on tax with Westminster?

IrishTimes: Brown promises more devolution for Scotland

RTE: Lloyds could move south if Scots vote for independence

Reuters: British PM heads to Scotland as independence campaign gathers steam

TheGuardian: Scottish independence: No camp sends for Gordon Brown as polls tighten

April 2016 4

Page 5: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Outline

• Problem Statement

• State-of-the-Art

• Hashtagger+ Model

• Applications

• Conclusion

5

Page 6: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Insight Centre for Data Analytics

Problem Statement

Map a stream of articles to a stream of hashtags in real time, with high precision and high coverage.

• Joe Schmidt makes six changes to Irish side to face Japan #rugby

• Paris Airshow: eight takeaways from the major aerospace event #business

• The tortoise and the software: the human glitch in the machine #ux

• Duke of Edinburgh leaves hospital #princephilip

6

Page 7: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Insight Centre for Data Analytics

Problem Statement

• Real-time recommendation: given an article, how quickly can we recommend hashtags? (5 minutes OK, 5 hours not OK)

• High precision (focused hashtags):

  ✗ Deadly car bomb targets Afghan bank #news

  ✓ Deadly car bomb targets Afghan bank #afghanistan #helmand

• High coverage: how many articles get any recommended hashtags within 5 minutes? (9 out of 10 OK, 1 out of 10 not OK)

7

Page 8: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

State-of-the-Art

• Modeling Approach:

  • Multi-class Classification

  • Content-based Features

  • Static Datasets

• Workflow: Tweets -> Article

  • Collect Tweets -> Hashtags as Classes -> Train Hashtag Classifiers -> Apply Classifiers to Article -> Recommended Hashtags

8

Page 9: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Insight Centre for Data Analytics

State-of-the-art:

• Multi-class classification (e.g., Naive Bayes, SVM, LDA, CNN)

• One hashtag = one class

• Content-based features


#GE16

#ge16: Fine Gael and Fianna Fáil to discuss government options

Ruth Coppinger to be nominated for Taoiseach #GE16 #irishwater

#Germanwings: "No evidence" that co-pilot told anyone he was planning #Germanwings crash, prosecutor says

9

Page 10: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Insight Centre for Data Analytics

State-of-the-art:


[Figure: state-of-the-art workflow. A model is trained on tweets for known hashtags, then applied to incoming articles.]

Train on:

  #GE16: "#ge16: Fine Gael and Fianna Fáil to discuss government options"; "Ruth Coppinger to be nominated for Taoiseach #GE16 #irishwater"; …

  #Germanwings: "No evidence" that co-pilot told anyone he was planning #Germanwings crash, prosecutor says; …

Weakness: how about new hashtags? Concept drift of old hashtags? The model must be re-trained, e.g. for:

  #PanamaPapers: "#PanamaPapers: Mossack Fonseca leak reveals elite's tax havens"; "#PanamaPapers: How the World's Rich and Famous Hide Their Money Offshore"

  #Germanwings (one year on): "One year on, Haltern commemorates the crash"; "Nice flight from Manchester to Koln/Bonn this morning."

10

Page 11: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Insight Centre for Data Analytics

Challenges:

•Many Classes: thousands of hashtags (e.g., 26k/day)

• Dynamic Classes: hashtags emerge and die off

• Concept Drift: the usage and meaning of hashtags change

•Efficiency/coverage: real-time tagging to capture how the story moves over time

•Precision: state-of-the-art models have P@1 of ~50%

11

Page 12: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Hashtagger+ Model

• Modeling Approach:

  • Learning-to-rank (L2R)

  • Focus on the concept of hashtag relevance

  • IR Framework: Article = query, Hashtags = documents retrieved/ranked for the query

• Workflow: Article -> Tweets

12

Page 13: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Hashtagger+ Model

13

Page 14: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Insight Centre for Data Analytics

Hashtagger+ Model


SOTA: Multi-class Classification

  Object    Class
  Article1  Hashtagx, Hashtagy, Hashtagz
  Article2  Hashtagx
  Article3  Hashtagy, Hashtagz

Proposed L2R Model

  Object                 Class
  (Article1, Hashtagx)   Relevant
  (Article1, Hashtagm)   Irrelevant
  (Article1, Hashtagn)   Irrelevant

14

Page 15: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

• Pointwise L2R model

• Input feature vector x_(article, hashtag, time) describes a given (Article, Hashtag) pair at a point in time

• Human-provided label y_(article, hashtag, time) tells if the hashtag is relevant or irrelevant to the article, at that point in time

• Time-aware features capture how strongly a hashtag is associated with an article: content similarity; hashtag popularity, specificity, trending; user credibility
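A minimal sketch of the pointwise idea: each (article, hashtag) pair gets a relevance score from its feature vector, and the candidate hashtags are sorted by that score. The feature names, weights and values below are illustrative only; the actual system learns a model (e.g. a Random Forest) from the human-labelled pairs rather than using fixed weights.

```python
# Sketch of pointwise L2R hashtag ranking. Feature names and weights are
# hypothetical; the real system trains a classifier on labelled pairs.

def score(features, weights):
    """Relevance score of one (article, hashtag) pair: weighted feature sum."""
    return sum(weights[k] * features.get(k, 0.0) for k in weights)

def rank_hashtags(candidates, weights, top_k=3):
    """candidates: {hashtag: feature dict}. Return top_k hashtags by score."""
    ranked = sorted(candidates,
                    key=lambda h: score(candidates[h], weights),
                    reverse=True)
    return ranked[:top_k]

weights = {"content_similarity": 0.5, "popularity": 0.2, "trending": 0.3}
candidates = {
    "#GE16":        {"content_similarity": 0.34, "popularity": 0.73, "trending": 0.0},
    "#Germanwings": {"content_similarity": 0.01, "popularity": 0.23, "trending": 0.0},
    "#irishwater":  {"content_similarity": 0.28, "popularity": 0.45, "trending": 1.0},
}
print(rank_hashtags(candidates, weights, top_k=2))
```

Because the model scores pairs rather than assigning one class per hashtag, a brand-new hashtag only needs a feature vector, not a retrained classifier.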

15

Hashtagger+ Model

Page 16: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Insight Centre for Data Analytics

Hashtagger+ Model:


Train:

  (Article1, #GE16)          0.34  0.73  0  …  Relevant
  (Article1, #Germanwings)   0.01  0.23  0  …  Irrelevant
  (Article2, #GE16)          0.02  0.48  0  …  Irrelevant
  (Article2, #Germanwings)   0.76  0.45  1  …  Relevant

How about new hashtags? Concept drift of old hashtags? Train once, use the model (no retraining needed).

Apply:

  (Article1, #PanamaPapers)  0.66  0.82  1  …
  (Article2, #PanamaPapers)  0.08  0.73  0  …
  (Article1, #Germanwings)   0.28  0.45  0  …
  (Article2, #Germanwings)   0.53  0.24  1  …

16

Page 17: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Insight Centre for Data Analytics

Two-Step L2R Approach

• Filtering: Article -> Set of Candidate Hashtags

• Efficient Data Collection

• Query generation from given article

• Retrieving relevant tweets for article/query

• Ranking Model: Article, Candidate Hashtags -> Ranked Hashtag List

• Apply pre-trained L2R model to rank candidate hashtags

17

Page 18: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Hashtagger+ Model

18

Page 19: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Query Generation: Article -> Query

•What is a good set of keywords to describe what the article is about? (open research problem)

•How quickly can we generate the query?

•How good is the set of tweets retrieved with a given query?

•We compare 4 methods for query generation and the effect on quality & size of retrieved tweet set

19

Page 20: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Tweet Retrieval: Query -> Tweets

•Given a query (generated from an article), how do we quickly collect a good set of tweets?

•Cold-start Search for new articles:

• Re-use tweets collected for older articles

• How do we do this efficiently/effectively?

•Twitter Streaming API to continuously update tweet collection for each article

20

Page 21: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Experiments

•Query Generation

•Comparing L2R Algorithms

•Comparing to State-of-the-Art Methods

21

Page 22: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Query Generation

22

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

We study four approaches for generating keyword queries from news articles:

1) POS + Tf.idf: Part-of-speech tagging of the pseudo-article. This method selects the first 5 nouns/phrases by giving priority to noun phrases, proper nouns, frequent nouns, then all other nouns. It then pairs the single nouns into 2-grams and breaks long noun phrases into 2-grams. These pairs are then ranked by the average tf.idf score of the individual terms, and the top 5 pairs are selected as the query. This method is implemented by the model presented in [6].

2) POS + NER + Tf.idf: The second method follows a similar process as above, but adds one step of Named Entity Recognition (NER) after POS tagging. We use the Stanford NER Tagger [36] to detect entities. When ranking all pairs by the average tf.idf score, pairs containing entities get a 1.5 boost. The aim is to focus more on key entities, rather than generic words. We select the top 5 keyphrases as the final query.

3) AlchemyAPI³: This is a commercial tool developed by IBM for extracting the most important keyphrases from a given article. Alchemy's keyword extraction algorithm employs sophisticated statistical algorithms and natural language processing technology to analyze the article content and identify the best keywords. Given the article's full content, AlchemyAPI returns a list of ranked keyphrases. We take the top 5 keyphrases as the query.

4) Article URL: The Twitter Streaming API supports using URLs as keyphrases. It retrieves tweets containing both the full and shortened versions of the target URL. By using the article URL as a query, we retrieve only relevant tweets that contain the article URL (as in [16], [19]). We expect these tweets to be cleaner regarding content, but very few.

Table 1 presents the keyphrases selected by each approach for an example article. In Section 4.2 we present an empirical study to evaluate the impact of each query type on the amount/quality of data collected, as well as how this influences the recommendation effectiveness.

TABLE 1: Example article and ranked article-keyphrases using 4 approaches.

  Article headline:    Easyjet doubles number of female pilots
  Subheadline:         Easyjet says it has doubled the number of female pilots this year and is on the hunt for more.
  First sentence:      The Amy Johnson initiative, named after the first female pilot to fly solo from the UK to Australia, caused a surge in applications.

  POS + Tf.idf:        (1) australia easyjet, (2) easyjet number, (3) easyjet uk, (4) australia number, (5) australia uk
  POS + NER + Tf.idf:  (1) amy johnson, (2) australia easyjet, (3) easyjet uk, (4) australia uk, (5) easyjet number
  AlchemyAPI:          (1) amy johnson initiative, (2) female pilots, (3) easyjet, (4) female pilot, (5) surge
  URL:                 (1) bbc.com/news/business-38326523
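The tf.idf pair-ranking step shared by methods 1) and 2) can be sketched as below. POS tagging and NER are omitted, and the noun list and `tfidf` scores are illustrative stand-ins, not values from the paper.

```python
from itertools import combinations

# Sketch of the Tf.idf pair-ranking step for query generation.
# POS tagging / NER are assumed done; tfidf scores here are illustrative.

def make_query(nouns, tfidf, entities=(), boost=1.5, top_k=5):
    """Pair single nouns into 2-grams, rank by average tf.idf of the two
    terms, boost pairs containing a named entity, keep the top_k pairs."""
    pairs = []
    for a, b in combinations(sorted(nouns), 2):
        s = (tfidf.get(a, 0.0) + tfidf.get(b, 0.0)) / 2.0
        if a in entities or b in entities:
            s *= boost          # the POS + NER + Tf.idf variant
        pairs.append((s, f"{a} {b}"))
    pairs.sort(reverse=True)    # highest average tf.idf first
    return [p for _, p in pairs[:top_k]]

nouns = ["easyjet", "australia", "uk", "number"]
tfidf = {"easyjet": 0.9, "australia": 0.7, "uk": 0.5, "number": 0.4}
print(make_query(nouns, tfidf, top_k=5))
```

Passing `entities={"amy johnson"}`-style sets reproduces the 1.5 entity boost of the second method; with an empty set the function behaves like plain POS + Tf.idf.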

3.2.2 Cold-Start Search

In this section we describe methods to deal with cold-start: the challenge of quickly collecting data and providing recommendations for new articles. In our proposed recommender model, the extraction of dynamic features relies on a collection of relevant tweets for each article, collected via search over historical data or via streaming. When using only streaming, for each new article our system waits until sufficient tweets are gathered before computing features and recommending hashtags for that article. Depending on the popularity of the news event on Twitter, the waiting time varies from a few minutes to a few hours. This is a problem, in particular in a journalistic setting where the editor cannot wait more than a few minutes to get recommendations. Here we exploit the fact that many news articles are related to ongoing stories, locations or events, and can benefit from the data and recommendations collected for past articles.

3. http://www.alchemyapi.com/products/alchemylanguage/keyword-extraction

There are two types of Twitter APIs for collecting tweets: Search and Streaming. Given the current time tn, the Twitter Search API retrieves tweets posted before tn, while the Streaming API retrieves tweets as they are posted on Twitter (after tn). For a real-time system, we would naturally prefer using existing tweets rather than waiting for new tweets. However, the restricted rate limit of the Twitter Search API (1 request per minute) makes it infeasible to use, given the number of tweets the algorithm needs for a reliable recommendation (at least 100) and the number of articles processed by the system. We therefore only use the Twitter Streaming API to retrieve tweets. For search, we study methods that exploit the historical tweets and articles collected by our system. We explore three ideas for using past data (visually sketched in Figure 4):

1) Searching Recent Tweets: Using the query extracted for a given article, this method searches the database for matching tweets and uses them for feature computation.

2) Searching Past Stories: This method first clusters old articles into stories, then assigns the new article to its most similar cluster and uses the majority hashtag for that cluster as a recommendation.

3) Searching Similar Articles: This last approach searches the database of past articles for articles similar to the new article and uses their tweets for feature computation.

Searching Recent Tweets: This first approach uses the old tweets collected when streaming the previous news articles. Given a time-window w (e.g., the past 12h from the current time point), this method searches in the old articles' tweet-bags for tweets that match any of the keyphrases of the new article. These old tweets are then added to the tweet-bag of the new article. After collecting a sufficient number of relevant old tweets (at least 100), we extract features from the tweet-bag and apply our L2R model. This method helps solve the cold-start problem by computing features and recommending hashtags as soon as a new article is received, without waiting for streamed tweets.

Algorithm 2: ColdStart1: Searching Recent Tweets

Input: Query Q_a extracted from article a, database of past articles D.
Output: Relevant tweets T_a_search, relevant hashtags H_a_search.

Method:
  T_a_search = {}
  w = 60 * 12                        // time window for past tweets: 12h
  T_w = SubsetTweets(D, w)           // retrieve all tweets within time window w
  for tweet in T_w do
      for phrase in Q_a do
          if phrase in tweet then    // tweet contains the keyphrase
              T_a_search = T_a_search ∪ {tweet}
              break
  Return T_a_search, {}
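Algorithm 2 can be sketched in Python as follows. The tweet store, timestamps and query are mock data standing in for the system's tweet database; substring matching on lower-cased text is a simplification of the real keyphrase matching.

```python
from datetime import datetime, timedelta

# Sketch of ColdStart1 (Searching Recent Tweets) over a mock tweet store.

def search_recent_tweets(query, tweets, now, window_hours=12):
    """Return past tweets within the window that contain any keyphrase."""
    cutoff = now - timedelta(hours=window_hours)
    hits = []
    for posted_at, text in tweets:
        if posted_at < cutoff:
            continue                      # outside the search time-window
        if any(phrase in text.lower() for phrase in query):
            hits.append(text)             # first matching phrase suffices
    return hits

now = datetime(2016, 11, 9, 12, 0)
tweets = [
    (datetime(2016, 11, 9, 8, 30), "Easyjet doubles its female pilots #aviation"),
    (datetime(2016, 11, 8, 9, 0),  "amy johnson flight anniversary"),  # outside 12h
    (datetime(2016, 11, 9, 11, 0), "Ryanair cuts winter routes"),
]
query = ["easyjet", "amy johnson"]
print(search_recent_tweets(query, tweets, now))
```

Widening `window_hours` pulls in older tweets, mirroring the time-window trade-off studied in Table 6.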

Page 23: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Query Generation

23


TABLE 3: P@1, article coverage, and running time of end-to-end hashtag recommendation using tweets collected with four query generation methods.

            L2R (POS + Tf.idf)  L2R (POS + NER + Tf.idf)  L2R (AlchemyAPI)  L2R (URL)
  P@1       0.930               0.947                     0.901             0.410
  Coverage  67.3%               63.3%                     71.3%             22.1%
  Time      301s                200s                      588s              48s

TABLE 4: Average cosine similarity, number of tweets, number of candidate hashtags and hashtag frequency using tweets collected with four query generation methods.

            POS + Tf.idf  POS + NER + Tf.idf  AlchemyAPI  URL
  Cosine    0.221         0.242               0.246       0.265
  Tweets    3696.2        2982.9              5083.8      4.2
  Hashtags  529           442                 976         1.5
  Tag Freq  5.26          5.73                5.81        1.49

TABLE 5: Comparing the P@1, NDCG@3 and running time of 16 ranking methods using RankLib, sklearn and Cornell's RankSVM.

  Type       L2R Algorithm                   P@1    NDCG@3  Time(s)
  Pointwise  RandomForest (sklearn)          0.852  0.848   2.75
             MultilayerPerceptron (sklearn)  0.835  0.803   6.14
             SVM(poly) (sklearn)             0.823  0.827   0.78
             GradientBoosting (sklearn)      0.810  0.817   1.71
             LinearRegression (sklearn)      0.803  0.824   0.16
             AdaBoost (sklearn)              0.801  0.840   1.51
             RandomForest (ranklib)          0.792  0.778   2.01
             MART (ranklib)                  0.783  0.768   49.87
             GaussianNaiveBayes (sklearn)    0.764  0.757   0.05
  Pairwise   RankBoost (ranklib)             0.774  0.773   15.67
             RankSVM (cornell)               0.728  0.734   2.05
             RankNet (ranklib)               0.654  0.718   7.45
  Listwise   CoordinateAscent (ranklib)      0.778  0.765   28.11
             LambdaMART (ranklib)            0.769  0.766   54.48
             ListNet (ranklib)               0.751  0.756   14.56
             AdaRank (ranklib)               0.737  0.749   2.53

Compared to the pairwise and listwise ranking algorithms, pointwise methods have higher P@1 and NDCG@3 (around 0.8) and shorter running time (around 2s). Among the 9 pointwise methods, RandomForest (sklearn) has the highest P@1 (0.8526) and NDCG@3 (0.8488) and acceptable training time (2.75s). MultilayerPerceptron, AdaBoost and LinearRegression have the second-best P@1, NDCG@3 and running time. We tested two implementations of Random Forest (ranklib vs. sklearn) and found that the sklearn implementation has better ranking performance, which is similar to prior findings⁹.

For the pairwise methods, RankBoost does best in ranking, but has the longest running time compared to RankNet and RankSVM (15.67s, 7.45s, 2.05s). Among the four listwise methods, AdaRank is the fastest (2.53s) while LambdaMART is the slowest (54.48s). CoordinateAscent is the best listwise L2R method, with P@1 of 0.77. Across the three types of algorithms, the pointwise RandomForest (sklearn) has the highest P@1 and NDCG@3, with good running time. In the follow-up experiments, we use pointwise RandomForest for the relevance classifier.

Key Take-away. The common understanding in the IR community is that listwise approaches typically outperform pairwise and pointwise approaches. We have found that this is not the case in our setting. We work with binary relevance labels and high class imbalance (the number of relevant hashtags for each article is low, in the range of 1-3 hashtags). In this setting, we find that pointwise approaches have clear advantages regarding both ranking quality (higher precision) and efficiency (no need to pair hundreds of hashtags). Our finding is similar to work in [33] on question answering tasks with binary relevance judgements (answers are either relevant or irrelevant), where there are very few relevant answers (one correct answer for a given question, out of hundreds of candidate answers). The authors in [33] found that, for this type of ranking problem, pointwise methods outperform pairwise and listwise ones. We have similar findings from our extensive experiments with 16 L2R approaches.

9. sourceforge.net/p/lemur/discussion/ranklib/thread/1d30431d/

TABLE 6: Time-window size: Precision@1, article coverage, and total running time of hashtag recommendation using recent tweets.

            1h     4h     12h     24h     36h
  Tweets    49.8   258.1  1179.1  2247.4  3221.1
  P@1       0.820  0.801  0.824   0.883   0.850
  Coverage  26.0%  41.3%  48.0%   57.3%   60.0%
  Time      71s    132s   261s    456s    639s
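For reference, the two evaluation measures reported above, P@1 and NDCG@3 over binary relevance labels, can be computed as in this small sketch (the label list below is illustrative):

```python
import math

# Sketch of the evaluation metrics over binary relevance labels.

def precision_at_1(ranked_relevance):
    """1.0 if the top-ranked hashtag is relevant, else 0.0."""
    return 1.0 if ranked_relevance and ranked_relevance[0] else 0.0

def ndcg_at_k(ranked_relevance, k=3):
    """DCG@k over binary labels, normalised by the ideal ordering."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_relevance, reverse=True))
    return dcg(ranked_relevance) / ideal if ideal > 0 else 0.0

# One test article: ranked candidates labelled relevant(1)/irrelevant(0).
labels = [1, 0, 1, 0]
print(precision_at_1(labels), round(ndcg_at_k(labels, k=3), 3))
```

With only 1-3 relevant hashtags per article, NDCG@3 mostly measures whether those few relevant tags reach the top of the ranking, which is why it tracks P@1 closely in Table 5.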

4.4 Cold-Start Evaluation

To evaluate and compare the proposed cold-start approaches we gathered 150 test articles published between 8am-12pm Nov 09, 2016. We also retrieve historical tweets and articles stored in the system within a set time-window size (e.g., from 1 hour to 3 months).

4.4.1 Searching Recent Tweets

The first approach reuses the tweets collected for older articles.

Parameter Tuning: Time-window Size. The length of the search time-window is key for this method, as it decides the amount of historical tweets that will be used and the time the method takes before making a recommendation. For each of the 150 test articles, we retrieve tweets using different time-window sizes: 1h, 4h, 12h, 24h, 36h ahead of the publishing time of the test articles. There are about 500K historical tweets with hashtags collected between 8pm Nov 07 and 12pm Nov 09, 2016 (around 36h before the articles' publishing time). Then, we apply the hashtag recommendation approach over the gathered tweets and measure the recommendation quality when using different time-window sizes, by P@1, article coverage, total running time (searching time plus recommendation time), and the average number of tweets retrieved for the 150 test articles.

Results. As shown in Table 6, longer search time-windows allow retrieving more relevant tweets and thus lead to better recommendation coverage. Unfortunately, the running time also increases with the window size (71s for 1h vs. 639s for 36h). Using different time-windows has a small impact on P@1, which stays around 0.85. This shows the robustness of the hashtag recommendation approach, regardless of the number of tweets. Considering the article coverage and running time trade-offs, we chose 12h as the search time-window in the follow-up experiments.

4.4.2 Searching Past Stories

The second method uses k-means clustering and past recommendations.

Parameter Tuning: Number of Clusters. The performance of this method depends on two factors: the number of clusters and the size of the time-window for including historical articles. We test the number of clusters and conduct an end-to-end study to measure the final hashtag recommendation quality. Given the total number of articles N, we vary the number of clusters c in the range sqrt(N) and 0.1N to 0.9N. The value sqrt(N) is a common way to set the number of clusters, while the other values use a


Precision@1, article coverage and running time for hashtag recommendation

Page 24: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Comparing L2R algorithms

24

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. , NO. 9

TABLE 3P@1, article coverage, and running time of end-to-end hashtag recommendation using tweets collected using four query generation methods.

L2R (POS + Tf.idf) L2R (POS + NER + Tf.idf) L2R (AlchemyAPI) L2R (URL)P@1 0.930 0.947 0.901 0.410Coverage 67.3% 63.3% 71.3% 22.1%Time 301s 200s 588s 48s

TABLE 4Average cosine similarity, number of tweets, number of candidatehashtags and hashtag frequency using tweets collected using four

query generation methods.

POS +Tf.idf

POS + NER +Tf.idf

AlchemyAPI URL

Cosine 0.221 0.242 0.246 0.265Tweets 3696.2 2982.9 5083.8 4.2Hashtags 529 442 976 1.5Tag Freq 5.26 5.73 5.81 1.49

TABLE 5Comparing the P@1, NDCG@3 and running time of 16 ranking

methods using Ranklib, sklearn and Cornell’s RankSVM.

L2R Algorithm P@1 NDCG@3 Time(s)

Pointwise

RandomForest(sklearn) 0.852 0.848 2.75MultilayerPerceptron(sklearn) 0.835 0.803 6.14SVM(poly)(sklearn) 0.823 0.827 0.78GradientBoosting(sklearn) 0.810 0.817 1.71LinearRegression(sklearn) 0.803 0.824 0.16AdaBoost(sklearn) 0.801 0.840 1.51RandomForest(ranklib) 0.792 0.778 2.01MART(ranklib) 0.783 0.768 49.87GaussianNaiveBayes(sklearn) 0.764 0.757 0.05

PairwiseRankBoost(ranklib) 0.774 0.773 15.67RankSVM(cornell) 0.728 0.734 2.05RankNet(ranklib) 0.654 0.718 7.45

ListwiseCoordinateAscent(ranklib) 0.778 0.765 28.11LambdaMART(ranklib) 0.769 0.766 54.48ListNet(ranklib) 0.751 0.756 14.56AdaRank(ranklib) 0.737 0.749 2.53

listwise ranking algorithms, pointwise methods have higherP@1 and NDCG@3 (around 0.8), and shorter running time(around 2s). Among the 9 pointwise methods, Random-Forest(sklearn) has the highest P@1 (0.8526) and NDCG@3(0.8488) and acceptable training time (2.75s). MultilayerPreceptron, AdaBoost and Linear Regression have the sec-ond best P@1, NDCG@3 and running time. We tested twoimplementations of Random Forest (ranklib vs sklearn) andfound that the sklearn implementation has better rankingperformance, which is similar to prior findings9.

For the pairwise methods, RankBoost does the best inranking, but has the longest running time compared toRankNet, and RankSVM (15.67s, 7.45s, 2.05s). Among thefour listwise methods, AdaRank is the fastest (2.53s) whileLambdaMART is the slowest (54.48s). CoordinateAscent isthe best listwise L2R method, with P@1 of 0.77. Acrossthe three types of algorithms, the pointwise RandomForest(sklearn) has the highest P@1 and NDCG@3, with good run-ning time. In the followup experiments, we use pointwiseRandomForest for the relevance classifier.

Key Take-away. The common understanding in the IRcommunity is that listwise approaches typically outperformpairwise and pointwise approaches. We have found that thisis not the case in our setting. We work with binary relevancelabels and high class imbalance (the number of relevanthashtags for each article is low, in the range of 1-3 hashtags).In this setting, we find that pointwise approaches have clearadvantages regarding both ranking quality (higher preci-sion) and efficiency (no need to pair hundreds of hashtags).Our finding is similar to work in [33] on question answeringtasks with binary relevance judgements (answers are eitherrelevant or irrelevant), where there are very few relevantanswers (one correct answer for a given question, out ofhundreds of candidate answers). The authors in [33] found

9. sourceforge.net/p/lemur/discussion/ranklib/thread/1d30431d/

TABLE 6Time-window Size: Precision@1, Article Coverage, and total running

time of the hashtag recommendation using recent tweets.

1h 4h 12h 24h 36hTweets 49.8 258.1 1179.1 2247.4 3221.1P@1 0.820 0.801 0.824 0.883 0.850Coverage 26.0% 41.3% 48.0% 57.3% 60.0%Time 71s 132s 261s 456s 639s

that for this type of ranking problem, pointwise methods outperform pairwise and listwise ones. We have similar findings from our extensive experiments with 16 L2R approaches.
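As a rough illustration of the pointwise setup (not the authors' implementation; the random feature vectors below are stand-ins for the engineered article-hashtag features), each candidate hashtag is scored independently by a relevance classifier and candidates are ranked by their predicted probability of relevance:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Training pairs: each row is a feature vector for one (article, hashtag)
# candidate; label 1 = relevant, 0 = irrelevant (highly imbalanced in practice).
X_train = rng.random((200, 5))
y_train = (X_train[:, 0] + X_train[:, 1] > 1.4).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# At recommendation time, score every candidate hashtag for one article
# independently and rank by the predicted probability of relevance.
candidates = rng.random((10, 5))
scores = clf.predict_proba(candidates)[:, 1]
ranking = np.argsort(scores)[::-1]   # indices of candidates, best first
```

This is the key efficiency advantage over pairwise methods: each of the (possibly hundreds of) candidates is scored once, rather than compared against every other candidate.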

4.4 Cold-Start Evaluation

To evaluate and compare the proposed cold-start approaches we gathered 150 test articles published between 8am-12pm Nov 09, 2016. We also retrieve historical tweets and articles stored in the system within a set time-window size (e.g., from 1 hour to 3 months).

4.4.1 Searching Recent Tweets

The first approach reuses the tweets collected for older articles.

Parameter Tuning: Time-window Size. The length of the search time-window is key for this method, as it determines how many historical tweets will be used and how long the method takes before making a recommendation. For each of the 150 test articles, we retrieve tweets using different time-window sizes: 1h, 4h, 12h, 24h, 36h ahead of the publishing time of the test articles. There are about 500K historical tweets with hashtags collected between 8pm Nov 07 and 12pm Nov 09, 2016 (around 36h before the articles' publishing time). Then, we apply the hashtag recommendation approach over the gathered tweets and measure the recommendation quality for the different time-window sizes, by P@1, article coverage, total running time (searching time plus recommendation time), and the average number of tweets retrieved for the 150 test articles.

Results. As shown in Table 6, longer search time-windows allow retrieving more relevant tweets and thus lead to better recommendation coverage. Unfortunately, the running time also increases with the window size (71s for 1h vs 639s for 36h). The choice of time-window has a small impact on P@1, which stays around 0.85. This shows the robustness of the hashtag recommendation approach, regardless of the number of tweets. Considering the article coverage and running time trade-offs, we chose 12h as the search time-window in the follow-up experiments.
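The windowed retrieval step can be sketched as a simple filter on tweet timestamps; the dictionary field names here are illustrative, not the system's actual schema.

```python
from datetime import datetime, timedelta

def tweets_in_window(tweets, published_at, window_hours=12):
    """Keep only tweets posted within `window_hours` before the article."""
    start = published_at - timedelta(hours=window_hours)
    return [t for t in tweets if start <= t["created_at"] <= published_at]

pub = datetime(2016, 11, 9, 10, 0)
tweets = [
    {"text": "a", "created_at": datetime(2016, 11, 9, 3, 0)},   # inside 12h
    {"text": "b", "created_at": datetime(2016, 11, 8, 9, 0)},   # outside 12h
]
recent = tweets_in_window(tweets, pub, window_hours=12)  # keeps only "a"
```

Widening the window (e.g. to 36h) admits both tweets, which mirrors the coverage-vs-running-time trade-off in Table 6: more retrieved tweets, more candidates, more time.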

4.4.2 Searching Past Stories

The second method uses k-means clustering and past recommendations.
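A minimal sketch of this idea, with an illustrative toy corpus: past articles are clustered on their TF-IDF vectors, and a fresh article is assigned to its nearest cluster so that hashtags previously recommended for that cluster's articles can be reused.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Illustrative stand-in for the stored historical articles.
past_articles = [
    "hong kong protests continue in city centre",
    "students vow stronger protests over democracy",
    "scotland independence referendum polls tighten",
    "scottish independence campaign gathers steam",
]
vec = TfidfVectorizer()
X = vec.fit_transform(past_articles)

# sqrt(N) is the common default mentioned in the text.
n_clusters = max(1, int(math.sqrt(len(past_articles))))
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)

# A fresh article is mapped to its closest cluster; the system can then
# reuse the hashtags already recommended for articles in that cluster.
new_article = vec.transform(["thousands join hong kong protest"])
cluster_id = int(km.predict(new_article)[0])
```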

Parameter Tuning: Number of Clusters. The performance of this method depends on two factors: the number of clusters and the size of the time-window for including historical articles. We test the number of clusters and conduct an end-to-end study to measure the final hashtag recommendation quality. Given the total number of articles N, we vary the number of clusters c in the range sqrt(N) and 0.1N to 0.9N. The value sqrt(N) is a common way to set the number of clusters, while the other values use a

Page 25: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

•Multi-class Classification Methods:

• Use hashtagged tweets as labeled data (hashtag = class)

• Need to wait to collect enough training data (tweet history size, e.g., 2h or 4h of past tweets)

• Need to be retrained often to keep up with changes in tweet vocabulary and emerging/dying hashtags (the retraining time, i.e., the time required to train the model, decides how often we can re-train)

• Naive Bayes, Liblinear SVM, Neural Net

• L2R Methods:

• Trained once with hashtagged tweets or manually labeled (article, hashtag) examples

• Pairwise L2R and Pointwise L2R (Hashtagger+)

25

Comparing to State-of-the-Art

Page 26: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Comparing to State-of-the-Art

26

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. , NO. 12

[Figure 7 plot: Precision@1 vs. Article Coverage, All Articles. Hashtagger+ (search) reaches 66% coverage at P@1 = 0.94 (th = 0.5) and 77% coverage at P@1 = 0.89 (th = 0.3). Methods plotted: Hashtagger+ (search), Hashtagger (stream), PairwiseL2R, Liblinear (2h/30min), Naive Bayes (4h/5min), MultilayerPerc (1h/1h).]

Fig. 7. P@1 and article coverage of the SOTA methods compared.

the most similar 50 tweets and 50 articles from 4h of tweets posted ahead of the given article, and apply the trained model on the resulting candidate hashtags. The time window size is constrained by the time needed to find similar tweets and articles (4h of tweets require 5min). We compute 4 binary features for each candidate hashtag: (1) it appears in the article headline, (2) it is a popular URL domain (e.g., bbc/rte), (3) it matches the domain of the article URL, (4) it is a popular hashtag (i.e., among the most frequent 100 hashtags).
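These four binary features can be sketched as below; the popular-domain and popular-hashtag sets are illustrative stand-ins for the actual lists (e.g. the real top-100 hashtags), and the domain parsing is deliberately simplistic.

```python
from urllib.parse import urlparse

POPULAR_DOMAINS = {"bbc", "rte"}                 # illustrative stand-in
POPULAR_HASHTAGS = {"news", "breaking", "sport"}  # stand-in for the top-100 list

def hashtag_features(tag, headline, article_url):
    """Four binary features for one candidate hashtag of one article."""
    tag_l = tag.lower().lstrip("#")
    # Naive second-level-domain extraction, e.g. "www.bbc.com" -> "bbc".
    domain = urlparse(article_url).netloc.split(".")[-2]
    return [
        int(tag_l in headline.lower()),   # (1) appears in the headline
        int(tag_l in POPULAR_DOMAINS),    # (2) is a popular news domain
        int(tag_l == domain),             # (3) matches the article's domain
        int(tag_l in POPULAR_HASHTAGS),   # (4) is a globally popular hashtag
    ]

feats = hashtag_features("#bbc", "Scotland votes", "http://www.bbc.com/news/1")
# -> [0, 1, 1, 0]
```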

2) Hashtagger: This is the pointwise L2R method presented in [6], which is trained on manually labeled article-hashtag pairs and uses only streaming for data collection. It therefore needs to wait to gather enough tweets before providing recommendations.

3) Hashtagger+: The method proposed in this paper uses pointwise L2R trained on manually labeled article-hashtag pairs and cold-start search to deliver recommendations fast.

As evaluation metrics we use the article coverage and P@1 (by recommending each method's maximum-score prediction). Both metrics are a function of the recommendation score. By setting a threshold on the recommendation score, we can study the sensitivity of each method regarding both coverage and P@1. Intuitively, setting higher thresholds for selecting a recommendation should result in better P@1, but less coverage, since recommendations that do not pass the threshold are not selected.
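The threshold sweep behind these curves can be sketched as follows; the scores and relevance labels are toy values, not the experiment's data.

```python
def coverage_precision_curve(scores, relevant):
    """For each unique score threshold, compute (threshold, coverage, P@1).

    coverage: fraction of articles whose top score passes the threshold.
    P@1:      fraction of those surviving recommendations that are relevant.
    """
    points = []
    for th in sorted(set(scores)):
        kept = [(s, r) for s, r in zip(scores, relevant) if s >= th]
        coverage = len(kept) / len(scores)
        p_at_1 = sum(r for _, r in kept) / len(kept)
        points.append((th, coverage, p_at_1))
    return points

scores = [0.9, 0.8, 0.6, 0.4, 0.2]   # top-ranked hashtag score per article
relevant = [1, 1, 1, 0, 0]           # annotator judgement of that hashtag
curve = coverage_precision_curve(scores, relevant)
# Raising the threshold trades coverage (1.0 -> 0.2) for precision (0.6 -> 1.0).
```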

Experiment Setup. We randomly pick a starting time point t0 (Nov 18th 8am 2016, UTC), then run the experiment for 6h, involving 150 articles and 400k tweets that have at least one hashtag. Each pseudo article (headline, sub-headline and first sentence) is considered as a rich tweet, and each method recommends one hashtag to each article.

Experiment Result. We asked a group of annotators to evaluate the resulting 900 test article-hashtag pairs as relevant/irrelevant and averaged their results. For each method, we change the threshold to each unique predicted value in increasing order, and record the article coverage rate and the P@1 at that threshold, as shown in Figure 7. We note that L2R methods deliver better P@1 and coverage than content-based methods. We also find that our Hashtagger+ solution leads to the highest P@1 and coverage, improving the SOTA in this area. For example, in [6] we report a P@1 of 0.89 (for th = 0.5) and coverage of 66%. In comparison, Hashtagger+ delivers P@1 of 0.94 (for th = 0.5) and coverage of 66%. If we set the threshold for Hashtagger+ to get the same P@1 of 0.89 as in [6], we achieve a coverage of 77%, a 10% improvement in coverage.

Regarding efficiency, Hashtagger+ delivers recommendations in under 1min, while Hashtagger needs to wait an average of 80mins to deliver the first recommendations. PairwiseL2R, a model similar to [16], performs similarly to the best content-based method, Liblinear, achieving around 0.5 P@1 at 100% coverage. In comparison, Hashtagger+ has more than 0.7 P@1 at 100% coverage, vastly outperforming all compared methods.

[Figure 8 plots: Precision@1 vs. Article Coverage. Popular Articles panel: Hashtagger+ (search) reaches 80% coverage at P@1 = 0.94 (th = 0.5) and 87% coverage at P@1 = 0.89 (th = 0.3). Niche Articles panel: 45% coverage at P@1 = 0.94 (th = 0.5) and 58% coverage at P@1 = 0.89 (th = 0.3). Methods plotted as in Figure 7.]

Fig. 8. P@1 and article coverage for popular versus niche articles.

4.5.1 Niche versus Popular Articles

In this experiment we analyze all methods with regard to their capacity to provide good recommendations for articles that are popular versus articles that are more niche. We define a popular article as one for which we can retrieve at least 300 matching tweets, resulting in 50 niche and 100 popular articles. We want to test the hypothesis that content-based methods are not able to deal well with niche articles, as they do not have enough content to train from. We expect L2R methods to be more robust with less training data. Figure 8 shows the SOTA comparison for the two settings. We note that for popular articles, content-based methods behave well, although their quality is well below that of L2R. For niche articles the advantage of L2R methods is even clearer. For a coverage of 70% of niche articles, the content-based methods deliver a low P@1 of 0.5, while L2R methods have a P@1 of 0.8.

Key Take-away: L2R methods outperform content-based methods for real-time hashtag recommendation. The proposed model Hashtagger+ clearly advances the SOTA by efficiently delivering high-precision, high-coverage recommendations (under 1min, P@1 = 0.89, coverage = 77%).

Page 27: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

27

Comparing to State-of-the-Art: Popular vs Niche Articles


Page 28: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Applications

•Hashtagger+ is deployed in a Web application

(http://insight4news.ucd.ie)

•Using the recommended hashtags:

• News Publishing on Twitter

• Story Detection & Tracking

28

Page 29: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Live Tweeting with Hashtagger+ https://twitter.com/Insight4News3

29

Page 30: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim


[Bar charts (Twitter Analytics): Sum of Impressions, Sum of Engagements, and Sum of URL Clicks, compared across the three groups: No Hashtag, #News, Hashtagger.]

Twitter account (@insight4news3) automatically tweets article headlines.

Randomly allocate articles into 3 groups:

No Hashtag: Article Headline + URL

#News: Article Headline + URL + #News

Hashtagger: Article Headline + URL + Recommended Hashtags

Twitter Analytics Stats

30

Page 31: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Story Detection: http://ani.ucd.ie/plot_news_patterns.html

31

Page 32: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Story Tracking with Social Tags:

32

Page 33: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Story Tracking with Social Tags

33

Page 34: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim


Conclusion


•Hashtagger+: a framework for real-time hashtag recommendation to news.

•L2R model trained with human-labeled data can address efficiency & precision challenges.

•By merging news and social media we can address difficult problems: story & entity detection/visualization/tracking/disambiguation/linking.

34

Page 35: Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim

Thank you!

References

• Hashtagger+: Efficient High-Coverage Social Tagging of Streaming News, B. Shi, G. Poghosyan, G. Ifrim, N. Hurley [2017, under review]

• Learning-to-Rank for Real-Time High-Precision Hashtag Recommendation for Streaming News, B. Shi, G. Ifrim, N. Hurley [WWW 2016]

• Real-time News Story Detection and Tracking with Hashtags, G. Poghosyan, G. Ifrim [CNewsStory 2016]

• Topy: Real-time Story Tracking via Social Tags, G. Poghosyan, A. Qureshi, G. Ifrim [ECML/PKDD 2016]

• Insight4News: Connecting News to Relevant Social Conversations, B. Shi, G. Ifrim, N. Hurley [ECML/PKDD 2014]

35