1/19
Finding Missing Tweets using Topic Structure and Browsing Time
Yu Suzuki†, Hiromitsu Ohara‡, Akiyo Nadamoto‡
† Nara Institute of Science and Technology, Japan
‡ Konan University, Japan
5. December, 2017
2/19
Introduction
Social network services (SNSs) carry massive volumes of messages.
Users are not always online.
Users miss important information on SNSs.
cf. Twitter's “While you were away” function: the structure of its summary is flat.
Users need to understand, in a short time, the topics that came up while they were offline.
A mechanism for summarizing the tweets is useful.
We believe that users can easily understand the topics when the tweets are summarized as a tree structure.
Summarize Tweets Using Topic Structure and Browsing Time
3/19
Why do we consider topic structure?
missing tweet                            topic      subtopic
Today’s baseball game is exciting!       baseball   game
Yesterday I went to baseball stadium     baseball   place
I’m at Salzburg!                         travel     austria
I’m at baseball stadium!                 baseball   place
· · ·                                    · · ·      · · ·
Tweets with minority topics are ignored if we simply summarize the missing tweets.
The missing tweets are mainly related to “baseball.”
Only one tweet is related to “travel.”
If these tweets are summarized without using topics, the tweet about travel may not appear in the summary.
We visualize the topics of the tweets as a tree structure.
First, the users see the top-level topics, such as “baseball” and “travel.”
If the users are interested in “baseball,” they browse “game” and “place.”
Users do not miss the tweet about travel.
How do we construct the topic structure?
4/19
Our contribution
1 Generate topic structures of tweets using the Wikipedia category tree and browsing time
We use the Wikipedia categories as knowledge for constructing the tree structure.
We use browsing time to identify the tweets that users miss.
2 Visualize the topic structure of tweets using a network graph
We implement our method as a Web application.
3 Confirm, using a real dataset, that our proposed method is effective for commonly known topics
Our method is effective if there is much information about the theme.
Wikipedia only has articles about commonly known topics.
5/19
Our Proposed Method
Overview
Figure: Overview of the proposed method.
1. Clustering of Tweets: tweets are grouped into clusters (C0 = Ichiro, C1 = Masahiro); C3 = Human is deleted because it is too wide to cover topics.
2. Generate a Topic Graph: the clusters are mapped onto the Wikipedia category tree (Sports, Baseball, Basketball, MLB, Mariners, Players, Japan). Topic node: a parent node of tweet clusters. Abstract node: a parent node of topic nodes.
3. Visualization: the topic graph (Ichiro and Masahiro under Japanese MLB Player) is shown together with the list of tweets about each topic.
6/19
Steps
Overview
1 Extract missing tweets: Determine which tweets were submitted during the user’s browsing time and which were submitted before and after it.
2 Cluster tweets into categories and extract topics: Using Repeated Bisection as the clustering tool, we divide a set of tweets into clusters and extract the topics of each cluster.
3 Generate a topic graph: Using the topics of the tweets and the Wikipedia category tree, we generate a topic graph of the tweets.
4 Classify topics: Classify the topics, which are the nodes of the topic graph, as known topics or unknown topics.
5 Visualize the topic graph: We visualize the topic graph and the corresponding tweets using our implemented Web user interface.
7/19
0. Extraction of missing tweets
We extract the tweets that users have not browsed.
We assume that the browsing time is given.
Browsing time can be made available if we build a Twitter client application.
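A minimal sketch of this extraction step, assuming the browsing time is given as a list of (start, end) sessions and each tweet carries a `created_at` timestamp. The function name, the dictionary layout, and the rule "a tweet is missing if it was posted outside every session" are illustrative assumptions, not details stated on the slides.

```python
from datetime import datetime

def extract_missing_tweets(tweets, sessions):
    """Return the tweets posted outside every browsing session.

    Assumption: a tweet counts as browsed only if its timestamp falls
    inside one of the given (start, end) sessions.
    """
    def browsed(timestamp):
        return any(start <= timestamp <= end for start, end in sessions)

    return [t for t in tweets if not browsed(t["created_at"])]

# Hypothetical input: one 30-minute browsing session and three tweets.
sessions = [(datetime(2017, 12, 5, 9, 0), datetime(2017, 12, 5, 9, 30))]
tweets = [
    {"id": 1, "created_at": datetime(2017, 12, 5, 8, 0)},
    {"id": 2, "created_at": datetime(2017, 12, 5, 9, 15)},
    {"id": 3, "created_at": datetime(2017, 12, 5, 12, 0)},
]
missing = extract_missing_tweets(tweets, sessions)
missing_ids = [t["id"] for t in missing]
```

Tweet 2 falls inside the session, so only tweets 1 and 3 are passed on to the clustering step.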
8/19
1. Clustering tweets
Figure: 1. Clustering of Tweets. The tweets are divided into clusters C0 = Ichiro and C1 = Masahiro; cluster C3 = Human is deleted because it is too wide to cover topics.
We use repeated bisection for clustering tweets.
In our experiment, repeated bisection was the most effective method for clustering short texts.
It is similar to k-means.
We remove noise clusters.
We calculate the cosine similarity between every pair of texts in a cluster.
We remove the nodes if the similarity is beyond the threshold.
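The noise-removal check can be sketched as follows. The slide does not say how the pairwise similarities are aggregated or which side of the threshold marks a cluster as noise, so this sketch assumes the mean pairwise cosine similarity is used and that a cluster is dropped when the mean falls below the threshold. All names and the sample vectors are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mean_pairwise_similarity(vectors):
    """Average cosine similarity over every pair of texts in a cluster."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # assumption: a single-text cluster is treated as noise
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def remove_noise_clusters(clusters, threshold):
    """Keep clusters whose texts are similar enough to each other.

    Assumption: low internal similarity marks a noise cluster.
    """
    return {name: vectors for name, vectors in clusters.items()
            if mean_pairwise_similarity(vectors) >= threshold}

# Hypothetical clusters with term-weight vectors.
clusters = {
    "ichiro": [{"ichiro": 1, "marlins": 1}, {"ichiro": 1, "hit": 1}],
    "human": [{"lunch": 1}, {"rain": 1}],
}
kept = remove_noise_clusters(clusters, threshold=0.3)
```

If the intended rule is the opposite direction, only the comparison in `remove_noise_clusters` needs to change.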
9/19
Repeated bisection
Given a set of tweets T, we extract a feature vector for each tweet. First, we divide a tweet into terms using morphological analysis or a POS tagger. Then, we select nouns and unknown terms as feature terms. We use unknown terms because they consist of slang and newly invented words that the morphological analysis does not recognize. To clean the feature terms, we select the terms that are included in more than two tweets. The feature vector f(t_i) of tweet t_i (t_i ∈ T) is defined as follows.
\[ f(t_i) = [\, tf(t_i, w_1) \cdot idf(w_1),\ tf(t_i, w_2) \cdot idf(w_2),\ \cdots,\ tf(t_i, w_m) \cdot idf(w_m) \,] \tag{1} \]
\[ tf(t_i, w_j) = \begin{cases} 1 & \text{if } w_j \text{ appears in } t_i \text{ at least once} \\ 0 & \text{otherwise} \end{cases} \tag{2} \]
\[ idf(w_j) = -\log \frac{df(w_j)}{|T|} \tag{3} \]
where w_j is a term in T, |T| is the number of tweets in T, tf(t_i, w_j) indicates whether w_j appears in t_i or not, df(w_j) is the number of tweets that contain w_j, and idf(w_j) is the IDF (Inverse Document Frequency) value of w_j, where a document is a tweet.
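Equations (1) to (3) can be sketched in a few lines. The sketch assumes the tweets are already tokenized into feature terms (nouns and unknown terms); the function name and the `min_df` parameter are illustrative, with the default of 3 standing in for "included in more than two tweets."

```python
import math

def feature_vectors(tweets, min_df=3):
    """Build binary-tf * idf vectors following equations (1)-(3).

    tweets: list of lists of feature terms, one list per tweet.
    min_df: keep a term only if at least this many tweets contain it.
    """
    n = len(tweets)
    term_sets = [set(terms) for terms in tweets]

    # df(w): number of tweets that contain term w.
    df = {}
    for terms in term_sets:
        for w in terms:
            df[w] = df.get(w, 0) + 1

    vocab = sorted(w for w, count in df.items() if count >= min_df)
    idf = {w: -math.log(df[w] / n) for w in vocab}           # equation (3)

    vectors = [
        [(1 if w in terms else 0) * idf[w] for w in vocab]   # equations (1), (2)
        for terms in term_sets
    ]
    return vocab, vectors

# Hypothetical tokenized tweets; min_df is lowered because the sample is tiny.
sample = [
    ["baseball", "game"],
    ["baseball", "stadium"],
    ["salzburg"],
    ["baseball", "stadium"],
]
vocab, vectors = feature_vectors(sample, min_df=2)
```

Note that with this definition a term that appears in every tweet gets an IDF of zero and so contributes nothing to any vector.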
10/19
2. Generate a topic graph
Figure: 2. Generate a Topic Graph. The tweet clusters Ichiro and Masahiro correspond to categories such as MLB player, Sports, and Japanese. Topic node: a parent node of tweet clusters. Abstract node: a parent node of topic nodes.
1 Generate a topic node, which corresponds to a tweet cluster.
2 Generate semantic nodes, which correspond to a topic node.
3 Merge multiple nodes into a simple structure of nodes.
11/19
Generate a topic node
Topic node: A Wikipedia article that corresponds to a cluster.
Method
1 The repeated bisection method outputs keywords for each category, together with the degree of relatedness between each keyword and the category.
2 We retrieve articles in Wikipedia, and find the most relevant category.
Many categories have their own articles, so these categories are also candidates for topic nodes.
Example
A category has the keywords {(“baseball”, 1), (“player”, 0.5)}.
There are two Wikipedia articles, w_p and w_q:
The title of w_p is “baseball team” and the title of w_q is “baseball player.”
Calculate a score for each article:
w_p = 1 + 0 = 1, and w_q = 1 + 0.5 = 1.5.
We select the article w_q, “baseball player,” as the topic node.
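The worked example above can be reproduced with a small scoring function. This is a sketch that assumes an article's score is the sum of the weights of the cluster keywords found in its title, which matches the arithmetic in the example; the function names are hypothetical and the real matching against Wikipedia may differ.

```python
def score_article(title, keywords):
    """Sum the weights of the cluster keywords that occur in the title."""
    title_words = set(title.lower().split())
    return sum(weight for word, weight in keywords.items() if word in title_words)

def select_topic_node(titles, keywords):
    """Pick the article title with the highest keyword score."""
    return max(titles, key=lambda title: score_article(title, keywords))

# Keywords and degrees from the example on this slide.
keywords = {"baseball": 1.0, "player": 0.5}
titles = ["baseball team", "baseball player"]

score_p = score_article("baseball team", keywords)    # 1 + 0
score_q = score_article("baseball player", keywords)  # 1 + 0.5
topic_node = select_topic_node(titles, keywords)
```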
12/19
Generate a semantic node
Semantic node: A Wikipedia category that corresponds to the topic node.
Method
1 Get category names using Wikipedia.
2 Prune unsuitable categories from the semantic nodes using a blacklist.
Person born in 19xx, Stub, A list of xx, . . .
Example
Category c0 is tagged with “Ichiro Suzuki.”
The article “Ichiro Suzuki” has two categories, “Yankees Players” and “Baseball Players.”
“Yankees Players” and “Baseball Players” are considered semantic nodes.
Figure: Ichiro Suzuki is connected to Yankees Player and Baseball Player; Kenta Maeda is connected to Baseball Player and Dodgers Player.
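A minimal sketch of the blacklist pruning step, assuming the category names of an article have already been fetched from Wikipedia. The regular expressions are one possible encoding of the three blacklist examples on the slide, and the input list is made up for illustration.

```python
import re

# Assumed encodings of "Person born in 19xx", "Stub", and "A list of xx".
BLACKLIST_PATTERNS = [
    re.compile(r"\bborn in \d{4}\b", re.IGNORECASE),
    re.compile(r"\bstubs?\b", re.IGNORECASE),
    re.compile(r"^(a )?lists? of\b", re.IGNORECASE),
]

def semantic_nodes(categories):
    """Drop categories that match any blacklist pattern."""
    return [
        category for category in categories
        if not any(p.search(category) for p in BLACKLIST_PATTERNS)
    ]

# Hypothetical category list for the article "Ichiro Suzuki".
categories = [
    "Yankees Players",
    "Baseball Players",
    "Person born in 1973",
    "Baseball stubs",
    "A list of outfielders",
]
nodes = semantic_nodes(categories)
```

Only the two categories used in the slide's example survive the pruning.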
13/19
Merge Multiple Nodes
Figure: Example of two network graphs. Ichiro Suzuki is connected to Yankees Player and Baseball Player; Kenta Maeda is connected to Baseball Player and Dodgers Player.
Figure: Two nodes are merged if the two graphs share common nodes. Here, the two Baseball Player nodes become a single node shared by Ichiro Suzuki and Kenta Maeda.
Figure: If a leaf node and a non-leaf node correspond to the same article, these nodes are merged. Here, Sportspeople is added above the merged graph.
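The merge of the two example graphs can be sketched by keying nodes on their article or category title, so that nodes with the same title collapse into one. The child-to-parents dictionary layout and the function names are assumptions made for this sketch, not the authors' data structure.

```python
from collections import defaultdict

def merge_graphs(*graphs):
    """Merge child -> parents graphs; nodes with the same title become one node."""
    merged = defaultdict(set)
    for graph in graphs:
        for child, parents in graph.items():
            merged[child].update(parents)
    return dict(merged)

def children_of(graph, parent):
    """Return every node that has `parent` as one of its parents."""
    return {child for child, parents in graph.items() if parent in parents}

def all_nodes(graph):
    """Collect every node title that appears as a child or a parent."""
    nodes = set(graph)
    for parents in graph.values():
        nodes |= parents
    return nodes

# The two graphs from the figure.
ichiro = {"Ichiro Suzuki": {"Yankees Player", "Baseball Player"}}
maeda = {"Kenta Maeda": {"Baseball Player", "Dodgers Player"}}

merged = merge_graphs(ichiro, maeda)
shared = children_of(merged, "Baseball Player")
node_count = len(all_nodes(merged))
```

Because nodes are identified by title, a title that is a leaf in one graph and an inner node in another also ends up as a single node after merging.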
14/19
Visualize topic nodes and semantic nodes
Figure: 3. Visualization. The topic graph shows Ichiro and Masahiro under Japanese MLB Player, next to the tweet list for each topic node.
Tweet about Ichiro: Now three of the greatest hitters in Major League history in one dugout with the Marlins. Barry Bonds, Ichiro and Don Kelly. amazing.
Tweet about Masahiro: Joe Girardi discusses Masahiro Tanaka pitching on extended rest after Tuesday night's 9-0 victory.
15/19
Experiments
Experimental Setup
Aim of our experiment:
To confirm whether our method is effective.
To find which themes of tweets are appropriate for our proposed method.
Evaluation measure: precision ratio
We (the second author of our paper) manually selected an appropriate category for each tweet.
We calculate the precision ratio for each category.
\[ \text{precision} = \frac{\text{number of accurately categorized tweets}}{\text{number of tweets in the category}} \]
Dataset
Category: Politics, Music, Computer, Sports, and Animation/Games (five categories)
Tweets: We prepared 2,000 tweets for each category using the Twitter search API.
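The precision ratio defined on this slide can be computed per category as sketched below. The sketch assumes two parallel lists, the category assigned by the method and the manually selected category for each tweet; the names and the sample labels are hypothetical.

```python
from collections import Counter

def precision_per_category(assigned, gold):
    """precision = accurately categorized tweets / tweets in the category."""
    in_category = Counter(assigned)
    accurate = Counter(a for a, g in zip(assigned, gold) if a == g)
    return {category: accurate[category] / in_category[category]
            for category in in_category}

# Hypothetical labels for five tweets.
assigned = ["Sports", "Sports", "Music", "Sports", "Music"]
gold = ["Sports", "Music", "Music", "Sports", "Music"]
precision = precision_per_category(assigned, gold)
```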
16/19
Procedure of the experiment
1 Cluster the 2,000 tweets of each theme, and extract the topics of each cluster.
2 Generate the topic graph using our proposed method.
3 Give the clusters and their corresponding Wikipedia article titles to the observers.
Observers are hired using crowdsourcing (Crowdworks).
4 Observers evaluate whether the article titles are appropriate for representing the clusters on the following five-point scale (5: appropriate, 4: almost appropriate, 3: cannot say, 2: almost inappropriate, 1: inappropriate).
5 Summarize the observers’ evaluations, and analyze whether our proposed method has good accuracy.
17/19
Experimental Results
Figure: Histogram of evaluation scores. Horizontal axis: average of evaluation scores (bins 1.0-2.0, 2.0-3.0, 3.0-4.0, 4.0-5.0). Vertical axis: number of evaluation scores (0 to 25). Series: Politics, Music, Computer, Sports, Video Games.
Table: Numbers of evaluation scores for respective bins.
Theme               # obsv.   Prec.
Politics            8         0.72
Music               11        0.56
Computer            5         0.44
Sports              5         0.42
Animation & Games   4         0.52
Our method is useful for tweets about politics.
There are many technical terms about politics.
Many articles are on Wikipedia.
Our method is not effective for computer and sports.
There is a wide variety of topics.
Fewer articles are on Wikipedia.
18/19
Merging Multiple Topics
Figure: One example of a topic/semantic graph. Black nodes are topic nodes, and gray nodes are semantic nodes.
There are topic nodes about “Yakult” and “Softbank.”
Yakult: A manufacturer of drinks
Softbank: A cell phone carrier
Both companies are based in Tokyo.
We can connect the two nodes using our proposed method for merging the nodes of multiple topics.
19/19
Conclusion
We proposed a method for automatically extracting a user’s missing tweets based on topic granularity and the time during which the user was not browsing (the missing time).
We extract missing tweets based on the missing time.
We propose a method for mapping a set of extracted missing tweets to the Wikipedia category tree by considering the granularity of the topic structure.
We confirmed that our proposed method is effective for “politics,” but not effective for “computer” and “sports.”
Future Work
We should consider resources other than Wikipedia as a knowledge base.
Wikipedia is not always suitable for personal topics.
We should consider synonyms.
We should compare our method with other methods.
We should conduct a usability test of the Web user interface.