Tweet Summarizer
![Page 1: Tweet Summarizer](https://reader038.fdocuments.us/reader038/viewer/2022100517/554d4a09b4c9053c678b5430/html5/thumbnails/1.jpg)
Tweet Summarization
Sai Madhuri B, Srikanth K S
![Page 2: Tweet Summarizer](https://reader038.fdocuments.us/reader038/viewer/2022100517/554d4a09b4c9053c678b5430/html5/thumbnails/2.jpg)
Project Details
Project Name: Tweet Summarization
Problem definition: For a given keyword K and hashtag set H, an unprocessed stream of tweets R is processed to obtain P, the set of tweets returned by the co-occurrence tweet extractor. P consists of subsets p1, p2, ..., pn, which correspond to the tweets segregated into subtopic 1, subtopic 2, ..., subtopic n.
Dataset: Collected through the Twitter REST API
Gold Standard: Human evaluation
![Page 3: Tweet Summarizer](https://reader038.fdocuments.us/reader038/viewer/2022100517/554d4a09b4c9053c678b5430/html5/thumbnails/3.jpg)
Extraction
● Tweet text
● Time the tweet was created at
● Screen name of the user
● Follower count of the user
● Favorite count of the user
● Favorited flag of tweet
● Retweeted flag of tweet
● Retweet count of tweet
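As a minimal sketch, the fields listed above could be pulled out of one tweet as follows. It assumes the tweet arrives as a JSON object laid out like a Twitter REST API v1.1 status; the slides do not state the API version, and the function name `extract_fields` is hypothetical.

```python
from typing import Any, Dict


def extract_fields(status: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the fields listed on the Extraction slide.

    Assumption: `status` follows the Twitter REST API v1.1 tweet layout,
    where user-level values live under the nested "user" object.
    """
    user = status.get("user", {})
    return {
        "text": status.get("text"),
        "created_at": status.get("created_at"),
        "screen_name": user.get("screen_name"),
        "followers_count": user.get("followers_count"),
        "favourites_count": user.get("favourites_count"),
        "favorited": status.get("favorited"),
        "retweeted": status.get("retweeted"),
        "retweet_count": status.get("retweet_count"),
    }
```

Missing fields come back as `None` instead of raising an error, which keeps a long collection run from stopping on one malformed tweet.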
![Page 4: Tweet Summarizer](https://reader038.fdocuments.us/reader038/viewer/2022100517/554d4a09b4c9053c678b5430/html5/thumbnails/4.jpg)
Filtering
● Converted HTML-encoded characters into ASCII.
● Removed any Unicode characters.
● Filtered out embedded URLs.
● Removed the retweets.
● Removed the user handles.
● Removed the hashtags.
● Removed tweets shorter than 5 words.
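The filtering steps above can be sketched roughly as below. Several details are assumptions, since the slides do not give them: retweets are detected by a leading "RT", hashtags are removed as whole tokens (not just the "#" sign), and the regular expressions and function names are illustrative.

```python
import html
import re
from typing import List

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HANDLE_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
RETWEET_RE = re.compile(r"^\s*RT\b")  # assumption: retweets start with "RT"


def clean_tweet(text: str) -> str:
    """Apply the character-level filters to a single tweet."""
    text = html.unescape(text)                            # &amp; -> &
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII
    text = URL_RE.sub(" ", text)
    text = HANDLE_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    return " ".join(text.split())                         # collapse whitespace


def filter_tweets(raw_tweets: List[str], min_words: int = 5) -> List[str]:
    """Drop retweets, clean the rest, and drop tweets under min_words words."""
    kept = []
    for raw in raw_tweets:
        if RETWEET_RE.match(raw):
            continue
        cleaned = clean_tweet(raw)
        if len(cleaned.split()) >= min_words:
            kept.append(cleaned)
    return kept
```

The word-count check runs after cleaning, so a tweet made mostly of links and hashtags is discarded even if its raw form had five or more tokens.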
![Page 5: Tweet Summarizer](https://reader038.fdocuments.us/reader038/viewer/2022100517/554d4a09b4c9053c678b5430/html5/thumbnails/5.jpg)
Distribution of Tweets
1752 tweets divided into 4 clusters.
Cluster 1 - tweets calling MH370 a bluff
Cluster 2 - tweets related to a certain golf star being attacked by hornets in Malaysia
Cluster 3 - tweets with information about a certain phase of MH370 search (SAR Mission)
Cluster 4 - tweets representing another phase of the 'MH370 search'
Clustering - Baseline
![Page 6: Tweet Summarizer](https://reader038.fdocuments.us/reader038/viewer/2022100517/554d4a09b4c9053c678b5430/html5/thumbnails/6.jpg)
Bursty topic model
● Binomial Distribution (Fung et al.)
● Sub topic segmentation
Step - 1: Associated words
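The slide names a binomial distribution but its formula is not in the transcript. One possible reading, shown here only as a sketch, is to treat a word as bursty in a time window when the number of tweets containing it is improbably high under a binomial model. The parameters `p` (the word's background probability) and `alpha` (the significance cut-off) are assumptions, as are the function names.

```python
from math import comb


def binomial_tail(n_docs: int, k_docs: int, p: float) -> float:
    """P(X >= k_docs) for X ~ Binomial(n_docs, p)."""
    return sum(
        comb(n_docs, i) * p**i * (1 - p) ** (n_docs - i)
        for i in range(k_docs, n_docs + 1)
    )


def is_bursty(n_docs: int, k_docs: int, p: float, alpha: float = 0.01) -> bool:
    """Flag a word as bursty when seeing k_docs or more of n_docs tweets
    contain it would be unlikely at its background rate p."""
    return binomial_tail(n_docs, k_docs, p) < alpha
```

For example, a word with background rate 0.05 that appears in 30 of 100 tweets in a window would be flagged, while one that appears in 5 of 100 would not.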
![Page 7: Tweet Summarizer](https://reader038.fdocuments.us/reader038/viewer/2022100517/554d4a09b4c9053c678b5430/html5/thumbnails/7.jpg)
Bursty topic model
Step - 2: Lifetime of sub topic
- set of words in association word set
![Page 8: Tweet Summarizer](https://reader038.fdocuments.us/reader038/viewer/2022100517/554d4a09b4c9053c678b5430/html5/thumbnails/8.jpg)
Implementation
Representation
Step - 1
● words as nodes
● association as edge weight
● find components to determine associated sets
Step - 2
● tune to obtain overlapping life times
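Step 1 of this representation can be sketched as a connected-components search over a word graph. The association measure is not given in the transcript, so this sketch assumes it is the number of tweets in which two words co-occur and that edges below a hypothetical `min_weight` are dropped.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def associated_word_sets(tweets: List[str], min_weight: int = 2) -> List[Set[str]]:
    """Group words into associated sets via connected components."""
    # Edge weight = number of tweets in which both words appear (assumption).
    weights: Dict[Tuple[str, str], int] = defaultdict(int)
    for tweet in tweets:
        words = sorted(set(tweet.lower().split()))
        for a, b in combinations(words, 2):
            weights[(a, b)] += 1

    # Keep only sufficiently strong edges, stored as an undirected graph.
    graph: Dict[str, Set[str]] = defaultdict(set)
    for (a, b), weight in weights.items():
        if weight >= min_weight:
            graph[a].add(b)
            graph[b].add(a)

    # Depth-first search to collect each connected component.
    seen: Set[str] = set()
    components: List[Set[str]] = []
    for start in list(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(component)
    return components
```

Each returned set is one candidate associated word set. Words with no edge at or above `min_weight` are left out entirely.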
![Page 9: Tweet Summarizer](https://reader038.fdocuments.us/reader038/viewer/2022100517/554d4a09b4c9053c678b5430/html5/thumbnails/9.jpg)
Ranking
LexRank
Input
- tweets from subtopics
Output
- ranked tweets per sub topic
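A simplified LexRank-style ranking for the tweets of one subtopic might look like the sketch below. It is not the authors' implementation: it uses raw term-frequency cosine similarity instead of TF-IDF, ignores self-similarity, and the `threshold`, `damping`, and `iterations` values are assumed defaults.

```python
import math
from collections import Counter
from typing import List, Tuple


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def lexrank(
    tweets: List[str],
    threshold: float = 0.1,
    damping: float = 0.85,
    iterations: int = 50,
) -> List[Tuple[str, float]]:
    """Rank tweets by centrality in a thresholded similarity graph."""
    vectors = [Counter(t.lower().split()) for t in tweets]
    n = len(vectors)
    if n == 0:
        return []
    # Unweighted edge between two different tweets that are similar enough.
    adj = [
        [
            1.0 if i != j and cosine(vectors[i], vectors[j]) >= threshold else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]
    degree = [sum(row) for row in adj]
    scores = [1.0 / n] * n
    # PageRank-style power iteration over the similarity graph.
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping
            * sum(adj[j][i] * scores[j] / degree[j] for j in range(n) if degree[j] > 0)
            for i in range(n)
        ]
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [(tweets[i], scores[i]) for i in order]
```

Tweets that share vocabulary with many others in the subtopic rise to the top, and a tweet with no similar neighbours keeps only the baseline score.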
![Page 10: Tweet Summarizer](https://reader038.fdocuments.us/reader038/viewer/2022100517/554d4a09b4c9053c678b5430/html5/thumbnails/10.jpg)
Human Evaluation
● Choosing number of categories.
● Discarding uninformative and incoherent tweets
● Ranking the remaining tweets in each cluster by richness of information and coherence
● Combining all the ranked tweets to obtain the summary
![Page 11: Tweet Summarizer](https://reader038.fdocuments.us/reader038/viewer/2022100517/554d4a09b4c9053c678b5430/html5/thumbnails/11.jpg)
Human Evaluation
A. Funny (Hold the front page! I've found the black box!!! Really sorry, it's been in my kit room all along)
B. Sarcastic (All psychics in the world should gather for a psychic convention to solve MH370 mysteries)
C. Uninformative ('my heart will go on' song made me think… what if a film about mh370 is made?)
D. Unrelated (Larrazabal stung by hornets in Malaysia)
E. Predictive (I am guessing flight is in the Warthon Basin floor, Indian Ocean, but it has to be proven)
![Page 12: Tweet Summarizer](https://reader038.fdocuments.us/reader038/viewer/2022100517/554d4a09b4c9053c678b5430/html5/thumbnails/12.jpg)
ROUGE
Metrics used:
● precision
● recall
Evaluation Models:
1. Bursty topic model
2. Clustering
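To illustrate what the two metrics measure, here is a minimal n-gram overlap sketch in the spirit of ROUGE-N. The slides do not say which ROUGE variant or settings were used, so the choice of unigrams (`n=1`) and whitespace tokenization are assumptions, and this is not the ROUGE toolkit itself.

```python
from collections import Counter
from typing import Tuple


def rouge_n(candidate: str, reference: str, n: int = 1) -> Tuple[float, float]:
    """Return (precision, recall) of n-gram overlap between two summaries."""

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall
```

Precision is the share of the candidate summary's n-grams that also occur in the reference; recall is the share of the reference's n-grams that the candidate recovers.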
![Page 13: Tweet Summarizer](https://reader038.fdocuments.us/reader038/viewer/2022100517/554d4a09b4c9053c678b5430/html5/thumbnails/13.jpg)
Results
We performed human evaluation with two volunteers. The results obtained from the ROUGE evaluation toolkit are presented below.
| Evaluation | Precision | Recall |
|---|---|---|
| Human 1 vs Clustering | 0.18060 | 0.08120 |
| Human 2 vs Clustering | 0.21070 | 0.19444 |
| Human 1 vs Human 2 | 0.41358 | 0.20150 |
| Human 1 vs Bursty Topic Model | 0.29032 | 0.18947 |
| Human 2 vs Bursty Topic Model | 0.27880 | 0.37346 |
![Page 14: Tweet Summarizer](https://reader038.fdocuments.us/reader038/viewer/2022100517/554d4a09b4c9053c678b5430/html5/thumbnails/14.jpg)
Intuitions
● Human 1 vs Human 2 difference
● Example: Tweet 1: ‘Sub search will end in another week’ Tweet 2: ‘Sub marine search will be called off in a week’
● The bursty topic model matches human evaluation better than clustering does because it classifies sub topics temporally.
![Page 15: Tweet Summarizer](https://reader038.fdocuments.us/reader038/viewer/2022100517/554d4a09b4c9053c678b5430/html5/thumbnails/15.jpg)
Future
Below are a few improvements that could be made to our model:
1. Incorporating grammar checking on tweets might yield better results
2. Tweak LexRank to accommodate user popularity as edge weights