Losing My Revolution:
Old Dominion University, Department of Computer Science
Hany SalahEldeen & Michael Nelson Losing My Revolution
How Many Resources Shared on Social Media Have Been Lost?
Hany M. SalahEldeen & Michael L. Nelson
All tweets are equal…
…but some are more equal than the others
Research Questions:
• How long would this last?
• And if lost, is there a backup somewhere?
• Finally, can we model this existence?
Phase 1: Data Gathering
We decided to collect as many social media posts as possible that satisfy these conditions:
• Have embedded resources.
• Have a time stamp.
• Come from different sources.
• Relate to socially significant events.
Data Gathering
• From Twitter, websites, and books:
  • The Egyptian revolution.
• From Twitter only:
  • Stanford’s SNAP dataset:
    • Iranian elections.
    • H1N1 virus outbreak.
    • Michael Jackson’s death.
    • Obama’s Nobel Peace Prize.
  • Twitter API:
    • The Syrian uprising.
Six Socially Significant Events
Preparation: Stanford’s SNAP Dataset
We extracted only tweets that:
• Are in English.
• Contain hashtags.
• Contain embedded resources.
• We start with manually assigned initial tags related to the event and extract the hashtags that co-occur with them.
Twitter Tag Expansion
| Event | Initial Hashtags | Top Co-occurring Hashtags |
|---|---|---|
| H1N1 Outbreak | #h1n1 = 61,351 | #swine = 61,829 |
| | | #swineflu = 56,419 |
| | | #flu = 8,436 |
| | | #pandemic = 6,839 |
| | | #influenza = 1,725 |
| | | #grippe = 1,559 |
| | | #tamiflu = 331 |
| | | #cnn = ……. |
| | | #health = ……. |
• We repeat this for the other three events from the SNAP dataset.
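
The tag expansion step could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `expand_hashtags`, the whitespace-based tokenization, and the `top_n` cutoff are all assumptions made for the example.

```python
from collections import Counter


def expand_hashtags(tweets, seed_tags, top_n=10):
    """Count hashtags that co-occur with any seed tag (hypothetical helper).

    tweets:    iterable of tweet texts
    seed_tags: manually assigned initial hashtags, e.g. ["#h1n1"]
    Returns the top_n (hashtag, count) pairs, most frequent first.
    """
    seeds = {tag.lower() for tag in seed_tags}
    counts = Counter()
    for text in tweets:
        # Naive tokenization (an assumption): split on whitespace,
        # keep words starting with '#', and drop trailing punctuation.
        tags = {
            word.lower().rstrip(".,!?:;")
            for word in text.split()
            if word.startswith("#")
        }
        if tags & seeds:
            # Count every other hashtag seen alongside a seed tag.
            counts.update(tags - seeds)
    return counts.most_common(top_n)
```

Each tweet contributes at most one count per hashtag, so the totals are numbers of tweets rather than numbers of hashtag occurrences.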
Twitter Tag Expansion
• We sort the expanded tags by number of tweets and then filter the tweets by hashtag co-occurrence.
Tweet Filtration
| Event | Hashtags Selected for Filtration | Tweets Extracted |
|---|---|---|
| H1N1 Outbreak | #h1n1 | 61,351 |
| | #h1n1 & #swine | 44,972 |
| | #h1n1 & #swine & #swineflu | 42,574 |
| | #h1n1 & #swine & #swineflu & #pandemic | 5,517 |

Final Dataset Size = 5,517
• We repeat this for the other three events.
• We might need further random sampling to reduce the size of the dataset.
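
The stepwise filtration in the table, where each added hashtag narrows the set of surviving tweets, might look like the sketch below. The helper names `hashtags_of` and `filter_stepwise` are hypothetical, and the tokenization is a simplifying assumption.

```python
def hashtags_of(text):
    """Return the lowercased hashtags in a tweet (naive whitespace split)."""
    return {
        word.lower().rstrip(".,!?:;")
        for word in text.split()
        if word.startswith("#")
    }


def filter_stepwise(tweets, required_tags):
    """Keep only tweets containing every tag, recording the count per step.

    Returns (surviving_tweets, steps), where steps is a list of
    (tag_combination_label, number_of_surviving_tweets).
    """
    surviving = list(tweets)
    steps = []
    for i, tag in enumerate(required_tags, start=1):
        # Each new tag is a further AND condition on the surviving set.
        surviving = [t for t in surviving if tag.lower() in hashtags_of(t)]
        steps.append((" & ".join(required_tags[:i]), len(surviving)))
    return surviving, steps
```

Because every step is a conjunction, the counts can only stay equal or shrink, which matches the decreasing numbers in the table.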
Tweet Filtration
• Social media played a key role in documenting and driving the revolution.
• Millions of tweets, Facebook posts, videos, and images were shared during the 18 days of the 25th January 2011 revolution.
• We manually extracted all the resources we could from the period of 20th January to 1st March.
• These resources were hard to extract.
Egyptian Revolution Dataset
Sources Utilized
Tweets From Tahrir
IAmJan25.com
Storify.com
• Because this event was still ongoing, we used the Twitter search API in the extraction process.
• As with the SNAP dataset, we applied hashtag expansion and filtration.
Syrian Uprising Dataset
| Initial Hashtags | Top Co-occurring Hashtags |
|---|---|
| #syria | #bashar |
| | #risedamascus |
| | #genocideinsyria |
| | #stopassad2012 |
| | #assadcrimes |
| | #assad |
What are people sharing?
For all the collected data, how many URIs are:
1. unique and how many are repeated?
2. still active on the live web and how many have died?
3. archived in one of the public web archives?
Data Analysis
Phase 2: Uniqueness and Existence
Uniqueness
A URL can take many different forms because of the numerous URL shorteners in use.
http://www.cnn.com
Could be:
http://bit.ly/2EEjBl
http://goo.gl/2ViC
% curl -I http://goo.gl/2ViC
HTTP/1.1 301 Moved Permanently
Content-Type: text/html; charset=UTF-8
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Date: Tue, 18 Sep 2012 01:08:44 GMT
Location: http://www.cnn.com/
Server: GSE
Transfer-Encoding: chunked
Uniqueness
• Thus, we resolve every extracted URL and keep the final destination URL after following the 30X redirects.
• Then we keep only the unique URLs and remove the duplicates.
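
The two steps above can be sketched as follows. To keep the example runnable without network access, the redirect lookup is passed in as `get_location`, a hypothetical callable that returns the `Location` target for a URL or `None` when there is no further redirect. A real version might issue an HTTP HEAD request, much like the `curl -I` call, but that part is left out here. The hop limit and the choice to keep the original URL for unresolvable chains are assumptions.

```python
from urllib.parse import urljoin


def resolve_final_url(url, get_location, max_hops=10):
    """Follow redirects until a URL with no further Location is reached.

    get_location(url) -> redirect target or None (hypothetical callable).
    If the chain loops or exceeds max_hops, the original URL is returned
    unchanged (an assumption) so that it can be judged in a later step.
    """
    start = url
    seen = {url}
    for _ in range(max_hops):
        target = get_location(url)
        if target is None:
            return url  # final destination reached
        url = urljoin(url, target)  # Location may be a relative reference
        if url in seen:
            return start  # redirect loop
        seen.add(url)
    return start  # too many hops


def unique_resources(urls, get_location):
    """Resolve every URL and return the sorted set of final destinations."""
    return sorted({resolve_final_url(u, get_location) for u in urls})
```

With a stub mapping both shortened forms to `http://www.cnn.com/`, the three input URLs collapse to a single unique resource.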
Uniqueness
| Collection | All Resources | Unique Resources |
|---|---|---|
| H1N1 Outbreak | 5,517 | 1,645 |
| Michael Jackson | 2,293 | 1,187 |
| Iran | 3,429 | 1,340 |
| Obama | 1,118 | 370 |
| Egypt | 7,313 | 6,154 |
| Syria | 1,955 | 355 |
Existence on the live-web
• For each unique URL we resolved the final HTTP response and considered two classes:
  • Success: 200 OK.
  • Failure: the 4XX and 50X families, 30X redirect loops, and soft 404s.
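
A minimal sketch of this two-class decision is given below. The function name and the boolean flags are hypothetical. Detecting a redirect loop or a soft 404 is assumed to happen elsewhere, and treating any status code other than 200 as a failure is an assumption that goes slightly beyond the cases listed above.

```python
def classify_resource(status_code, redirect_loop=False, soft_404=False):
    """Label a resolved resource as 'success' or 'failure'.

    status_code:   final HTTP status code after following redirects
    redirect_loop: True if the 30X chain never terminated
    soft_404:      True if the page returned 200 but is really an error page
    """
    if redirect_loop or soft_404:
        return "failure"
    if status_code == 200:
        return "success"
    # 4XX and 50X families; other codes are also counted as failures
    # here, which is an assumption of this sketch.
    return "failure"
```

For example, `classify_resource(200, soft_404=True)` is a failure even though the server answered 200 OK.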
Existence on the live-web
| Collection | Resources Missing | Percentage Missing |
|---|---|---|
| H1N1 Outbreak | 394 | 23.95% |
| Michael Jackson | 397 | 33.45% |
| Iran | 339 | 25.30% |
| Obama | 92 | 24.86% |
| Egypt | 645 | 10.48% |
| Syria | 25 | 7.04% |
Existence in Public Web-Archives
• For each unique URL we downloaded its TimeMap from the Memento aggregator.
• The aggregator checks 10+ public web archives for snapshots of the resource.
• A resource is declared archived if it has at least one memento.
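
The "at least one memento" rule can be illustrated with the sketch below. It assumes the TimeMap has already been downloaded as text in the link format described in the Memento specification (RFC 7089), where each entry looks like `<URI>; rel="..."`. The function names and the regular expression are illustrative, and no aggregator endpoint is assumed.

```python
import re

# Matches one link-format entry and captures its URI and rel value.
LINK_ENTRY = re.compile(r'<([^>]+)>\s*;[^<]*?rel="([^"]*)"')


def memento_uris(timemap_text):
    """Return the URIs whose rel value includes the 'memento' relation."""
    uris = []
    for uri, rel in LINK_ENTRY.findall(timemap_text):
        # rel may hold several relations, e.g. "first memento".
        if "memento" in rel.split():
            uris.append(uri)
    return uris


def is_archived(timemap_text):
    """A resource counts as archived if it has at least one memento."""
    return len(memento_uris(timemap_text)) >= 1
```

Entries such as `rel="original"` or `rel="timegate"` are skipped, so a TimeMap that lists only those is treated as not archived.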
Existence in Public Web-Archives
| Collection | Resources Archived | Percentage Archived |
|---|---|---|
| H1N1 Outbreak | 693 | 42.12% |
| Michael Jackson | 406 | 34.20% |
| Iran | 516 | 38.51% |
| Obama | 176 | 47.57% |
| Egypt | 1,242 | 20.18% |
| Syria | 19 | 5.35% |
Phase 3: Existence as a Function of Time
Timeline of Events
List of events
Social Events Having a Bimodal Time Distribution
Resources Missing & Archived
| Collection | Percentage Missing | Percentage Archived |
|---|---|---|
| H1N1 Outbreak | 23.49% | 41.65% |
| Michael Jackson | 36.24% | 39.45% |
| Iran | 26.98% | 43.08% |
| Obama | 24.59% | 47.87% |
| Egypt | 10.48% | 20.18% |
| Syria | 7.04% | 5.35% |
| | 31.62% | 30.78% |
| | 24.47% | 36.26% |
| | 25.64% | 43.87% |
| | 26.15% | 46.15% |
Resources Missing & Archived
Curve Fitting The Data
Conclusions
• We measured 21,625 resources from six datasets against the live web and the public web archives.
• One year after publication, about 11% of the content shared on social media will be gone.
• After that first year, we continue to lose roughly 0.02% per day.
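
As a rough worked example, the last two conclusions can be combined into a simple linear estimate. This sketch uses only the two rounded figures stated above (about 11% after one year, then roughly 0.02% per day). It is not the fitted curve itself, and the function name and the cap at 100% are assumptions.

```python
FIRST_YEAR_LOSS = 11.0  # percent lost one year after publication (approximate)
DAILY_LOSS = 0.02       # percent lost per day after the first year (approximate)


def estimated_percent_lost(age_days):
    """Estimate the percentage of shared resources lost at a given age.

    Only defined from one year onward, since the first-year figure is a
    single stated point rather than a rate.
    """
    if age_days < 365:
        raise ValueError("estimate only defined for ages of 365 days or more")
    estimate = FIRST_YEAR_LOSS + DAILY_LOSS * (age_days - 365)
    return min(estimate, 100.0)
```

Under these figures, a resource shared two years ago would have an estimated loss of about 11 + 0.02 × 365 = 18.3%.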
Appendix A: Extra Slides
Stanford’s SNAP Dataset:
• Collection of about 50 large network datasets.
• The Twitter posts dataset comprises nearly half a billion tweets.
• Posted from June 1st, 2009 to December 31st, 2009.
• Nearly 17 million users.
• Nearly 20-30% of all posts published on Twitter during this period.
Data Gathering
Existence as a Function of Time
Dual-Peaked Events:
• Iranian Elections:
  • 13th Jun. 2009: Protests and elections
  • 1st Aug. 2009: Trials
• Michael Jackson’s Death:
  • 25th Jun. 2009: Death announcement
  • 10th Jul. 2009: Death unnatural causes
• H1N1 Outbreak:
  • 11th Sept. 2009: Worldwide outbreak
  • 5th Oct. 2009: Vaccine release
• Obama’s Nobel Peace Prize:
  • 9th Oct. 2009: Prize announcement
  • 10th Dec. 2009: Nobel Ceremony
Future Work
In the next steps we will:
• expand the datasets.
• cover the uncovered temporal areas in 2010 and before 2009.
• closely examine the extended points and tune the function with time.
• analyze other factors such as publishing venue, rate of sharing, popularity of authors, and the nature of the event.