On Benchmarking Online Social Media Analytical Queries
Weining Qian
with Haixin Ma, Fan Xia, Jinxian Wei, Chengcheng Yu, and Aoying Zhou
http://database.ecnu.edu.cn/
6/23/2013 GRADES 2013 @ NY, USA 2
Outline
• Motivation
• BSMA: Benchmark for Social Media Analytical query processing
  – Data set
  – Queries
  – Measurements
• Preliminary results
• Discussion/ongoing work
Motivation
• Social media has become a major source for sensing the world
  – Emergent event monitoring, political election/stock market prediction, product surveys, etc.
• Social media = social network + media
  – Social network: large-scale static/dynamic networks
  – Media: content with timestamps
• Both collective behavior analysis and personalized data analysis have many applications
  – Various kinds of queries
Motivation
• Many "big data" management/mining systems exist (and more may be coming)
  – Parallel RDBMS, NoSQL/NewSQL systems (Hadoop-related ones, Cassandra, etc.)
• Which system/technology is most suitable for a given problem?
  – A benchmark is needed
Social media data
Schema
BSMA
• Queries (to be extended/revised)
• Data set (crawled from Sina Weibo)
• Data generator (under development)
• BSMA performance testing tool (based on YCSB)
Data acquisition
• Crawled from Sina Weibo ("Chinese Twitter")
Haixin Ma, Weining Qian, Fan Xia, Xiaofeng He, Jun Xu, Aoying Zhou: Towards modeling popularity of microblogs. Frontiers of Computer Science 7(2): 171-184 (2013)
Data set
• Followship network
  – Seed users: 11 lawyers and opinion leaders and 21 researchers
  – 2nd level users from seeds: 120,000+ users
  – 3rd level users from seeds: 1.7+ million users
  – 4th level users from seeds: 18+ million users (incomplete)
• More than 1 billion following relationships
  – Tweets from 1.7+ million users
  – From Aug. 2009 to Jun. 2012
  – 480+ million tweets (about 51.11% of them are retweets; the rest are original tweets)
Queries
• Queries on social networks
  – E.g. list the common followees of users A and B
• Queries on hotspots
  – Hotspots may be: users, tweets, topics, etc.
  – E.g. list the tweets with the highest number of retweets
• Queries on timelines
  – E.g. list the 10 most recent tweets posted by A's followees
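The three query classes above can be illustrated with a minimal in-memory sketch. The toy data, the tuple layout, and the function names (`common_followees`, `most_retweeted`, `recent_timeline`) are hypothetical and chosen only for illustration; they are not the BSMA schema or its query definitions.

```python
from collections import Counter

# Hypothetical toy data (not the BSMA schema).
follows = {                      # user -> set of followees
    "A": {"C", "D", "E"},
    "B": {"D", "E", "F"},
}
tweets = [                       # (tweet_id, author, timestamp, retweeted_id or None)
    (1, "C", 100, None),
    (2, "D", 110, None),
    (3, "E", 120, 1),
    (4, "F", 130, 1),
    (5, "D", 140, 2),
    (6, "E", 150, None),
]


def common_followees(a, b):
    """Social network query: users followed by both a and b."""
    return follows.get(a, set()) & follows.get(b, set())


def most_retweeted(k):
    """Hotspot query: the k original tweets with the most retweets."""
    counts = Counter(orig for _, _, _, orig in tweets if orig is not None)
    return counts.most_common(k)


def recent_timeline(user, k):
    """Timeline query: the k most recent tweets posted by user's followees."""
    followees = follows.get(user, set())
    posted = [t for t in tweets if t[1] in followees]
    return sorted(posted, key=lambda t: t[2], reverse=True)[:k]
```

On this toy data, `common_followees("A", "B")` gives the users D and E, and `recent_timeline("A", 2)` returns tweets 6 and 5. At the scale described in the data set, each of these becomes a join or top-k problem over very large tables, which is what the benchmark is meant to exercise.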
Query example (Q12)
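[Figure: relational algebra expression for Q12 containing three joins (⨝); the expression itself was not recovered]
Rank the tweets appearing in A's followees' timelines according to the number of retweets.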
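A rough sketch of a query in the spirit of Q12, written against SQLite from Python. The schema slide was not recovered, so the table and column names (`followship`, `tweet`, `retweet`, `mid`, `uid`, `orig_mid`) are assumptions. The sketch also reads "tweets appearing in A's followees' timelines" as "tweets posted by A's followees", and it uses two joins rather than the three shown on the slide, so it may differ from the actual Q12 definition.

```python
import sqlite3


def rank_followee_tweets(conn, user):
    """Rank tweets posted by `user`'s followees by retweet count (descending)."""
    sql = """
        SELECT t.mid, COUNT(r.mid) AS n_retweets
        FROM followship AS f
        JOIN tweet AS t ON t.uid = f.followee
        LEFT JOIN retweet AS r ON r.orig_mid = t.mid
        WHERE f.follower = ?
        GROUP BY t.mid
        ORDER BY n_retweets DESC, t.mid ASC
    """
    return conn.execute(sql, (user,)).fetchall()


def demo():
    """Build a tiny hypothetical database and run the ranking for user 'A'."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE followship (follower TEXT, followee TEXT);
        CREATE TABLE tweet (mid INTEGER PRIMARY KEY, uid TEXT);
        CREATE TABLE retweet (mid INTEGER PRIMARY KEY, orig_mid INTEGER);
        INSERT INTO followship VALUES ('A', 'C'), ('A', 'D'), ('B', 'D');
        INSERT INTO tweet VALUES (1, 'C'), (2, 'D'), (3, 'E');
        INSERT INTO retweet VALUES (10, 2), (11, 2), (12, 1), (13, 3);
    """)
    try:
        return rank_followee_tweets(conn, "A")
    finally:
        conn.close()
```

The `LEFT JOIN` keeps followee tweets that were never retweeted (with a count of 0), and the secondary sort key makes the ranking deterministic when counts tie.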
BSMA performance testing tool based on YCSB
• YCSB: Yahoo! Cloud Serving Benchmark
  – http://wiki.github.com/brianfrankcooper/YCSB/
• BSMA modifications
  – Query argument and parameter generation
    • User IDs, top-k, timespan, etc.
  – Query wrappers
  – https://github.com/xiafan68/BSMA
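To make the idea of query argument generation concrete, here is a minimal sketch of drawing a user ID, a top-k value, and a timespan for one query invocation. It is not the BSMA generator (which extends YCSB); the function name `make_query_args`, its parameters, and the uniform sampling are all assumptions made for illustration.

```python
import random
from datetime import datetime, timedelta


def make_query_args(user_ids, rng, start, end, max_k=100, max_span_days=30):
    """Draw one hypothetical set of query arguments.

    user_ids: candidate user IDs; rng: a random.Random instance;
    start/end: bounds of the data set's time range.
    """
    uid = rng.choice(user_ids)                      # uniform choice (assumption)
    k = rng.randint(1, max_k)                       # top-k parameter
    span = timedelta(days=rng.randint(1, max_span_days))
    latest_start = end - span
    if latest_start < start:
        raise ValueError("time range is shorter than the requested span")
    # Place the window uniformly so that it fits inside [start, end].
    offset = rng.random() * (latest_start - start).total_seconds()
    t0 = start + timedelta(seconds=offset)
    return {"user_id": uid, "k": k, "from": t0, "to": t0 + span}


# Usage: a fixed seed makes the generated workload reproducible.
rng = random.Random(42)
args = make_query_args([101, 102, 103], rng,
                       datetime(2009, 8, 1), datetime(2012, 6, 30))
```

Given the skewed distributions in real social media data, a realistic generator would probably sample user IDs non-uniformly; uniform sampling is used here only to keep the sketch short.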
Measurements
• Throughput
  – The highest throughput of the system under different settings of the number of threads
• Latency
  – The (average) latency of the system under the setting with the 2nd highest throughput
• Scalability
  – The slope of the throughput/latency plot
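The three measurements can be computed from a list of per-thread-setting runs, as in the sketch below. The input layout and the function name `summarize` are hypothetical. The slide does not say which quantity is on which axis, so the sketch assumes latency is plotted against throughput and takes the least-squares slope of that line; the actual BSMA definition may differ.

```python
def summarize(runs):
    """Summarize benchmark runs.

    runs: list of (threads, throughput_ops_per_s, avg_latency_ms), one per setting.
    Returns (peak_throughput, latency_at_2nd_highest_throughput, slope).
    """
    if len(runs) < 2:
        raise ValueError("need at least two thread settings")

    # Throughput: best value over all thread settings.
    by_throughput = sorted(runs, key=lambda r: r[1], reverse=True)
    peak_throughput = by_throughput[0][1]

    # Latency: average latency at the setting with the 2nd highest throughput.
    latency_at_second = by_throughput[1][2]

    # Scalability: least-squares slope of latency (y) versus throughput (x).
    xs = [r[1] for r in runs]
    ys = [r[2] for r in runs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all throughputs are equal; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return peak_throughput, latency_at_second, sxy / sxx


# Usage with made-up numbers (not measured results).
runs = [(1, 100.0, 10.0), (2, 200.0, 20.0), (4, 300.0, 30.0), (8, 250.0, 40.0)]
peak, latency, slope = summarize(runs)
```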
WISE 2012 Challenge Performance Track
• A preliminary version of BSMA was used in the WISE 2012 Challenge Performance Track
• 4 teams
  – A special-purpose (in-memory) system
  – An HBase-based system with a secondary index
  – An SQLite-based system with many optimizations
  – A special-purpose system with B+-tree optimizations for different kinds of queries
Results
Find the set of people who share the same followee with the specified user.
Difficulties
• Joins of very large tables
• Skewness of the data distribution
  – Power-law distribution
• Preserving the order of results
Future work
• Data generator
  – More than a social network generator
  – Simulate user activities
    • Followship network
    • Tweeting and retweeting actions
    • Timeline
    • Topics
Future work
• Queries related to the content of tweets
  – Queries with keyword search
  – Real-life data set needed
• More queries
• Performance testing of more systems
  – RDBMS, graph databases, etc.
More on BSMA
• Original WISE 2012 Challenge page
  – http://www.wise2012.cs.ucy.ac.cy/challenge.html
• WISE 2012 Challenge follow-up information
  – https://wnqian.wordpress.com/research/wise2012challenge/
• BSMA performance testing tool
  – https://github.com/xiafan68/BSMA
• Suggestions or comments are welcome!
  – Mailto: [email protected]
Thanks!
http://database.ecnu.edu.cn/