Big data @ Hootsuite analytics
![Page 1: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/1.jpg)
Big Data @ Hootsuite Analytics
Claudiu Coman
![Page 2: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/2.jpg)
About us
● You might have heard of uberVU
● Acquired by Hootsuite to develop their new analytics
● New scalability challenges => millions of customers
● Everything is in the cloud => Amazon
● 10 M social media posts per day
● 100+ Amazon EC2 instances of various sizes
● 650 GB media posts/month
● 400 GB worth of analytics data
● 10+ MongoDB clusters
![Page 3: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/3.jpg)
What this presentation will be about
● What is the big data we’re working with
● Infrastructure
● Technologies
● What we do on top of the data
● How we currently display the data
● What’s in store for the future
![Page 4: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/4.jpg)
Our data
● Our data is made up of media posts
● Revolves around search queries
● You can also connect accounts and pages
● Sources: Twitter, Facebook, Google+, WordPress, Flickr, Picasa and others
● To acquire data
  ○ a lot of REST
  ○ some streaming
  ○ very little scraping
![Page 5: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/5.jpg)
Mentions
● Every piece of data that we process
  ○ gets normalized
  ○ gets annotated
  ○ gets converted to a standard JSON format
● We call the end result a mention
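The normalize, annotate and convert steps above can be sketched roughly as follows. This is a minimal illustration only: the field names (`source`, `author`, `content`, `published`, `annotations`), the shape of the raw payload and the helper names are all assumptions, not the actual mention schema.

```python
import json
from datetime import datetime, timezone


def normalize_tweet(raw: dict) -> dict:
    """Map a hypothetical raw Twitter payload onto a common mention shape."""
    return {
        "source": "twitter",
        "id": str(raw["id"]),
        "author": raw["user"]["screen_name"],
        "content": raw["text"].strip(),
        # Store timestamps as ISO 8601 strings in UTC.
        "published": datetime.fromtimestamp(
            raw["created_ts"], tz=timezone.utc
        ).isoformat(),
        "annotations": {},
    }


def annotate(mention: dict, annotators: dict) -> dict:
    """Run each annotator on the text and store its result under its name."""
    for name, annotator in annotators.items():
        mention["annotations"][name] = annotator(mention["content"])
    return mention


def to_json(mention: dict) -> str:
    """Serialize the mention to a JSON string with stable key order."""
    return json.dumps(mention, sort_keys=True)
```

Each source (Facebook, blogs and so on) would get its own normalizer producing the same shape, so everything downstream only has to understand one format.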
![Page 6: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/6.jpg)
Mentions (2)
![Page 7: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/7.jpg)
Annotations
● Language detection
  ○ written in C++, wrapped in Python
  ○ can process ~300 mentions/second
● Sentiment detection
  ○ external provider
  ○ piggybacking for efficiency
● Location detection
  ○ in-house algorithm
  ○ text tokenization and matching against a locations database
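As an illustration of "tokenization and matching against a locations database", here is a small sketch. It is not the in-house algorithm: the tiny `LOCATIONS` dictionary, the ASCII-only tokenizer and the longest-match-first strategy are all assumptions made for the example.

```python
import re

# Hypothetical miniature locations database: token tuple -> canonical name.
LOCATIONS = {
    ("new", "york"): "New York, US",
    ("bucharest",): "Bucharest, RO",
    ("vancouver",): "Vancouver, CA",
}
MAX_WORDS = max(len(key) for key in LOCATIONS)


def tokenize(text: str) -> list:
    """Lowercase the text and split it into ASCII word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def detect_locations(text: str) -> list:
    """Scan the tokens left to right, preferring the longest matching name."""
    tokens = tokenize(text)
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in LOCATIONS:
                found.append(LOCATIONS[key])
                i += n  # skip past the matched tokens
                break
        else:
            i += 1  # no location starts at this token
    return found
```

For example, `detect_locations("Flying from New York to Bucharest!")` matches the two-token name first and then the single-token one.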
![Page 8: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/8.jpg)
The Pipeline
● Our processing infrastructure is a pipeline
● Producer-Consumer pattern
● Enables us to scale parts of the infrastructure separately
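The producer-consumer pattern can be shown in a few lines with an in-process queue. This sketch uses Python threads and `queue.Queue` purely for illustration; in the real pipeline the stages are separate processes connected by a message queue, and the `upper()` step stands in for whatever work a consumer does.

```python
import queue
import threading


def producer(out_q: queue.Queue, posts: list) -> None:
    """Push every post onto the queue, then a sentinel to signal the end."""
    for post in posts:
        out_q.put(post)
    out_q.put(None)


def consumer(in_q: queue.Queue, results: list) -> None:
    """Pull posts off the queue and process them until the sentinel arrives."""
    while True:
        post = in_q.get()
        if post is None:
            break
        results.append(post.upper())  # placeholder for real processing


def run_pipeline(posts: list) -> list:
    """Wire one producer to one consumer through a bounded queue."""
    q = queue.Queue(maxsize=100)
    results = []
    workers = [
        threading.Thread(target=producer, args=(q, posts)),
        threading.Thread(target=consumer, args=(q, results)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results
```

Because the two sides only share the queue, either one can be scaled or replaced without touching the other.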
![Page 9: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/9.jpg)
The pipeline (2)
![Page 10: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/10.jpg)
The pipeline (3)
● 100+ consumer types
● 450 consumer instances
● Automatic scaling algorithm developed by us
  ○ whenever a consumer falls behind, the system deploys new consumer instances
  ○ automatically adjusts cluster size
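The slides do not describe the scaling algorithm itself. One simple way to decide how many instances a lagging consumer needs is to size the fleet so the current backlog drains within a target time; the function below is a sketch of that idea, and every parameter name and default in it is an assumption.

```python
import math


def desired_instances(
    backlog: int,
    rate_per_instance: float,
    target_drain_seconds: float,
    min_instances: int = 1,
    max_instances: int = 50,
) -> int:
    """Return how many consumer instances would drain `backlog` in time.

    backlog: messages currently waiting in the consumer's queue
    rate_per_instance: messages one instance processes per second
    target_drain_seconds: how quickly the backlog should be cleared
    """
    if rate_per_instance <= 0 or target_drain_seconds <= 0:
        raise ValueError("rate and drain time must be positive")
    needed = math.ceil(backlog / (rate_per_instance * target_drain_seconds))
    # Clamp to the allowed fleet size.
    return max(min_instances, min(max_instances, needed))
```

With a backlog of 12,000 messages, 10 messages/second per instance and a 5-minute target, this asks for 4 instances.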
![Page 11: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/11.jpg)
MongoDB
● 10+ clusters
● Our biggest cluster
  ○ 1500 operations/second
  ○ m2.xlarge instances (17 GB RAM, 6.5 ECU)
  ○ 8x80 GB RAID10
● Hard to manage databases in multiple clusters
  ○ we wrote mongo-pool: https://github.com/uberVU/mongo-pool
● Cluster pyramid structure for cost efficiency
● Communication between clusters through our own mongo-oplogreplay: https://github.com/uberVU/mongo-oplogreplay
![Page 12: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/12.jpg)
MongoDB mention clusters pyramid
![Page 13: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/13.jpg)
Kestrel
● Distributed message queue developed by Twitter
● Uses Memcache protocol
● Disk persistence
● 400 consumer operations/second
● Part of our pipeline core
● Extremely reliable
● Sending gzipped content to save I/O costs
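Gzipping a message before it goes onto the queue can be as simple as the sketch below. It only covers packing and unpacking the payload with the standard library; the `pack`/`unpack` names are hypothetical and the actual queue client calls are left out.

```python
import gzip
import json


def pack(mention: dict) -> bytes:
    """Serialize a mention to JSON and gzip it before enqueueing."""
    return gzip.compress(json.dumps(mention).encode("utf-8"))


def unpack(payload: bytes) -> dict:
    """Reverse of pack(): gunzip and parse the JSON on the consumer side."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))
```

The saving depends on the content: long, repetitive text such as blog posts compresses well, while very short messages may not shrink at all because of the gzip header overhead.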
![Page 14: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/14.jpg)
Redis, Memcache
● Gradually replacing Memcache with Redis
● Used for high-access temporary information
● 60 GB worth of data
![Page 15: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/15.jpg)
Other technologies
● RabbitMQ for asynchronous tasks
● DynamoDB
  ○ analytics
  ○ auxiliary permanent storage for use cases that don’t take a lot of space
● S3 for data with low access rates
● Glacier for archived data
![Page 16: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/16.jpg)
System metrics & monitoring
● Graphite
● 150K system metrics
● Alerts are generated from Graphite metrics
● In-house alert detection
● Nagios/Nagstamon
● PagerDuty for on-call
![Page 17: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/17.jpg)
Analytics overview
● Currently in DynamoDB
● MongoDB still runs in parallel, considering a move back to MongoDB
● Breakdowns on language, location, sentiment, gender
● Support for several resolutions (day, hour, 15min)
● Optimized for language and location filtering (95% of our queries)
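Supporting several resolutions usually comes down to flooring each timestamp to the start of its bucket. The sketch below shows that arithmetic for the three resolutions listed; the function names and the choice of Unix-epoch seconds in UTC are assumptions.

```python
# Bucket sizes in seconds for the resolutions listed on the slide.
RESOLUTIONS = {"day": 86400, "hour": 3600, "15min": 900}


def bucket_start(ts: int, resolution: str) -> int:
    """Floor a Unix timestamp (seconds, UTC) to the start of its bucket."""
    size = RESOLUTIONS[resolution]
    return ts - ts % size


def bucket_keys(ts: int) -> dict:
    """Return the bucket start for every supported resolution."""
    return {name: bucket_start(ts, name) for name in RESOLUTIONS}
```

A mention published at timestamp 3725 would then be counted in the hour bucket starting at 3600 and the day bucket starting at 0.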
![Page 18: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/18.jpg)
Aggregation pipeline
● We aggregate analytics to reduce writes
● Efficient but simple concurrency through Redis primitives
● Got a 5x improvement!
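The idea of aggregating before writing can be sketched as a buffer that sums increments per key and flushes them in one batch. This example is not the real implementation: it swaps the Redis primitives mentioned above for an in-process `threading.Lock` so that it runs standalone, and the `WriteAggregator` class and its parameters are hypothetical.

```python
import threading
from collections import Counter


class WriteAggregator:
    """Buffer counter increments and write them out in batches."""

    def __init__(self, flush_fn, max_pending: int = 1000):
        self._flush_fn = flush_fn        # called with a dict of key -> total
        self._max_pending = max_pending  # increments to buffer before flushing
        self._counts = Counter()
        self._pending = 0
        self._lock = threading.Lock()

    def increment(self, key, amount: int = 1) -> None:
        """Record one increment; flush when the buffer is full."""
        with self._lock:
            self._counts[key] += amount
            self._pending += 1
            should_flush = self._pending >= self._max_pending
        if should_flush:
            self.flush()

    def flush(self) -> None:
        """Swap out the buffer under the lock, then write it outside the lock."""
        with self._lock:
            batch, self._counts = self._counts, Counter()
            self._pending = 0
        if batch:
            self._flush_fn(dict(batch))
```

With `max_pending=1000`, a thousand increments spread over a handful of keys become a single batched write of a few totals.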
![Page 19: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/19.jpg)
Aggregation pipeline (2)
![Page 20: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/20.jpg)
Tagcloud
● Tagcloud algorithm that can detect n-grams of all lengths
● Some of the data we analyze is blog content, which can be very big
● Needed something fast
● In-house algorithm
  ○ linear complexity (doesn’t go up with max n-gram size)
  ○ based on statistical correlations
![Page 21: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/21.jpg)
Signals
● Need to synthesize all the data we’re collecting
● Top Stories
  ○ O(1) algorithm
● Influencers
  ○ dependency graph
  ○ edges are interactions between users
● Spikes & Bursts
  ○ code written in C++ to reduce time
  ○ statistical algorithms on top of our analytics timeseries
  ○ adapt reads from analytics based on data size
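The slides do not say which statistical algorithms are used for Spikes & Bursts. One common approach, shown here only as an illustrative stand-in, flags a point when it sits several standard deviations above the mean of a trailing window (a z-score test). The window size, threshold and `min_sigma` floor are assumed values.

```python
from statistics import mean, pstdev


def find_spikes(series, window: int = 5, threshold: float = 3.0,
                min_sigma: float = 1.0) -> list:
    """Return indexes whose value is far above the trailing-window mean.

    A point is a spike when (value - mean) / stddev exceeds `threshold`,
    with mean and stddev taken over the previous `window` points.
    `min_sigma` keeps the score finite when the window is perfectly flat.
    """
    spikes = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = max(pstdev(history), min_sigma)
        if (series[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes
```

On an hourly mention count such as `[10, 12, 11, 9, 10, 50, 11, 10]`, only the jump to 50 is flagged.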
![Page 22: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/22.jpg)
Boards
● Needed a good way to display all this data
● Designed Boards
  ○ released this year
  ○ allows you to create a dashboard with metric visualizations
  ○ drag-and-drop widgets to arrange them your way
![Page 23: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/23.jpg)
Boards Demo
![Page 24: Big data @ Hootsuite analtyics](https://reader033.fdocuments.us/reader033/viewer/2022052411/557a85bcd8b42ac8638b482e/html5/thumbnails/24.jpg)
Future plans
● Hootsuite has millions of users
● Our analytics infrastructure will have to scale
  ○ Transitioning to streaming services
  ○ Larger MongoDB clusters
    ■ more shards for write throughput
    ■ more secondaries for reads
● Add more metrics