Cassandra NYC 2011: Ilya Maykov (Ooyala) – Scaling Video Analytics with Apache Cassandra
Scaling Video Analytics
With Apache Cassandra
ILYA MAYKOV | Dec 6th, 2011
Agenda
Ooyala – quick company overview
What do we mean by “video analytics”?
What are the challenges?
Cassandra at Ooyala – technical details
Lessons learned
Q&A
Analytics Overview
Aggregate and Visualize Data
Optimize automagically
Give Insights
Enable experimentation
Analytics Overview
Go from this … to this … and this!
System Architecture
State of Analytics Today
Collect vast amounts of data
Aggregate, slice in various dimensions
Report and visualize
Personalize and recommend
Scalable, fault tolerant, near real-time using Hadoop + Cassandra
Analytics Challenges
Scale
Processing Speed
Depth
Accuracy
Developer Speed
Challenge: Scale
150M+ unique monthly users
15M+ monthly video hours
Daily inflow: billions of log pings, TBs of uncompressed logs
10TB+ of historical analytics data in C* covering a period of about 4 years
Exponential data growth in C*: currently 1TB+ per month
Challenge: Processing Speed
Large “fan-out” to multiple dimensions + per-video-asset analytics = lots of data being written. Parallelizable!
“Analytics delay” metric = time from log ping hitting a server to being visible to a publisher in the analytics UI
Current avg. delay: 10-25 minutes depending on time of day
Target max analytics delay: <30 minutes (Hadoop system)
Would like <1 minute (future real-time processing system)
Challenge: Depth
Per-video-asset analytics means millions of new rows added and/or updated in each CF every day
10+ dimensions (CFs) for slicing data in different ways
Queries range from “everything in my account for all time” to “video X in city Y on date Z”
We’d like 1-hour granularity, but that’s up to 24x more rows
Or even 1-minute granularity in real-time, but that could be >1000x more rows …
Challenge: Accuracy
Publishers make business decisions based on analytics data
Ooyala makes business decisions based on analytics data
Ooyala bills publishers based on analytics data
Analytics need to be accurate and verifiable
Challenge: Developer Speed
We’re still a small company with limited developer resources
Like to iterate fast and release often, but …
… we use Hadoop MR for large-scale data processing
Hadoop is a Java framework
So, MapReduce jobs have to be written in Java … right?
Word Count Example: Java (code listing on slide)
Word Count Example: Ruby (code listing on slide)
Word Count Example: Scala (code listing on slide)
Challenge: Developer Speed
Word Count MR – Language Comparison

Language | Lines | Characters | Development Speed | Runtime Speed | Hadoop API
Java     |    69 |       2395 | Low               | High          | Native
Ruby     |    30 |        738 | High              | Low           | Streaming
Scala    |    35 |       1284 | Medium            | High          | Native
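The Java, Ruby, and Scala word count listings were shown as slide images and are not preserved in this transcript. A minimal sketch of what the Ruby (Hadoop Streaming) version could look like, assuming the usual mapper/reducer split over stdin/stdout (not the original slide code; the script name below is illustrative):

```ruby
# Sketch of a Hadoop Streaming word count in Ruby (assumed, not the
# original slide code). Streaming runs mapper and reducer as separate
# processes, with a sort over the mapper output between them.

# Mapper: emit "word<TAB>1" for every word in a line.
def map_line(line)
  line.downcase.scan(/\w+/).map { |w| "#{w}\t1" }
end

# Reducer: counts for the same word arrive contiguously because
# streaming sorts mapper output by key before the reduce phase.
def reduce_lines(lines)
  counts = Hash.new(0)
  lines.each do |l|
    word, n = l.split("\t")
    counts[word] += n.to_i
  end
  counts.map { |word, c| "#{word}\t#{c}" }
end

if __FILE__ == $PROGRAM_NAME
  case ARGV[0]
  when "map"    then STDIN.each_line { |l| puts map_line(l) }
  when "reduce" then puts reduce_lines(STDIN.readlines)
  end
end
```

The brevity next to the equivalent Java Mapper/Reducer classes is the point of the comparison table above, at the cost of Streaming's process-per-task overhead.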
Why Cassandra?
A bit of history
2008 – 2009: Single MySQL DB
Early 2010:
Too much data
Want higher granularity and more ways to slice data
Need a scalable data store!
Why Cassandra?
Linear scaling (space, load) – handles Scale & Depth challenges
Tunable consistency – QUORUM/QUORUM R/W allows accuracy
Very fast writes, reasonably fast reads
Great community support, rapidly evolving and improving codebase – 0.6.13 => 0.8.7 increased our performance by >4x
Simpler and fewer dependencies than HBase, richer data model than a simple K/V store, more scalable than an RDBMS, …
Data Model - Overview
Row keys specify the entity and time (and some other stuff …)
Column families specify the dimension
Column names specify a data point within that dimension
Column values are maps of key/value pairs that represent a collection of related metrics
Different groups of related metrics are stored under different row keys
Data Model – Example
CF => Country

Row key               | “CA”                               | “US”                               | …
{ video: 123, … }     | { displays: 50, plays: 40, … }     | { displays: 100, plays: 75, … }    | …
{ publisher: 456, … } | { displays: 5000, plays: 4100, … } | { displays: 1100, plays: 756, … }  | …
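The same example as a nested Ruby hash, purely to show the shape of the data (conceptual only, not actual Cassandra client calls):

```ruby
# Conceptual shape of the Country CF: row key => { column name =>
# metrics map }. Values taken from the example; not a Cassandra API.
country_cf = {
  { video: 123 }     => {
    "CA" => { displays: 50,   plays: 40 },
    "US" => { displays: 100,  plays: 75 },
  },
  { publisher: 456 } => {
    "CA" => { displays: 5000, plays: 4100 },
    "US" => { displays: 1100, plays: 756 },
  },
}
```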
Data Model - Timestamps
Row keys have a timestamp component
Row keys have a time granularity component
Allows for efficient queries over large time ranges (few row keys with big numbers)
Preserves granularity at smaller time ranges
Currently Month/Week/Day. Maybe Hour/Minute in the future?
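A sketch of how such row keys might be built; the “|”-separated string layout and the helper name are assumptions, since the slides only show the logical key components (entity, granularity, timestamp):

```ruby
require 'date'

# Hypothetical row-key builder: entity, time granularity, and the
# start of the period. The string layout is an assumption.
def row_key(entity, granularity, date)
  period_start =
    case granularity
    when :day   then date
    when :week  then date - date.cwday + 1            # Monday of that week
    when :month then Date.new(date.year, date.month, 1)
    end
  "#{entity}|#{granularity}|#{period_start.strftime('%Y/%m/%d')}"
end
```

A six-month query at month granularity then touches ~6 row keys instead of ~180 day keys, which is the efficiency claim above.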
Data Model – Timestamps

Row key                           | “CA”              | “US”            | …
{ video: 123, day: 2011/10/31 }   | { plays: 1, … }   | { plays: 1, … } | …
{ video: 123, day: 2011/11/01 }   | { plays: 2, … }   | { plays: 1, … } | …
{ video: 123, day: 2011/11/02 }   | { plays: 4, … }   | null            | …
{ video: 123, day: 2011/11/03 }   | { plays: 8, … }   | { plays: 1, … } | …
{ video: 123, day: 2011/11/04 }   | { plays: 16, … }  | { plays: 1, … } | …
{ video: 123, day: 2011/11/05 }   | { plays: 32, … }  | { plays: 1, … } | …
{ video: 123, day: 2011/11/06 }   | { plays: 64, … }  | { plays: 1, … } | …
{ video: 123, week: 2011/10/31 }  | { plays: 127, … } | { plays: 6, … } | …
Data Model – Metrics
Performance – plays, displays, unique users, time watched, bytes downloaded, etc
Sharing – tweets, facebook shares, diggs, etc
Engagement – how many users watched through certain time buckets of a video
QoS – bitrates, buffering events
Ad – ad requests, impressions, clicks, mouse-overs, failures, etc
Data Model – Metrics
CF => Country

Row key                            | “CA”                               | “US”                               | …
{ video: 123, metrics: video, … }  | { displays: 50, plays: 40, … }     | { displays: 100, plays: 75, … }    | …
{ video: 123, metrics: ad, … }     | { clicks: 3, impressions: 40, … }  | { clicks: 7, impressions: 61, … }  | …
Data Model - Dimensions
Analytics data is sliced in different dimensions == CFs
Example: country. Column names are “US”, “CA”, “JP”, etc
Column values are aggregates of the metric for the row key in that country
For example: the video performance metrics for month of 2011-10-01 in the US for video asset 123
Example: platform. Column names: “desktop:windows:chrome”, “tablet:ipad”, “mobile:android”, “settop:ps3”.
Data Model – Dimensions
Key: { video: 123, … }

CF: Country  | “CA” => { plays: 20, … }                 | “US” => { plays: 30, … }
CF: DMA      | “SF Bay Area” => { plays: 12, … }        | “NYC” => { plays: 5, … }
CF: Platform | “desktop:mac:chrome” => { plays: 60, … } | “settop:ps3” => { plays: 7, … }
Data Model – Indices
Need to efficiently answer “Top N” queries over an aggregate of multiple rows, sorted by some field in the metrics object
But, column sort order is “CA” < “JP” < “US” regardless of field values
Would like to support multiple fields to sort on, anyway
Naïve implementation – read entire rows, aggregate, sort in RAM – pretty slow
Solution: write additional index rows to C*
Data Model – Indices
Every data row may have 0 or more index rows, depending on the metrics type
Index rows – empty column values, column names are prepended with the value of the indexed field, encoded as a fixed-width byte array
Rely on C* to order the columns according to the indexed field
Index rows are stored in separate CFs which have “i_” prepended to the dimension name.
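A sketch of the fixed-width encoding described above; the width and the helper name are assumptions, and the slides show unpadded names like “40:CA” for readability:

```ruby
# Hypothetical fixed-width encoding for index column names: prepend
# the indexed metric, zero-padded to a fixed width (width 10 is an
# assumption), so Cassandra's byte-wise column ordering matches the
# numeric order of the indexed field.
INDEX_WIDTH = 10

def index_column(field_value, column_name)
  format("%0#{INDEX_WIDTH}d:%s", field_value, column_name)
end
```

Without the padding, plain string ordering would sort “9” after “1100”, which is why a fixed-width encoding is needed.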
Data Model – Indices
CF => country

Row key               | “CA”                               | “US”                               | …
{ video: 123, … }     | { displays: 50, plays: 40, … }     | { displays: 100, plays: 75, … }    | …
{ publisher: 456, … } | { displays: 5000, plays: 4100, … } | { displays: 1100, plays: 756, … }  | …

CF => i_country (column values are null)

Row key                              | Column names
{ video: 123, index: plays }         | “40:CA”, “75:US”, …
{ publisher: 456, index: displays }  | “1100:US”, “5000:CA”, …
Data Model – Indices
Trivial to answer a “Top N” query for a single row if the field we sort on has an index: just read the last N columns of the index row
What if the query spans multiple rows?
Use the 3-pass uniform threshold algorithm. Guaranteed to get the top N columns in any multi-row aggregate in 3 RPC calls. See: http://www.cs.ucsb.edu/research/tech_reports/reports/2005-14.pdf
Has some drawbacks: can’t do bottom-N, and fetching only ranks N+1 to 2N directly is impossible – have to read the top 2N and drop half
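The single-row case can be sketched as follows (the helper name is assumed; the input is the index row's column names in their stored, sorted order, with the indexed value fixed-width-encoded as described earlier):

```ruby
# Single-row "Top N": index-row column names are already sorted by
# the encoded metric, so the top N are just the last N columns.
# Strip the encoded prefix to recover the original column name.
def top_n(index_column_names, n)
  index_column_names.last(n).reverse.map { |name| name.split(":", 2).last }
end
```

The multi-row case is where the 3-pass threshold algorithm above comes in; it is considerably more involved than this sketch.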
Data Model – Drilldowns
All cities in the world stored in one row, allowing us to do a global sort. What if we need cities within some region only?
Solution: use “drilldown” indices.
Just a special kind of index that includes only a subset of all data in the parent row.
Example: all cities in the country “US”
Works like regular index otherwise
Not free – more than 1/3rd of all our C* disk usage
The Bad Stuff
Read-modify-write is slow, because in C* read latency >> write latency
Having a write-only pipeline would greatly speed up processing, but makes reading data more expensive (aggregate-on-read)
And/or requires more complicated asynchronous aggregation
Minimum granularity of 1 day is not that good, would like to do 1-hour or 1-minute
But, storage requirements go up very fast
The Bad Stuff
Synchronous updates of time rollups and index rows make processing slower and increase delays
But, asynchronous is harder to get right
Reprocessing of data is currently difficult because of lack of locking – have to pause regular pipeline
Also have to reprocess log files in batches of full days
LESSONS LEARNED
DATA MODEL CHANGES ARE PAINFUL
… so design to make them less so
EVERYTHING WILL BREAK
… so test accordingly
SEPARATE LOGICALLY DIFFERENT DATA
… it will improve performance AND make your life simpler
PERF TEST WITH PRODUCTION LOAD
… if you can afford a second cluster
http://cassandra.apache.org
http://www.datastax.com/dev
http://www.ooyala.com
THANK YOU