Cassandra NYC 2011: Ilya Maykov (Ooyala) – Scaling Video Analytics with Apache Cassandra
Scaling Video Analytics
With Apache Cassandra
ILYA MAYKOV | Dec 6th, 2011
Agenda
Ooyala – quick company overview
What do we mean by “video analytics”?
What are the challenges?
Cassandra at Ooyala – technical details
Lessons learned
Q&A
Analytics Overview
Aggregate and Visualize Data
Optimize automagically
Give Insights
Enable experimentation
Analytics Overview
Go from this … to this … and this!
System Architecture
State of Analytics Today
Collect vast amounts of data
Aggregate, slice in various dimensions
Report and visualize
Personalize and recommend
Scalable, fault tolerant, near real-time using Hadoop + Cassandra
Analytics Challenges
Scale
Processing Speed
Depth
Accuracy
Developer Speed
Challenge: Scale
150M+ unique monthly users
15M+ monthly video hours
Daily inflow: billions of log pings, TBs of uncompressed logs
10TB+ of historical analytics data in C* covering a period of about 4 years
Exponential data growth in C*: currently 1TB+ per month
Challenge: Processing Speed
Large “fan-out” to multiple dimensions + per-video-asset analytics = lots of data being written. Parallelizable!
“Analytics delay” metric = time from log ping hitting a server to being visible to a publisher in the analytics UI
Current avg. delay: 10-25 minutes depending on time of day
Target max analytics delay: <30 minutes (Hadoop system)
Would like <1 minute (future real-time processing system)
Challenge: Depth
Per-video-asset analytics means millions of new rows added and/or updated in each CF every day
10+ dimensions (CFs) for slicing data in different ways
Queries range from “everything in my account for all time” to “video X in city Y on date Z”
We’d like 1-hour granularity, but that’s up to 24x more rows
Or even 1-minute granularity in real-time, but that could be >1000x more rows …
Challenge: Accuracy
Publishers make business decisions based on analytics data
Ooyala makes business decisions based on analytics data
Ooyala bills publishers based on analytics data
Analytics need to be accurate and verifiable
Challenge: Developer Speed
We’re still a small company with limited developer resources
Like to iterate fast and release often, but …
… we use Hadoop MR for large-scale data processing
Hadoop is a Java framework
So, MapReduce jobs have to be written in Java … right?
Word Count Example: Java (code listing on slide)
Word Count Example: Ruby (code listing on slide)
Word Count Example: Scala (code listing on slide)
Challenge: Developer Speed
Word Count MR – Language Comparison

Language | Lines | Characters | Development Speed | Runtime Speed | Hadoop API
Java     |    69 |       2395 | Low               | High          | Native
Ruby     |    30 |        738 | High              | Low           | Streaming
Scala    |    35 |       1284 | Medium            | High          | Native
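The Java, Ruby, and Scala word count listings were shown as slide images and are not preserved in this transcript. A minimal sketch of what the Ruby (Hadoop Streaming) version could look like, assuming the usual mapper/reducer split over stdin/stdout (not the original slide code; the script name below is illustrative):

```ruby
# Sketch of a Hadoop Streaming word count in Ruby (assumed, not the
# original slide code). Streaming runs mapper and reducer as separate
# processes, with a sort over the mapper output between them.

# Mapper: emit "word<TAB>1" for every word in a line.
def map_line(line)
  line.downcase.scan(/\w+/).map { |w| "#{w}\t1" }
end

# Reducer: counts for the same word arrive contiguously because
# streaming sorts mapper output by key before the reduce phase.
def reduce_lines(lines)
  counts = Hash.new(0)
  lines.each do |l|
    word, n = l.split("\t")
    counts[word] += n.to_i
  end
  counts.map { |word, c| "#{word}\t#{c}" }
end

if __FILE__ == $PROGRAM_NAME
  case ARGV[0]
  when "map"    then STDIN.each_line { |l| puts map_line(l) }
  when "reduce" then puts reduce_lines(STDIN.readlines)
  end
end
```

The brevity next to the equivalent Java Mapper/Reducer classes is the point of the comparison table above, at the cost of Streaming's process-per-task overhead.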
Why Cassandra?
A bit of history
2008 – 2009: Single MySQL DB
Early 2010:
Too much data
Want higher granularity and more ways to slice data
Need a scalable data store!
Why Cassandra?
Linear scaling (space, load) – handles Scale & Depth challenges
Tunable consistency – QUORUM/QUORUM R/W allows accuracy
Very fast writes, reasonably fast reads
Great community support, rapidly evolving and improving codebase – 0.6.13 => 0.8.7 increased our performance by >4x
Simpler and fewer dependencies than HBase, richer data model than a simple K/V store, more scalable than an RDBMS, …
Data Model - Overview
Row keys specify the entity and time (and some other stuff …)
Column families specify the dimension
Column names specify a data point within that dimension
Column values are maps of key/value pairs that represent a collection of related metrics
Different groups of related metrics are stored under different row keys
Data Model – Example
CF => Country

Row key               | “CA”                               | “US”                               | …
{ video: 123, … }     | { displays: 50, plays: 40, … }     | { displays: 100, plays: 75, … }    | …
{ publisher: 456, … } | { displays: 5000, plays: 4100, … } | { displays: 1100, plays: 756, … }  | …
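The same example as a nested Ruby hash, purely to show the shape of the data (conceptual only, not actual Cassandra client calls):

```ruby
# Conceptual shape of the Country CF: row key => { column name =>
# metrics map }. Values taken from the example; not a Cassandra API.
country_cf = {
  { video: 123 }     => {
    "CA" => { displays: 50,   plays: 40 },
    "US" => { displays: 100,  plays: 75 },
  },
  { publisher: 456 } => {
    "CA" => { displays: 5000, plays: 4100 },
    "US" => { displays: 1100, plays: 756 },
  },
}
```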
Data Model - Timestamps
Row keys have a timestamp component
Row keys have a time granularity component
Allows for efficient queries over large time ranges (few row keys with big numbers)
Preserves granularity at smaller time ranges
Currently Month/Week/Day. Maybe Hour/Minute in the future?
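A sketch of how such row keys might be built; the “|”-separated string layout and the helper name are assumptions, since the slides only show the logical key components (entity, granularity, timestamp):

```ruby
require 'date'

# Hypothetical row-key builder: entity, time granularity, and the
# start of the period. The string layout is an assumption.
def row_key(entity, granularity, date)
  period_start =
    case granularity
    when :day   then date
    when :week  then date - date.cwday + 1            # Monday of that week
    when :month then Date.new(date.year, date.month, 1)
    end
  "#{entity}|#{granularity}|#{period_start.strftime('%Y/%m/%d')}"
end
```

A six-month query at month granularity then touches ~6 row keys instead of ~180 day keys, which is the efficiency claim above.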
Data Model – Timestamps

Row key                           | “CA”              | “US”            | …
{ video: 123, day: 2011/10/31 }   | { plays: 1, … }   | { plays: 1, … } | …
{ video: 123, day: 2011/11/01 }   | { plays: 2, … }   | { plays: 1, … } | …
{ video: 123, day: 2011/11/02 }   | { plays: 4, … }   | null            | …
{ video: 123, day: 2011/11/03 }   | { plays: 8, … }   | { plays: 1, … } | …
{ video: 123, day: 2011/11/04 }   | { plays: 16, … }  | { plays: 1, … } | …
{ video: 123, day: 2011/11/05 }   | { plays: 32, … }  | { plays: 1, … } | …
{ video: 123, day: 2011/11/06 }   | { plays: 64, … }  | { plays: 1, … } | …
{ video: 123, week: 2011/10/31 }  | { plays: 127, … } | { plays: 6, … } | …
Data Model – Metrics
Performance – plays, displays, unique users, time watched, bytes downloaded, etc
Sharing – tweets, facebook shares, diggs, etc
Engagement – how many users watched through certain time buckets of a video
QoS – bitrates, buffering events
Ad – ad requests, impressions, clicks, mouse-overs, failures, etc
Data Model – Metrics
CF => Country

Row key                            | “CA”                               | “US”                               | …
{ video: 123, metrics: video, … }  | { displays: 50, plays: 40, … }     | { displays: 100, plays: 75, … }    | …
{ video: 123, metrics: ad, … }     | { clicks: 3, impressions: 40, … }  | { clicks: 7, impressions: 61, … }  | …
Data Model - Dimensions
Analytics data is sliced in different dimensions == CFs
Example: country. Column names are “US”, “CA”, “JP”, etc
Column values are aggregates of the metric for the row key in that country
For example: the video performance metrics for month of 2011-10-01 in the US for video asset 123
Example: platform. Column names: “desktop:windows:chrome”, “tablet:ipad”, “mobile:android”, “settop:ps3”.
Data Model – Dimensions
Key: { video: 123, … }

CF: Country  | “CA” => { plays: 20, … }                 | “US” => { plays: 30, … }
CF: DMA      | “SF Bay Area” => { plays: 12, … }        | “NYC” => { plays: 5, … }
CF: Platform | “desktop:mac:chrome” => { plays: 60, … } | “settop:ps3” => { plays: 7, … }
Data Model – Indices
Need to efficiently answer “Top N” queries over an aggregate of multiple rows, sorted by some field in the metrics object
But, column sort order is “CA” < “JP” < “US” regardless of field values
Would like to support multiple fields to sort on, anyway
Naïve implementation – read entire rows, aggregate, sort in RAM – pretty slow
Solution: write additional index rows to C*
Data Model – Indices
Every data row may have 0 or more index rows, depending on the metrics type
Index rows – empty column values, column names are prepended with the value of the indexed field, encoded as a fixed-width byte array
Rely on C* to order the columns according to the indexed field
Index rows are stored in separate CFs which have “i_” prepended to the dimension name.
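A sketch of the fixed-width encoding described above; the width and the helper name are assumptions, and the slides show unpadded names like “40:CA” for readability:

```ruby
# Hypothetical fixed-width encoding for index column names: prepend
# the indexed metric, zero-padded to a fixed width (width 10 is an
# assumption), so Cassandra's byte-wise column ordering matches the
# numeric order of the indexed field.
INDEX_WIDTH = 10

def index_column(field_value, column_name)
  format("%0#{INDEX_WIDTH}d:%s", field_value, column_name)
end
```

Without the padding, plain string ordering would sort “9” after “1100”, which is why a fixed-width encoding is needed.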
Data Model – Indices
CF => country

Row key               | “CA”                               | “US”                               | …
{ video: 123, … }     | { displays: 50, plays: 40, … }     | { displays: 100, plays: 75, … }    | …
{ publisher: 456, … } | { displays: 5000, plays: 4100, … } | { displays: 1100, plays: 756, … }  | …

CF => i_country (column values are null)

Row key                              | Column names
{ video: 123, index: plays }         | “40:CA”, “75:US”, …
{ publisher: 456, index: displays }  | “1100:US”, “5000:CA”, …
Data Model – Indices
Trivial to answer a “Top N” query for a single row if the field we sort on has an index: just read the last N columns of the index row
What if the query spans multiple rows?
Use the 3-pass uniform threshold algorithm. Guaranteed to get the top N columns in any multi-row aggregate in 3 RPC calls. See: http://www.cs.ucsb.edu/research/tech_reports/reports/2005-14.pdf
Has some drawbacks: can’t do bottom-N, and fetching only ranks N+1 to 2N directly is impossible – have to read the top 2N and drop half
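The single-row case can be sketched as follows (the helper name is assumed; the input is the index row's column names in their stored, sorted order, with the indexed value fixed-width-encoded as described earlier):

```ruby
# Single-row "Top N": index-row column names are already sorted by
# the encoded metric, so the top N are just the last N columns.
# Strip the encoded prefix to recover the original column name.
def top_n(index_column_names, n)
  index_column_names.last(n).reverse.map { |name| name.split(":", 2).last }
end
```

The multi-row case is where the 3-pass threshold algorithm above comes in; it is considerably more involved than this sketch.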
Data Model – Drilldowns
All cities in the world stored in one row, allowing us to do a global sort. What if we need cities within some region only?
Solution: use “drilldown” indices.
Just a special kind of index that includes only a subset of all data in the parent row.
Example: all cities in the country “US”
Works like regular index otherwise
Not free – more than 1/3rd of all our C* disk usage
The Bad Stuff
Read-modify-write is slow, because in C* read latency >> write latency
Having a write-only pipeline would greatly speed up processing, but makes reading data more expensive (aggregate-on-read)
And/or requires more complicated asynchronous aggregation
Minimum granularity of 1 day is not that good, would like to do 1-hour or 1-minute
But, storage requirements go up very fast
The Bad Stuff
Synchronous updates of time rollups and index rows make processing slower and increase delays
But, asynchronous is harder to get right
Reprocessing of data is currently difficult because of lack of locking – have to pause regular pipeline
Also have to reprocess log files in batches of full days
LESSONS LEARNED
DATA MODEL CHANGES ARE PAINFUL
… so design to make them less so
EVERYTHING WILL BREAK
… so test accordingly
SEPARATE LOGICALLY DIFFERENT DATA
… it will improve performance AND make your life simpler
PERF TEST WITH PRODUCTION LOAD
… if you can afford a second cluster
http://cassandra.apache.org
http://www.datastax.com/dev
http://www.ooyala.com
THANK YOU