Cassandra NYC 2011 - Ilya Maykov - Ooyala - Scaling Video Analytics with Apache Cassandra

Scaling Video Analytics With Apache Cassandra | ILYA MAYKOV | Dec 6th, 2011


Scaling Video Analytics

With Apache Cassandra

ILYA MAYKOV | Dec 6th, 2011

Agenda

Ooyala – quick company overview

What do we mean by “video analytics”?

What are the challenges?

Cassandra at Ooyala - technical details

Lessons learned

Q&A


Analytics Overview

Aggregate and Visualize Data

Optimize automagically

Give Insights

Enable experimentation


Analytics Overview

Go from this …


Analytics Overview

… to this …


Analytics Overview

… and this!

System Architecture

State of Analytics Today

Collect vast amounts of data

Aggregate, slice in various dimensions

Report and visualize

Personalize and recommend

Scalable, fault tolerant, near real-time using Hadoop + Cassandra

Analytics Challenges

Scale

Processing Speed

Depth

Accuracy

Developer speed


Challenge: Scale

150M+ unique monthly users

15M+ monthly video hours

Daily inflow: billions of log pings, TBs of uncompressed logs

10TB+ of historical analytics data in C* covering a period of about 4 years

Exponential data growth in C*: currently 1TB+ per month


Challenge: Processing Speed

Large “fan-out” to multiple dimensions + per-video-asset analytics = lots of data being written. Parallelizable!

“Analytics delay” metric = time from log ping hitting a server to being visible to a publisher in the analytics UI

Current avg. delay: 10-25 minutes depending on time of day

Target max analytics delay: <30 minutes (Hadoop system)

Would like <1 minute (future real-time processing system)


Challenge: Depth

Per-video-asset analytics means millions of new rows added and/or updated in each CF every day

10+ dimensions (CFs) for slicing data in different ways

Queries range from “everything in my account for all time” to “video X in city Y on date Z”

We’d like 1-hour granularity, but that’s up to 24x more rows

Or even 1-minute granularity in real-time, but that could be >1000x more rows …
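The row multipliers quoted above are simple arithmetic on the number of time buckets per day. A quick check, assuming one row per entity per day is the current finest granularity; the figures are upper bounds because a bucket with no activity presumably needs no row:

```python
# Time buckets per day at each candidate granularity, relative to daily rows.
MINUTES_PER_DAY = 24 * 60

BUCKETS_PER_DAY = {
    "day": 1,
    "hour": 24,
    "minute": MINUTES_PER_DAY,
}

def row_multiplier(granularity: str) -> int:
    """Upper bound on how many more rows a granularity needs than daily rows."""
    return BUCKETS_PER_DAY[granularity]

for granularity in ("hour", "minute"):
    print(f"{granularity}: up to {row_multiplier(granularity)}x more rows")
```

So "up to 24x" is exact for hourly buckets, and ">1000x" for minute buckets is 1440x in the worst case.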


Challenge: Accuracy

Publishers make business decisions based on analytics data

Ooyala makes business decisions based on analytics data

Ooyala bills publishers based on analytics data

Analytics need to be accurate and verifiable


Challenge: Developer Speed

We’re still a small company with limited developer resources

Like to iterate fast and release often, but …

… we use Hadoop MR for large-scale data processing

Hadoop is a Java framework

So, MapReduce jobs have to be written in Java … right?


Word Count Example: Java


Word Count Example: Ruby


Word Count Example: Scala

Challenge: Developer Speed

Word Count MR – Language Comparison

Language | Lines | Characters | Development Speed | Runtime Speed | Hadoop API
Java     | 69    | 2395       | Low               | High          | Native
Ruby     | 30    | 738        | High              | Low           | Streaming
Scala    | 35    | 1284       | Medium            | High          | Native
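For readers unfamiliar with the Streaming row in the comparison: a Hadoop Streaming job is a pair of programs that read lines on stdin and print tab-separated key/value lines on stdout, with the framework sorting by key in between. The sketch below shows that contract for word count in Python. It is an illustration only, not the Java, Ruby, or Scala code from the talk, and the local `sorted()` call is a stand-in for Hadoop's shuffle/sort.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a Streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_records):
    """Sum counts per word; input must already be sorted by key."""
    parsed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Local stand-in for the shuffle/sort phase between map and reduce.
    text = ["to be or not to be"]
    for record in reducer(sorted(mapper(text))):
        print(record)
```

Because everything passes through text pipes, any language works, which is the development-speed gain the table credits to Ruby; the per-record serialization is also a plausible source of the lower runtime speed.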

Why Cassandra?


A bit of history

2008 – 2009: Single MySQL DB

Early 2010:

Too much data

Want higher granularity and more ways to slice data

Need a scalable data store!


Why Cassandra?

Linear scaling (space, load) – handles Scale & Depth challenges

Tunable consistency – QUORUM reads plus QUORUM writes give the consistency that accuracy requires

Very fast writes, reasonably fast reads

Great community support, rapidly evolving and improving codebase – 0.6.13 => 0.8.7 increased our performance by >4x

Simpler and fewer dependencies than HBase, richer data model than a simple K/V store, more scalable than an RDBMS, …
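The accuracy argument for QUORUM/QUORUM rests on the usual overlap condition: with N replicas, a read of R replicas must include at least one replica that acknowledged the latest write of W replicas whenever R + W > N. A small sketch of that arithmetic follows; the function names are illustrative and not part of any Cassandra client API.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that make up a quorum: a strict majority."""
    return replication_factor // 2 + 1

def read_sees_latest_write(r: int, w: int, n: int) -> bool:
    """True when any R replicas read must overlap any W replicas written."""
    return r + w > n

n = 3  # example replication factor (an assumption, not stated in the talk)
q = quorum(n)
print(f"RF={n}: quorum={q}, QUORUM/QUORUM overlaps: {read_sees_latest_write(q, q, n)}")
print(f"RF={n}: ONE/ONE overlaps: {read_sees_latest_write(1, 1, n)}")
```

With a replication factor of 3, quorum is 2, so 2 + 2 > 3 holds and every read overlaps the latest write, while ONE/ONE (1 + 1 = 2) does not.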


Data Model - Overview

Row keys specify the entity and time (and some other stuff …)

Column families specify the dimension

Column names specify a data point within that dimension

Column values are maps of key/value pairs that represent a collection of related metrics

Different groups of related metrics are stored under different row keys
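One way to picture the layout described above is as nested maps: column family, then row key, then column name, then a map of metrics. The sketch below uses plain Python dicts with sample numbers from the country example in this deck. The `row_key` string encoding is an assumption made for illustration; the talk does not show how keys are actually serialized, and the time component is left out here.

```python
def row_key(**parts):
    """Illustrative row-key encoding (an assumption): sorted 'name=value' pairs."""
    return ",".join(f"{name}={value}" for name, value in sorted(parts.items()))

# CF (dimension) -> row key (entity) -> column name (data point) -> metrics map
store = {
    "country": {
        row_key(video=123): {
            "CA": {"displays": 50, "plays": 40},
            "US": {"displays": 100, "plays": 75},
        },
        row_key(publisher=456): {
            "CA": {"displays": 5000, "plays": 4100},
            "US": {"displays": 1100, "plays": 756},
        },
    },
}

def read_metrics(cf, key, column):
    """Return the metrics map for one data point, or None if the column is absent."""
    return store.get(cf, {}).get(key, {}).get(column)

print(read_metrics("country", row_key(video=123), "US"))
```

A query such as "plays for video 123 in the US" then becomes a single column read from a single row in the country CF.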

Data Model – Example

CF => Country

Keys                 | Column => “CA”                     | “US”                              | …
{video: 123, … }     | { displays: 50, plays: 40, … }     | { displays: 100, plays: 75, … }   | …
{publisher: 456, … } | { displays: 5000, plays: 4100, … } | { displays: 1100, plays: 756, … } | …
…                    | …                                  | …                                 | …


Data Model - Timestamps

Row keys have a timestamp component

Row keys have a time granularity component

Allows for efficient queries over large time ranges (few row keys with big numbers)

Preserves granularity at smaller time ranges

Currently Month/Week/Day. Maybe Hour/Minute in the future?
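To illustrate why mixed granularities keep large time-range queries cheap, here is a sketch that covers a date range with as few month, week, and day row keys as a simple greedy pass finds. It assumes that weeks start on Monday and that month and week rows are keyed by their first day, which matches the example keys in this deck. The greedy choice is not guaranteed to be minimal, and none of this is taken from Ooyala's code.

```python
from datetime import date, timedelta

def month_end(d: date) -> date:
    """Last day of the month containing d."""
    first_of_next = (d.replace(day=1) + timedelta(days=32)).replace(day=1)
    return first_of_next - timedelta(days=1)

def rollup_keys(start: date, end: date):
    """Greedy cover of [start, end] (inclusive) with (granularity, first_day) keys."""
    keys = []
    day = start
    while day <= end:
        if day.day == 1 and month_end(day) <= end:
            keys.append(("month", day))          # whole month fits
            day = month_end(day) + timedelta(days=1)
        elif day.weekday() == 0 and day + timedelta(days=6) <= end:
            keys.append(("week", day))           # whole Monday-based week fits
            day += timedelta(days=7)
        else:
            keys.append(("day", day))            # fall back to a single day
            day += timedelta(days=1)
    return keys

for key in rollup_keys(date(2011, 10, 30), date(2011, 11, 7)):
    print(key)
```

A nine-day range that contains one full week needs three row keys instead of nine, and a full calendar month needs only one.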

Data Model – Timestamps

Keys                             | “CA”              | “US”            | …
{ video: 123, day: 2011/10/31 }  | { plays: 1, … }   | { plays: 1, … } | …
{ video: 123, day: 2011/11/01 }  | { plays: 2, … }   | { plays: 1, … } | …
{ video: 123, day: 2011/11/02 }  | { plays: 4, … }   | null            | …
{ video: 123, day: 2011/11/03 }  | { plays: 8, … }   | { plays: 1, … } | …
{ video: 123, day: 2011/11/04 }  | { plays: 16, … }  | { plays: 1, … } | …
{ video: 123, day: 2011/11/05 }  | { plays: 32, … }  | { plays: 1, … } | …
{ video: 123, day: 2011/11/06 }  | { plays: 64, … }  | { plays: 1, … } | …
{ video: 123, week: 2011/10/31 } | { plays: 127, … } | { plays: 6, … } | …
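The week row in this example is the column-wise sum of the seven day rows. A short sketch that reproduces the slide's numbers from the daily values; the dict layout is illustrative, and a missing column stands in for the null cell:

```python
from collections import Counter

# Daily rows for video 123, keyed by day; only the "plays" metric is shown.
daily_rows = {
    "2011/10/31": {"CA": {"plays": 1}, "US": {"plays": 1}},
    "2011/11/01": {"CA": {"plays": 2}, "US": {"plays": 1}},
    "2011/11/02": {"CA": {"plays": 4}},  # no "US" column that day (null on the slide)
    "2011/11/03": {"CA": {"plays": 8}, "US": {"plays": 1}},
    "2011/11/04": {"CA": {"plays": 16}, "US": {"plays": 1}},
    "2011/11/05": {"CA": {"plays": 32}, "US": {"plays": 1}},
    "2011/11/06": {"CA": {"plays": 64}, "US": {"plays": 1}},
}

def roll_up(rows):
    """Sum every metric per column across all rows of the finer granularity."""
    totals = {}
    for columns in rows.values():
        for name, metrics in columns.items():
            totals.setdefault(name, Counter()).update(metrics)
    return {name: dict(counter) for name, counter in totals.items()}

print(roll_up(daily_rows))
```

The result matches the week row: 127 plays for “CA” and 6 for “US”.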


Data Model – Metrics

Performance – plays, displays, unique users, time watched, bytes downloaded, etc

Sharing – tweets, Facebook shares, diggs, etc

Engagement – how many users watched through certain time buckets of a video

QoS – bitrates, buffering events

Ad – ad requests, impressions, clicks, mouse-overs, failures, etc

Data Model - Metrics

CF => Country

Keys                             | Column => “CA”                    | “US”                              | …
{video: 123, metrics: video, … } | { displays: 50, plays: 40, … }    | { displays: 100, plays: 75, … }   | …
{video: 123, metrics: ad, … }    | { clicks: 3, impressions: 40, … } | { clicks: 7, impressions: 61, … } | …
…                                | …                                 | …                                 | …


Data Model - Dimensions

Analytics data is sliced in different dimensions == CFs

Example: country. Column names are “US”, “CA”, “JP”, etc

Column values are aggregates of the metric for the row key in that country

For example: the video performance metrics for month of 2011-10-01 in the US for video asset 123

Example: platform. Column names: “desktop:windows:chrome”, “tablet:ipad”, “mobile:android”, “settop:ps3”.

Data Model - Dimensions

Key: {video: 123, …}

CF: Country  | “CA” => { plays: 20, … }                 | “US” => { plays: 30, … }
CF: DMA      | “SF Bay Area” => { plays: 12, … }        | “NYC” => { plays: 5, … }
CF: Platform | “desktop:mac:chrome” => { plays: 60, … } | “settop:ps3” => { plays: 7, … }


Data Model – Indices

Need to efficiently answer “Top N” queries over an aggregate of multiple rows, sorted by some field in the metrics object

But, column sort order is “CA” < “JP” < “US” regardless of field values

Would like to support multiple fields to sort on, anyway

Naïve implementation – read entire rows, aggregate, sort in RAM – pretty slow

Solution: write additional index rows to C*
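For contrast with the index-based approach, the naïve implementation mentioned above can be sketched in a few lines: read every column of every row, aggregate in memory, then sort. The row data and function name here are illustrative.

```python
import heapq
from collections import Counter

def naive_top_n(rows, field, n):
    """Read entire rows, sum one metric field per column name, sort in RAM."""
    totals = Counter()
    for columns in rows:                      # every row is read in full
        for name, metrics in columns.items():
            totals[name] += metrics.get(field, 0)
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])

rows = [
    {"CA": {"plays": 40}, "US": {"plays": 75}},
    {"CA": {"plays": 4100}, "US": {"plays": 756}, "JP": {"plays": 10}},
]
print(naive_top_n(rows, "plays", 2))
```

The cost grows with the total number of columns read, not with N, which is why this gets slow for rows with many columns (for example, every city in the world).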


Data Model – Indices

Every data row may have 0 or more index rows, depending on the metrics type

Index rows – empty column values, column names are prepended with the value of the indexed field, encoded as a fixed-width byte array

Rely on C* to order the columns according to the indexed field

Index rows are stored in separate CFs which have “i_” prepended to the dimension name.
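The fixed-width encoding matters because columns are compared byte by byte: as plain text, “9:JP” would sort after “75:US”. A sketch of one possible encoding follows, assuming an 8-byte big-endian unsigned integer prefix and a byte-ordered column comparator. The width, the layout, and the function names are assumptions, since the talk does not specify them.

```python
import struct

def index_column_name(value: int, column: str) -> bytes:
    """Prefix the column name with the indexed value as 8 big-endian bytes."""
    return struct.pack(">Q", value) + column.encode("utf-8")

def decode_index_column(name: bytes):
    """Split an index column name back into (value, original column name)."""
    (value,) = struct.unpack(">Q", name[:8])
    return value, name[8:].decode("utf-8")

def top_n(index_columns, n):
    """Take the last n columns in byte order, largest value first."""
    ordered = sorted(index_columns)           # what a bytes comparator would store
    return [decode_index_column(name) for name in ordered[-n:][::-1]]

plays = {"CA": 40, "US": 75, "JP": 9}
index_row = [index_column_name(v, c) for c, v in plays.items()]
print(top_n(index_row, 2))

# Plain-text prefixes would sort incorrectly: "9:JP" lands after "75:US".
print(sorted(["40:CA", "75:US", "9:JP"]))
```

With big-endian fixed-width bytes, byte order and numeric order agree for non-negative integers, so "Top N" is a read of the last N columns.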

Data Model - Indices

CF => country

Keys                | Column Name => “CA”                | “US”                              | …
{video: 123, …}     | { displays: 50, plays: 40, … }     | { displays: 100, plays: 75, … }   | …
{publisher: 456, …} | { displays: 5000, plays: 4100, … } | { displays: 1100, plays: 756, … } | …

CF => i_country

Keys                              | Columns
{video: 123, index: plays}        | Name: “40:CA”, Value: null   | Name: “75:US”, Value: null   | …
{publisher: 456, index: displays} | Name: “5000:CA”, Value: null | Name: “1100:US”, Value: null | …
…                                 | …                            | …                            | …

Data Model – Indices

Trivial to answer a “Top N” query for a single row if the field we sort on has an index: just read the last N columns of the index row

What if the query spans multiple rows?

Use the 3-pass uniform threshold algorithm. Guaranteed to get the top-N columns in any multi-row aggregate in 3 RPC calls. See: [http://www.cs.ucsb.edu/research/tech_reports/reports/2005-14.pdf]

Has some drawbacks: can’t do bottom-N, and can’t fetch ranks N+1 through 2N directly; have to fetch the top 2N and drop the first half.
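A three-pass uniform threshold query can be sketched over in-memory dicts as follows. This follows the general scheme from the distributed top-k literature: fetch each row's top N, derive a uniform threshold, fetch everything at or above it, then resolve the surviving candidates exactly. It assumes non-negative values, and it is not the speaker's implementation. All names are illustrative.

```python
import heapq
from collections import defaultdict

def kth_largest(values, k):
    """k-th largest value, or 0 if there are fewer than k values."""
    ordered = sorted(values, reverse=True)
    return ordered[k - 1] if len(ordered) >= k else 0

def three_pass_top_n(rows, n):
    """rows: list of {column name: non-negative value}. Returns top-n totals."""
    m = len(rows)

    # Pass 1: top n columns of each row (a cheap slice of an index row).
    partial = defaultdict(int)
    for row in rows:
        for name, value in heapq.nlargest(n, row.items(), key=lambda kv: kv[1]):
            partial[name] += value
    tau1 = kth_largest(partial.values(), n)   # lower bound on the n-th best total
    threshold = tau1 / m                      # a top-n column reaches this in some row

    # Pass 2: every column whose value in a row is at least the threshold.
    partial, seen = defaultdict(int), defaultdict(int)
    for row in rows:
        for name, value in row.items():
            if value >= threshold:
                partial[name] += value
                seen[name] += 1
    tau2 = kth_largest(partial.values(), n)
    # Rows that did not report a column hold less than `threshold` for it.
    candidates = [name for name in partial
                  if partial[name] + threshold * (m - seen[name]) >= tau2]

    # Pass 3: exact totals for the surviving candidates only.
    totals = {name: sum(row.get(name, 0) for row in rows) for name in candidates}
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])

rows = [
    {"CA": 40, "US": 75, "JP": 10},
    {"CA": 50, "US": 5, "JP": 30},
    {"CA": 1, "US": 2, "JP": 45, "DE": 3},
]
print(three_pass_top_n(rows, 2))
```

Each pass maps to one round of reads, and only columns at the high end of each index row are ever fetched, which is consistent with the bottom-N limitation noted above.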


Data Model – Drilldowns

All cities in the world stored in one row, allowing us to do a global sort. What if we need cities within some region only?

Solution: use “drilldown” indices.

Just a special kind of index that includes only a subset of all data in the parent row.

Example: all cities in the country “US”

Works like regular index otherwise

Not free – more than 1/3rd of all our C* disk usage


The Bad Stuff

Read-modify-write is slow, because in C* read latency >> write latency

Having a write-only pipeline would greatly speed up processing, but makes reading data more expensive (aggregate-on-read)

And/or requires more complicated asynchronous aggregation

Minimum granularity of 1 day is not that good, would like to do 1-hour or 1-minute

But, storage requirements go up very fast
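The trade-off between read-modify-write and a write-only pipeline can be made concrete with two toy stores that count their own reads and writes. This is a conceptual sketch with hypothetical classes, not Cassandra code.

```python
from collections import defaultdict

class ReadModifyWriteStore:
    """Each update reads the current total, adds to it, and writes it back."""
    def __init__(self):
        self.totals = {}
        self.reads = 0
        self.writes = 0

    def add(self, key, delta):
        self.reads += 1                       # the slow step: read before write
        current = self.totals.get(key, 0)
        self.totals[key] = current + delta
        self.writes += 1

    def get(self, key):
        self.reads += 1
        return self.totals.get(key, 0)

class WriteOnlyStore:
    """Each update appends a delta; totals are computed when someone reads."""
    def __init__(self):
        self.deltas = defaultdict(list)
        self.reads = 0
        self.writes = 0

    def add(self, key, delta):
        self.deltas[key].append(delta)        # no read on the ingest path
        self.writes += 1

    def get(self, key):
        self.reads += 1
        return sum(self.deltas.get(key, []))  # aggregate-on-read

rmw, wo = ReadModifyWriteStore(), WriteOnlyStore()
for delta in (1, 2, 4):
    rmw.add("plays", delta)
    wo.add("plays", delta)
print("ingest reads:", rmw.reads, "vs", wo.reads)
```

Ingest in the write-only store performs no reads at all, but every query has to sum the stored deltas, which is the extra read cost (or the asynchronous aggregation work) mentioned above.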


The Bad Stuff

Synchronous updates of time rollups and index rows make processing slower and increase delays

But, asynchronous is harder to get right

Reprocessing of data is currently difficult because of lack of locking – have to pause regular pipeline

Also have to reprocess log files in batches of full days

LESSONS LEARNED

DATA MODEL CHANGES ARE PAINFUL

… so design to make them less so

EVERYTHING WILL BREAK

… so test accordingly

SEPARATE LOGICALLY DIFFERENT DATA

… it will improve performance AND make your life simpler

PERF TEST WITH PRODUCTION LOAD

… if you can afford a second cluster


THANK YOU
