Cassandra NYC 2011 – Ilya Maykov (Ooyala): Scaling Video Analytics with Apache Cassandra

Post on 14-Dec-2014



Scaling Video Analytics

With Apache Cassandra

ILYA MAYKOV | Dec 6th, 2011


Agenda

Ooyala – quick company overview

What do we mean by “video analytics”?

What are the challenges?

Cassandra at Ooyala – technical details

Lessons learned

Q&A


Analytics Overview

1. Give Insights

2. Enable experimentation

3. Aggregate and Visualize Data

4. Optimize automagically

Analytics Overview

[Image slides: “Go from this … to this … and this!” – visuals not captured in the transcript]

System Architecture

[Diagram slides; images not captured in the transcript]

State of Analytics Today

Collect vast amounts of data

Aggregate, slice in various dimensions

Report and visualize

Personalize and recommend

Scalable, fault-tolerant, near real-time using Hadoop + Cassandra

Analytics Challenges

Scale

Processing Speed

Depth

Accuracy

Developer Speed

Challenge: Scale

150M+ unique monthly users

15M+ monthly video hours

Daily inflow: billions of log pings, TBs of uncompressed logs

10TB+ of historical analytics data in C* covering a period of about 4 years

Exponential data growth in C*: currently 1TB+ per month


Challenge: Processing Speed

Large “fan-out” to multiple dimensions + per-video-asset analytics = lots of data being written. Parallelizable!

“Analytics delay” metric = time from log ping hitting a server to being visible to a publisher in the analytics UI

Current avg. delay: 10-25 minutes depending on time of day

Target max analytics delay: <30 minutes (Hadoop system)

Would like <1 minute (future real-time processing system)


Challenge: Depth

Per-video-asset analytics means millions of new rows added and/or updated in each CF every day

10+ dimensions (CFs) for slicing data in different ways

Queries range from “everything in my account for all time” to “video X in city Y on date Z”

We’d like 1-hour granularity, but that’s up to 24x more rows

Or even 1-minute granularity in real-time, but that could be >1000x more rows …


Challenge: Accuracy

Publishers make business decisions based on analytics data

Ooyala makes business decisions based on analytics data

Ooyala bills publishers based on analytics data

Analytics need to be accurate and verifiable


Challenge: Developer Speed

We’re still a small company with limited developer resources

Like to iterate fast and release often, but …

… we use Hadoop MR for large-scale data processing

Hadoop is a Java framework

So, MapReduce jobs have to be written in Java … right?


Word Count Example: Java / Ruby / Scala

[Code slides; the implementations are not captured in the transcript – see the comparison table below]

Challenge: Developer Speed

Word Count MR – Language Comparison

Language | Lines | Characters | Development Speed | Runtime Speed | Hadoop API
Java     |    69 |       2395 | Low               | High          | Native
Ruby     |    30 |        738 | High              | Low           | Streaming
Scala    |    35 |       1284 | Medium            | High          | Native
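The code from those slides is not in the transcript. As a rough, hypothetical stand-in (plain Scala, no Hadoop dependency), the two phases of a word-count MapReduce job look like this:

```scala
// Minimal word-count sketch in plain Scala, mirroring the map and reduce
// phases of the Hadoop job the slides compare. This is an illustration,
// not the code shown at the talk.
object WordCount {
  // Map phase: emit a (word, 1) pair for every word in every line.
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    for {
      line <- lines
      word <- line.toLowerCase.split("\\W+").toSeq
      if word.nonEmpty
    } yield (word, 1)

  // Reduce phase: group the pairs by word and sum the counts.
  def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => word -> ps.map(_._2).sum }

  def count(lines: Seq[String]): Map[String, Int] = reducePhase(mapPhase(lines))
}
```

For example, `WordCount.count(Seq("to be or not to be"))` maps each word to its count. In Scala the whole job stays close to this density, which is the point of the comparison above.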

Why Cassandra?


A bit of history

2008 – 2009: Single MySQL DB

Early 2010:

Too much data

Want higher granularity and more ways to slice data

Need a scalable data store!


Why Cassandra?

Linear scaling (space, load) – handles Scale & Depth challenges

Tunable consistency – QUORUM/QUORUM R/W allows accuracy

Very fast writes, reasonably fast reads

Great community support, rapidly evolving and improving codebase – 0.6.13 => 0.8.7 increased our performance by >4x

Simpler than HBase with fewer dependencies, richer data model than a simple K/V store, more scalable than an RDBMS, …
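Why QUORUM/QUORUM reads and writes give that accuracy: with replication factor RF, a write acknowledged by W replicas and a read touching R replicas share at least one replica whenever R + W > RF. A small sketch of the arithmetic:

```scala
// QUORUM in Cassandra is floor(RF / 2) + 1 replicas. Reading and writing
// both at QUORUM guarantees R + W > RF, so every read overlaps the replica
// set of the latest write.
def quorum(rf: Int): Int = rf / 2 + 1

def readSeesWrite(r: Int, w: Int, rf: Int): Boolean = r + w > rf

// With RF = 3, QUORUM is 2, and 2 + 2 > 3 holds; with ONE/ONE (1 + 1 > 3
// fails) a read may miss the latest write.
```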


Data Model - Overview

Row keys specify the entity and time (and some other stuff …)

Column families specify the dimension

Column names specify a data point within that dimension

Column values are maps of key/value pairs that represent a collection of related metrics

Different groups of related metrics are stored under different row keys
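The mapping above can be written down as hypothetical Scala types (illustrative names, not Ooyala's actual schema):

```scala
// Row keys name an entity plus time (and "other stuff"); the column family
// is the dimension; the column name is a data point in that dimension; the
// column value is a map of related metrics. All names are hypothetical.
case class RowKey(entity: String, id: Long, granularity: String, date: String)

// One cell: (row key, dimension CF, column name) -> metrics map.
case class Cell(key: RowKey, dimension: String, column: String, metrics: Map[String, Long])

val cell = Cell(
  RowKey("video", 123L, "month", "2011/10/01"),
  dimension = "Country",
  column = "US",
  metrics = Map("displays" -> 100L, "plays" -> 75L)
)
```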


Data Model – Example

CF => Country

Keys                | “CA”                               | “US”                              | …
{video: 123, …}     | { displays: 50, plays: 40, … }     | { displays: 100, plays: 75, … }   | …
{publisher: 456, …} | { displays: 5000, plays: 4100, … } | { displays: 1100, plays: 756, … } | …

Data Model - Timestamps

Row keys have a timestamp component

Row keys have a time granularity component

Allows for efficient queries over large time ranges (few row keys with big numbers)

Preserves granularity at smaller time ranges

Currently Month/Week/Day. Maybe Hour/Minute in the future?
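A sketch of how the time and granularity components keep range queries cheap (the string key format here is an assumption; the talk does not show the actual encoding):

```scala
// One "month" row replaces ~30 "day" rows for a month-long query, while the
// "day" rows preserve granularity for short ranges. Key format is assumed
// for illustration only.
def rowKey(entity: String, id: Long, granularity: String, date: String): String =
  s"$entity:$id:$granularity:$date"

// Month-long query: a single row key.
val monthKey = rowKey("video", 123L, "month", "2011/10/01")

// Same period at day granularity: 31 row keys.
val dayKeys = (1 to 31).map(d => rowKey("video", 123L, "day", f"2011/10/$d%02d"))
```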


Data Model – Timestamps

Keys                           | “CA”              | “US”            | …
{video: 123, day: 2011/10/31}  | { plays: 1, … }   | { plays: 1, … } | …
{video: 123, day: 2011/11/01}  | { plays: 2, … }   | { plays: 1, … } | …
{video: 123, day: 2011/11/02}  | { plays: 4, … }   | null            | …
{video: 123, day: 2011/11/03}  | { plays: 8, … }   | { plays: 1, … } | …
{video: 123, day: 2011/11/04}  | { plays: 16, … }  | { plays: 1, … } | …
{video: 123, day: 2011/11/05}  | { plays: 32, … }  | { plays: 1, … } | …
{video: 123, day: 2011/11/06}  | { plays: 64, … }  | { plays: 1, … } | …
{video: 123, week: 2011/10/31} | { plays: 127, … } | { plays: 6, … } | …

Data Model – Metrics

Performance – plays, displays, unique users, time watched, bytes downloaded, etc

Sharing – tweets, facebook shares, diggs, etc

Engagement – how many users watched through certain time buckets of a video

QoS – bitrates, buffering events

Ad – ad requests, impressions, clicks, mouse-overs, failures, etc


Data Model - Metrics

CF => Country

Keys                            | “CA”                              | “US”                              | …
{video: 123, metrics: video, …} | { displays: 50, plays: 40, … }    | { displays: 100, plays: 75, … }   | …
{video: 123, metrics: ad, …}    | { clicks: 3, impressions: 40, … } | { clicks: 7, impressions: 61, … } | …

Data Model - Dimensions

Analytics data is sliced in different dimensions == CFs

Example: country. Column names are “US”, “CA”, “JP”, etc

Column values are aggregates of the metric for the row key in that country

For example: the video performance metrics for month of 2011-10-01 in the US for video asset 123

Example: platform. Column names: “desktop:windows:chrome”, “tablet:ipad”, “mobile:android”, “settop:ps3”.


Data Model - Dimensions

Key: {video: 123, …}

CF: Country  | “CA” → { plays: 20, … }                 | “US” → { plays: 30, … }
CF: DMA      | “SF Bay Area” → { plays: 12, … }        | “NYC” → { plays: 5, … }
CF: Platform | “desktop:mac:chrome” → { plays: 60, … } | “settop:ps3” → { plays: 7, … }

Data Model – Indices

Need to efficiently answer “Top N” queries over an aggregate of multiple rows, sorted by some field in the metrics object

But, column sort order is “CA” < “JP” < “US” regardless of field values

Would like to support multiple fields to sort on, anyway

Naïve implementation – read entire rows, aggregate, sort in RAM – pretty slow

Solution: write additional index rows to C*


Data Model – Indices

Every data row may have 0 or more index rows, depending on the metrics type

Index rows – empty column values, column names are prepended with the value of the indexed field, encoded as a fixed-width byte array

Rely on C* to order the columns according to the indexed field

Index rows are stored in separate CFs which have “i_” prepended to the dimension name.
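A sketch of the fixed-width trick, with zero-padded decimal strings standing in for the fixed-width byte arrays the slides mention (the width and format are assumptions):

```scala
// Prepending the indexed field value, encoded at a fixed width, makes
// Cassandra's lexicographic column ordering coincide with numeric ordering
// of the field. Width 10 is an assumption for illustration.
def indexColumnName(fieldValue: Long, column: String): String =
  f"$fieldValue%010d:$column"

val cols = Seq(
  indexColumnName(75, "US"),
  indexColumnName(4100, "MX"),
  indexColumnName(40, "CA")
)

// A plain lexicographic sort (what Cassandra does with the column names)
// yields ascending numeric order of the indexed field.
val sorted = cols.sorted
```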


Data Model – Indices

CF => country

Keys                | “CA”                               | “US”                              | …
{video: 123, …}     | { displays: 50, plays: 40, … }     | { displays: 100, plays: 75, … }   | …
{publisher: 456, …} | { displays: 5000, plays: 4100, … } | { displays: 1100, plays: 756, … } | …

CF => i_country

Keys                              | Columns (name → value)
{video: 123, index: plays}        | “40:CA” → null, “75:US” → null, …
{publisher: 456, index: displays} | “5000:CA” → null, “1100:US” → null, …

Data Model – Indices

Trivial to answer a “Top N” query for a single row if the field we sort on has an index: just read the last N columns of the index row

What if the query spans multiple rows?

Use 3-pass uniform threshold algorithm. Guaranteed to get the top-N columns in any multi-row aggregate in 3 RPC calls. See: [http://www.cs.ucsb.edu/research/tech_reports/reports/2005-14.pdf]

Has some drawbacks: can’t do bottom-N, computing top-N-to-2N is impossible, have to do top-2N and drop half.
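For the single-row case, a small sketch of why the query is trivial, assuming index column names of the form “paddedValue:name” (a hypothetical format, not the exact encoding):

```scala
// Single-row Top-N: the index row's columns already sort by the metric, so
// the answer is simply the last N columns, read back to front. The
// "paddedValue:name" column format is an assumption for illustration.
def topN(indexColumns: Seq[String], n: Int): Seq[String] =
  indexColumns.sorted.takeRight(n).reverse.map(_.split(':')(1))

// Top 2 countries by plays from a tiny index row:
val answer = topN(Seq("0040:CA", "0075:US", "0012:JP"), 2)
```

The same shape also shows the drawback noted above: reading from the high end gives top-N cheaply, but ranges like top-N-to-2N require over-reading and discarding.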


Data Model – Drilldowns

All cities in the world stored in one row, allowing us to do a global sort. What if we need cities within some region only?

Solution: use “drilldown” indices.

Just a special kind of index that includes only a subset of all data in the parent row.

Example: all cities in the country “US”

Works like regular index otherwise

Not free – more than 1/3rd of all our C* disk usage


The Bad Stuff

Read-modify-write is slow, because in C* read latency >> write latency

Having a write-only pipeline would greatly speed up processing, but makes reading data more expensive (aggregate-on-read)

And/or requires more complicated asynchronous aggregation

Minimum granularity of 1 day is not that good, would like to do 1-hour or 1-minute

But, storage requirements go up very fast
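A sketch of the aggregate-on-read trade-off described above: the pipeline only appends per-batch deltas (cheap writes, no read-modify-write), and readers pay instead by merging the deltas. Illustrative only:

```scala
// Write-only pipeline: each processing batch appends a delta metrics map;
// nothing is read back or overwritten. Readers merge the deltas at query
// time (aggregate-on-read), which makes reads more expensive.
def aggregateOnRead(deltas: Seq[Map[String, Long]]): Map[String, Long] =
  deltas.foldLeft(Map.empty[String, Long]) { (acc, delta) =>
    delta.foldLeft(acc) { case (m, (metric, v)) =>
      m.updated(metric, m.getOrElse(metric, 0L) + v)
    }
  }

// Two batches' worth of deltas summed at read time:
val total = aggregateOnRead(Seq(
  Map("plays" -> 1L),
  Map("plays" -> 2L, "displays" -> 5L)
))
```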


The Bad Stuff

Synchronous updates of time rollups and index rows make processing slower and increase delays

But, asynchronous is harder to get right

Reprocessing of data is currently difficult because of lack of locking – have to pause regular pipeline

Also have to reprocess log files in batches of full days

LESSONS LEARNED


DATA MODEL CHANGES ARE PAINFUL

… so design to make them less so


EVERYTHING WILL BREAK

… so test accordingly


SEPARATE LOGICALLY DIFFERENT DATA

… it will improve performance AND make your life simpler


PERF TEST WITH PRODUCTION LOAD

… if you can afford a second cluster

THANK YOU