Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Real-time Analytics with Cassandra, Spark and Shark Saturday, December 7, 13

What You Will Learn At This Meetup:
• Review of Cassandra analytics landscape: Hadoop & HIVE
• Custom input formats to extract data from Cassandra
• How Spark & Shark increase query speed & productivity over standard solutions

Abstract
This session covers our experience with using the Spark and Shark frameworks for running real-time queries on top of Cassandra data. We will start by surveying the current Cassandra analytics landscape, including Hadoop and HIVE, and touch on the use of custom input formats to extract data from Cassandra. We will then dive into Spark and Shark, two memory-based cluster computing frameworks, and how they enable often dramatic improvements in query speed and productivity, over the standard solutions today.

About Evan Chan
Evan Chan is a Software Engineer at Ooyala. In his own words: I love to design, build, and improve bleeding edge distributed data and backend systems using the latest in open source technologies. I am a big believer in GitHub, open source, and meetups, and have given talks at conferences such as the Cassandra Summit 2013.

South Bay Cassandra Meetup URL: http://www.meetup.com/DataStax-Cassandra-South-Bay-Users/events/147443722/

Transcript of Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Page 1: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Real-time Analytics with Cassandra, Spark and Shark

Saturday, December 7, 13

Page 2: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Who is this guy
• Staff Engineer, Compute and Data Services, Ooyala
• Building multiple web-scale real-time systems on top of C*, Kafka, Storm, etc.
• Scala/Akka guy
• github.com/velvia
• @evanfchan

Page 3: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Agenda
• Ooyala and Cassandra
• What problem are we trying to solve?
• Spark and Shark
• Our Spark/Cassandra Architecture
• Demo

Page 4: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Cassandra at Ooyala
Who is Ooyala, and how we use Cassandra

Page 5: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

CONFIDENTIAL—DO NOT DISTRIBUTE

OOYALA
Powering personalized video experiences across all screens.

Page 6: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

COMPANY OVERVIEW
• Founded in 2007
• Commercially launched in 2009
• 300 employees in Silicon Valley, LA, NYC, London, Paris, Tokyo, Sydney & Guadalajara
• Global footprint: 200M unique users, 110+ countries, and more than 6,000 websites
• Over 1 billion videos played per month and 2 billion analytic events per day
• 25% of U.S. online viewers watch video powered by Ooyala

Page 7: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

TRUSTED VIDEO PARTNER

STRATEGIC PARTNERS

CUSTOMERS

Page 8: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

We are a large Cassandra user
• 12 clusters ranging in size from 3 to 107 nodes
• Total of 28TB of data managed over ~220 nodes
• Over 2 billion C* column writes per day
• Powers all of our analytics infrastructure
• DSE/C* 1.0.x, 1.1.x, 1.2.6
• Large prod cluster is one of the biggest Cassandra installations

Page 9: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

What problem are we trying to solve?
Lots of data, complex queries, answered really quickly... but how??

Page 10: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

From mountains of raw data...

Page 11: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

To nuggets of truth...

• Quickly
• Painlessly
• At scale?

Page 12: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Today: Precomputed aggregates
• Video metrics computed along several high cardinality dimensions
• Very fast lookups, but inflexible, and hard to change
• Most computed aggregates are never read
• What if we need more dynamic queries?
  – Top content for mobile users in France
  – Engagement curves for users who watched recommendations
  – Data mining, trends, machine learning

Page 13: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

The static - dynamic continuum

100% Precomputation
• Super fast lookups
• Inflexible, wasteful
• Best for 80% most common queries

100% Dynamic
• Always compute results from raw data
• Flexible but slow

Page 14: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Where we want to be

Partly dynamic

• Pre-aggregate most common queries

• Flexible, fast dynamic queries

• Easily generate many materialized views
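
The "partly dynamic" position can be sketched in a few lines of plain Python (one of the languages Spark supports): serve the most common query from a precomputed aggregate and compute everything else from raw events. The event fields, the `PRECOMPUTED` table, and the `plays` function are hypothetical names chosen for illustration; this is a sketch of the idea, not Ooyala's implementation.

```python
from collections import Counter

# Hypothetical raw events; field names are illustrative only.
EVENTS = [
    {"video": 10, "country": "FR", "device": "mobile"},
    {"video": 10, "country": "FR", "device": "desktop"},
    {"video": 11, "country": "FR", "device": "mobile"},
    {"video": 10, "country": "US", "device": "mobile"},
]

# Pre-aggregate the most common query ahead of time: plays per video.
PRECOMPUTED = Counter(e["video"] for e in EVENTS)


def plays(video, **filters):
    """Return the play count for a video, optionally filtered by dimensions."""
    if not filters:
        # Common query: constant-time lookup in the precomputed aggregate.
        return PRECOMPUTED[video]
    # Uncommon query: compute the answer dynamically from raw events.
    return sum(
        1
        for e in EVENTS
        if e["video"] == video and all(e[k] == v for k, v in filters.items())
    )


common = plays(10)
dynamic = plays(10, country="FR", device="mobile")
```

The unfiltered call never touches the raw events, while the filtered call pays for a full scan; the rest of the talk is about making that second path fast.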

Page 15: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Industry Trends
• Fast execution frameworks
  – Impala, Drill, Presto
• In-memory databases
  – VoltDB, Druid
• Streaming and real-time
• Higher-level, productive data frameworks
  – Cascading, Hive, Pig

Page 16: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Why Spark and Shark?
“Lightning-fast in-memory cluster computing”

Page 17: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Introduction to Spark

• In-memory distributed computing framework
• Created by UC Berkeley AMP Lab in 2010
• Targeted problems that MR is bad at:
  – Iterative algorithms (machine learning)
  – Interactive data mining
• More general purpose than Hadoop MR
• Active contributions from ~ 15 companies
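
To illustrate why iterative algorithms benefit from keeping data in memory, here is a small plain-Python sketch that counts how often the input is scanned with and without caching. The `Source` class and both functions are hypothetical stand-ins; no Spark or Hadoop API is used, and the "algorithm" is a toy that nudges an estimate toward the mean.

```python
class Source:
    """Stand-in for a slow data source such as HDFS; counts full scans."""

    def __init__(self, values):
        self._values = list(values)
        self.scans = 0

    def read(self):
        self.scans += 1
        return list(self._values)


def iterate_without_cache(source, steps):
    """Re-read the source on every iteration."""
    estimate = 0.0
    for _ in range(steps):
        data = source.read()
        # Move the estimate halfway toward the mean of the data.
        estimate += 0.5 * (sum(data) / len(data) - estimate)
    return estimate


def iterate_with_cache(source, steps):
    """Read the source once and keep the dataset in memory across iterations."""
    data = source.read()
    estimate = 0.0
    for _ in range(steps):
        estimate += 0.5 * (sum(data) / len(data) - estimate)
    return estimate


uncached = Source([1, 2, 3, 4])
cached = Source([1, 2, 3, 4])
result_a = iterate_without_cache(uncached, steps=10)
result_b = iterate_with_cache(cached, steps=10)
```

Both versions produce the same estimate, but the uncached one scans the source once per iteration while the cached one scans it once in total.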

Page 18: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

[Diagram: a Hadoop MapReduce chain (HDFS, Map, Reduce, Map, Reduce) compared with a Spark pipeline (Data Source, map(), join() with Source 2, cache(), transform)]

Page 19: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Throughput: Memory is king

[Bar chart: throughput for C* with cold cache, C* with warm cache, and Spark RDD; axis from 0 to 150,000]

6-node C*/DSE 1.1.9 cluster, Spark 0.7.0

Page 20: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Developers love it

• “I wrote my first aggregation job in 30 minutes”
• High level “distributed collections” API
• No Hadoop cruft
• Full power of Scala, Java, Python
• Interactive REPL shell
• EASY testing!!
• Low latency - quick development cycles

Page 21: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Spark word count example

val file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }

}
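
For readers without a cluster at hand, the three steps of the Spark version (flatMap, map, reduceByKey) can be mimicked on an ordinary list in plain Python. This is only a local illustration of the data flow, not Spark code; `word_count` is a name made up for this sketch.

```python
from collections import defaultdict
from itertools import chain


def word_count(lines):
    """Count words using the same three steps as the Spark example."""
    # flatMap: split every line into words and flatten into one stream.
    words = chain.from_iterable(line.split(" ") for line in lines)
    # map: pair each word with a count of 1.
    pairs = ((word, 1) for word in words)
    # reduceByKey: sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


counts = word_count(["to be or", "not to be"])
```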

Page 22: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

The Spark Ecosystem

• Bagel - Pregel on Spark
• Shark - HIVE on Spark
• Spark Streaming - discretized stream processing
• Spark
• Tachyon - in-memory caching DFS

Page 23: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Shark - HIVE on Spark

• 100% HiveQL compatible
• 10-100x faster than HIVE, answers in seconds
• Reuse UDFs, SerDe’s, StorageHandlers
• Can use DSE / CassandraFS for Metastore
• Easy Scala/Java integration via Spark - easier than writing UDFs

Page 24: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Our new analytics architecture
How we integrate Cassandra and Spark/Shark

Page 25: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

From raw events to fast queries

[Diagram: Ingestion feeds the C* event store; raw events are read by Spark jobs that build View 1, View 2, and View 3; the views are served through Spark (predefined queries) and Shark (ad-hoc HiveQL)]

Page 26: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Our Spark/Shark/Cassandra Stack

[Diagram: three nodes (Node1, Node2, Node3), each running Cassandra, InputFormat, SerDe, Spark Worker, and Shark, together with a Spark Master and a Job Server]

Page 27: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Event Store Cassandra schema

Event CF

Row key                 t0            t1            t2            t3            t4
2013-04-05T00:00Z#id1   {event0: a0}  {event1: a1}  {event2: a2}  {event3: a3}  {event4: a4}

EventAttr CF

Row key                 Columns
2013-04-05T00:00Z#id1   ipaddr:10.20.30.40:t1   videoId:45678:t1   providerId:500:t0
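
One way to read this schema is as two wide-row maps that share a row key: Event CF keeps one column per timestamp, and EventAttr CF keeps composite column names of the form attribute:value:timestamp that point back at the event. The plain-Python sketch below models that reading with nested dicts. The interpretation itself, the helper names, and the empty column values are assumptions made for illustration; the slides do not spell out these details.

```python
from collections import defaultdict

# Each column family maps row key -> {column name: column value}.
event_cf = defaultdict(dict)
event_attr_cf = defaultdict(dict)


def row_key(bucket, user_id):
    """Build a row key such as '2013-04-05T00:00Z#id1'."""
    return f"{bucket}#{user_id}"


def write_event(bucket, user_id, ts, event, attrs):
    """Store the event under its timestamp and index each attribute."""
    key = row_key(bucket, user_id)
    event_cf[key][ts] = event
    for name, value in attrs.items():
        # Assumed composite column name 'attr:value:timestamp', empty value.
        event_attr_cf[key][f"{name}:{value}:{ts}"] = ""


def timestamps_for(key, name, value):
    """Find the timestamps of events in a row that carry a given attribute."""
    prefix = f"{name}:{value}:"
    return sorted(
        col[len(prefix):] for col in event_attr_cf[key] if col.startswith(prefix)
    )


write_event("2013-04-05T00:00Z", "id1", "t0", {"event0": "a0"}, {"providerId": 500})
write_event(
    "2013-04-05T00:00Z", "id1", "t1", {"event1": "a1"},
    {"ipaddr": "10.20.30.40", "videoId": 45678},
)
key = row_key("2013-04-05T00:00Z", "id1")
hits = [event_cf[key][ts] for ts in timestamps_for(key, "videoId", 45678)]
```

Under this reading, a lookup by attribute is a column-name prefix scan within one row, followed by a read of the matching timestamps from Event CF.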

Page 31: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Unpacking raw events

Raw events:

Row key                 t0                    t1
2013-04-05T00:00Z#id1   {video: 10, type:5}   {video: 11, type:1}
2013-04-05T00:00Z#id2   {video: 20, type:5}   {video: 25, type:9}

Unpacked rows:

UserID  Video  Type
id1     10     5
id1     11     1
id2     20     5
id2     25     9
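
The unpacking step is essentially a flat-map from wide rows to flat tuples. A minimal plain-Python sketch using the values from the slide follows; the `RAW_ROWS` dict and the `unpack` function are illustrative names, and taking the user ID from the part of the row key after `#` is an assumption based on the table above.

```python
RAW_ROWS = {
    "2013-04-05T00:00Z#id1": {
        "t0": {"video": 10, "type": 5},
        "t1": {"video": 11, "type": 1},
    },
    "2013-04-05T00:00Z#id2": {
        "t0": {"video": 20, "type": 5},
        "t1": {"video": 25, "type": 9},
    },
}


def unpack(row_key, columns):
    """Yield one (user_id, video, type) tuple per event column in a wide row."""
    user_id = row_key.split("#", 1)[1]
    for ts in sorted(columns):
        event = columns[ts]
        yield (user_id, event["video"], event["type"])


# Flat-map over all rows: every wide row contributes several flat rows.
rows = [t for key in sorted(RAW_ROWS) for t in unpack(key, RAW_ROWS[key])]
```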

Page 32: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Options for Spark/Cassandra Integration
• Hadoop InputFormat
  – ColumnFamilyInputFormat - reads all rows from 1 CF
  – CqlPagingInputFormat, etc. - CQL3, secondary indexes
  – Roll your own (join multiple CFs, etc)
• Spark native RDD
  – sc.parallelize(rowkeys).flatMap(readColumns(_))
  – JdbcRdd + Cassandra JDBC driver
• http://tuplejump.github.io/calliope/
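
The `sc.parallelize(rowkeys).flatMap(readColumns(_))` pattern distributes a list of row keys and lets each worker read the columns for its keys. The plain-Python sketch below imitates that shape on one machine with a thread pool. `FAKE_STORE`, `read_columns`, and `parallel_flat_map` are hypothetical stand-ins; a real `readColumns` would issue Cassandra reads.

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for a column family: row key -> {column: value}.
FAKE_STORE = {
    "row1": {"t0": "a", "t1": "b"},
    "row2": {"t0": "c"},
    "row3": {},
}


def read_columns(row_key):
    """Hypothetical reader: return (row key, column, value) triples for one row."""
    return [(row_key, col, val) for col, val in sorted(FAKE_STORE[row_key].items())]


def parallel_flat_map(row_keys, fn, workers=2):
    """Apply fn to each row key in parallel and flatten the results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [item for chunk in pool.map(fn, row_keys) for item in chunk]


triples = parallel_flat_map(["row1", "row2", "row3"], read_columns)
```

Rows with no columns simply contribute nothing to the output, which is the behavior that distinguishes a flat-map from a plain map.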

Page 33: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Tips for InputFormat Development
• Know which target platforms you are developing for
  – Which API to write against? New? Old? Both?
• Be prepared to spend time tuning your split computation
  – Low latency jobs require fast splits
• Consider sorting row keys by token for data locality
• Implement predicate pushdown for HIVE SerDe’s
  – Use your indexes to reduce size of dataset
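
As a rough illustration of the "sort row keys by token" tip, the plain-Python sketch below orders keys by a hash-derived token and cuts them into contiguous splits, so each split covers a narrow token range. The `token` function is a simplified stand-in that reads an MD5 digest as an integer; it is an assumption for illustration and not the exact computation of any Cassandra partitioner. `make_splits` is likewise a made-up helper.

```python
import hashlib


def token(row_key):
    """Illustrative token: the MD5 digest of the key read as an integer."""
    return int.from_bytes(hashlib.md5(row_key.encode("utf-8")).digest(), "big")


def make_splits(row_keys, num_splits):
    """Sort keys by token, then cut them into contiguous, near-equal splits."""
    ordered = sorted(row_keys, key=token)
    size, extra = divmod(len(ordered), num_splits)
    splits, start = [], 0
    for i in range(num_splits):
        end = start + size + (1 if i < extra else 0)
        splits.append(ordered[start:end])
        start = end
    return splits


keys = [f"2013-04-05T00:00Z#id{i}" for i in range(10)]
splits = make_splits(keys, 3)
```

Because the split boundaries are computed from an in-memory sort, this stays cheap, which matters when low-latency jobs cannot afford slow split computation.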

Page 34: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Example: OLAP processing

[Diagram: C* events (row 2013-04-05T00:00Z#id1 with t0 = {video: 10, type:5}; row 2013-04-05T00:00Z#id2 with t0 = {video: 20, type:5}) are processed by Spark into OLAP aggregates, held as cached materialized views; Spark then unions the aggregates to answer Query 1: Plays by Provider and Query 2: Top content for mobile]
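
The flow in this diagram (aggregate raw events once, keep the aggregate in memory, answer several queries from it) can be sketched in plain Python. The event fields, the chosen dimensions, and the `aggregate` and `rollup` helpers are all assumptions for illustration; the slides do not show the actual aggregate layout.

```python
from collections import Counter

# Hypothetical events; 'provider' and 'device' are assumed dimensions.
EVENTS = [
    {"video": 10, "provider": 500, "device": "mobile"},
    {"video": 10, "provider": 500, "device": "desktop"},
    {"video": 20, "provider": 501, "device": "mobile"},
    {"video": 20, "provider": 501, "device": "mobile"},
]


def aggregate(events, dimensions):
    """Build a materialized view: play counts keyed by a tuple of dimensions."""
    return Counter(tuple(e[d] for d in dimensions) for e in events)


def rollup(view, dimensions, keep):
    """Answer a query by summing the cached view down to fewer dimensions."""
    idx = [dimensions.index(d) for d in keep]
    out = Counter()
    for key, count in view.items():
        out[tuple(key[i] for i in idx)] += count
    return out


DIMS = ("provider", "video", "device")
view = aggregate(EVENTS, DIMS)  # computed once, then kept in memory

# Query 1: plays by provider.
plays_by_provider = rollup(view, DIMS, ("provider",))
# Query 2: top content for mobile (filter on device, then roll up to video).
mobile = Counter({k: v for k, v in view.items() if k[2] == "mobile"})
top_mobile = rollup(mobile, DIMS, ("video",)).most_common(1)
```

Both queries touch only the small aggregated view rather than the raw events.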

Page 35: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Performance numbers

Spark: C* -> OLAP aggregates, cold cache, 1.4 million events    130 seconds
C* -> OLAP aggregates, warmed cache                              20-30 seconds
OLAP aggregate query via Spark (56k records)                     60 ms

6-node C*/DSE 1.1.9 cluster, Spark 0.7.0

Page 36: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

OLAP WorkFlow

[Diagram: a REST Job Server in front of the Spark Executors, which hold the Dataset and run an Aggregation Job and Query Jobs; Cassandra is the data source; arrows are labeled Aggregate, Query, and Result]

Page 37: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Fault Tolerance
• Cached dataset lives in Java Heap only - what if process dies?
• Spark lineage - automatic recomputation from source, but this is expensive!
• Can also replicate cached dataset to survive single node failures
• Persist materialized views back to C*, then load into cache -- now recovery path is much faster
• Persistence also enables multiple processes to hold cached dataset
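
The two recovery paths described above (expensive recomputation from source versus reloading a persisted view) can be sketched in plain Python. `CachedView` is a hypothetical class, and the `store` dict stands in for a durable store such as C*; Spark's own lineage and replication mechanisms are not modeled here.

```python
class CachedView:
    """In-memory view with two recovery paths: persisted copy or recompute."""

    def __init__(self, compute, store):
        self._compute = compute  # expensive rebuild from raw source data
        self._store = store      # stand-in for a durable store such as C*
        self._cache = None       # lives only in process memory
        self.recomputes = 0

    def get(self):
        if self._cache is None:
            if "view" in self._store:
                # Fast path: reload the persisted materialized view.
                self._cache = dict(self._store["view"])
            else:
                # Slow path: recompute from source, then persist the result.
                self.recomputes += 1
                self._cache = self._compute()
                self._store["view"] = dict(self._cache)
        return self._cache

    def crash(self):
        """Simulate losing the process and its in-memory cache."""
        self._cache = None


store = {}
view = CachedView(lambda: {"plays": 42}, store)
first = view.get()   # slow path: computes and persists
view.crash()
second = view.get()  # fast path: reloads the persisted copy
```

A second `CachedView` sharing the same `store` would also start from the persisted copy, which mirrors the last bullet: persistence lets multiple processes hold the cached dataset.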

Page 38: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Demo time

Page 39: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Shark Demo
• Local shark node, 1 core, MBP
• How to create a table from C* using our inputformat
• Creating a cached Shark table
• Running fast queries

Page 40: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Creating a Shark Table from InputFormat

Page 41: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Creating a cached table

Page 42: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Querying cached table

Page 43: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

THANK YOU
• @evanfchan
• [email protected]
• WE ARE HIRING!!

Page 44: Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Spark: Under the hood

[Diagram: a Driver coordinating several executors, each holding a Dataset and running Map and Reduce tasks]

One executor process per node
