
How Conviva used Spark to Speed Up Video Analytics by 30x

Dilip Antony Joseph (@DilipAntony)

Conviva monitors and optimizes online video for premium content providers

What happens if you don’t use Conviva?

65 million video streams a day

Conviva data processing architecture

[Architecture diagram: Video Player data flows into Live data processing, which drives the Monitoring Dashboard, and into Hadoop for historical data; Spark runs over the historical data to produce Reports and Ad-hoc analysis.]

Group By queries dominate our Reporting workload

10s of metrics, 10s of group-bys

SELECT videoName, COUNT(1) FROM summaries WHERE date='2012_08_22' AND customer='XYZ' GROUP BY videoName;

Group By query in Spark

// Read session summaries stored as Hadoop sequence files; fieldsOfInterest
// projects each record down to just the columns the reports need.
val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable](
    pathToSessionSummaryOnHdfs,
    classOf[SessionSummary], classOf[NullWritable])
  .flatMap { case (summary, _) => summary.fieldsOfInterest }

// Keep only the desired day's sessions, and cache them in memory.
val cachedSessions = sessions.filter(
    whereConditionToFilterSessionsForTheDesiredDay)
  .cache

val mapFn: SessionSummary => (String, Long) = { s => (s.videoName, 1L) }
val reduceFn: (Long, Long) => Long = { (a, b) => a + b }

val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap
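Note that .cache is what makes the subsequent group-bys cheap: the filtered day of sessions stays in memory and is reused by every report query. collectAsMap brings the final counts back to the driver, which is reasonable here because the result has only one entry per video name.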

Spark is 30x faster than Hive

45 minutes versus 24 hours for the weekly Conviva Geo Report

How much data are we talking about?

• 150 GB/week of compressed summary data
• Compressed ~3:1
• Stored as Hadoop Sequence Files
• Custom binary serialization format
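The slides don't show the serialization code itself. As a rough illustration only, a custom binary format stored in Hadoop sequence files usually means implementing Hadoop's Writable interface; the fields below are hypothetical, not Conviva's actual schema.

import java.io.{DataInput, DataOutput}
import org.apache.hadoop.io.Writable

// Hypothetical sketch: a session summary that serializes itself into a
// compact binary form for storage in Hadoop sequence files.
class SessionSummary extends Writable {
  var videoName: String = ""
  var country: String = ""
  var bufferingRatio: Double = 0.0

  // Compact binary encoding written to the sequence file.
  override def write(out: DataOutput): Unit = {
    out.writeUTF(videoName)
    out.writeUTF(country)
    out.writeDouble(bufferingRatio)
  }

  // Decoding must read fields in exactly the order they were written.
  override def readFields(in: DataInput): Unit = {
    videoName = in.readUTF()
    country = in.readUTF()
    bufferingRatio = in.readDouble()
  }
}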

Spark is faster because it avoids reading data from disk multiple times

[Diagram: with Hive/MapReduce, each query (Group By Country, Group By State, Group By Video, ... 10s of Group Bys) repeats the full Read from HDFS, Decompress, Deserialize pipeline, and also pays Hive/MapReduce startup overhead and the overhead of flushing intermediate data to disk.]

[Diagram: with Spark, Read from HDFS, Decompress, and Deserialize happen once, and only the columns of interest are cached in memory; Group By Country, Group By State, Group By Video, and the remaining 10s of Group Bys all read data from memory.]
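A minimal sketch of this pattern, reusing the cachedSessions RDD from the group-by slide (country and state are assumed field names on SessionSummary, not from the talk):

// One pass over HDFS fills the cache; every group-by after that is a pure
// in-memory aggregation.
val byCountry = cachedSessions.map(s => (s.country, 1L)).reduceByKey(_ + _).collectAsMap
val byState   = cachedSessions.map(s => (s.state, 1L)).reduceByKey(_ + _).collectAsMap
val byVideo   = cachedSessions.map(s => (s.videoName, 1L)).reduceByKey(_ + _).collectAsMap
// ...and so on for the remaining group-bys, all served from memory.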

Why not <some other big data system>?

• Hive
• MySQL
• Oracle / SAP HANA
• Column-oriented DBs

Spark just worked for us

• 30x speed-up
• Great fit for our report computation model
• Open-source
• Flexible
• Scalable

30% of our reports use Spark

We are working on more projects that use Spark

• Streaming Spark for unifying batch and live computation (a rough sketch follows this list)

• SHadoop: run existing Hadoop jobs on Spark
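The talk doesn't include streaming code. As a hedged illustration of what unifying batch and live computation can look like, here is a minimal Spark Streaming sketch that reuses the same count-per-video logic on live micro-batches; the socket source, host, port, and record layout are all assumptions.

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical sketch: the same per-video counting as the batch report,
// applied to 10-second micro-batches of live session data.
val ssc = new StreamingContext(sparkContext, Seconds(10))

// Assumed record layout: comma-separated, video name first.
def parseVideoName(line: String): String = line.split(',')(0)

val liveCounts = ssc.socketTextStream("collector-host", 9999) // assumed source
  .map(line => (parseVideoName(line), 1L))
  .reduceByKey(_ + _)

liveCounts.print() // emit per-batch counts
ssc.start()
ssc.awaitTermination()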

Problems with Spark

[The architecture diagram again (Video Player, Live data processing, Monitoring Dashboard, Hadoop for historical data, Spark, Reports, Ad-hoc analysis), with the Spark box annotated "No Logo": every component has a logo except Spark.]


Spark queries are not very succinct

SELECT videoName, COUNT(1) FROM summaries WHERE date='2012_08_22' AND customer='XYZ' GROUP BY videoName;

val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable](
    pathToSessionSummaryOnHdfs,
    classOf[SessionSummary], classOf[NullWritable])
  .flatMap { case (summary, _) => summary.fieldsOfInterest }

val cachedSessions = sessions.filter(
    whereConditionToFilterSessionsForTheDesiredDay)
  .cache

val mapFn: SessionSummary => (String, Long) = { s => (s.videoName, 1L) }
val reduceFn: (Long, Long) => Long = { (a, b) => a + b }

val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap


There is a learning curve associated with Scala, but ...

• Type-safety offered by Scala is a great boon
  – Code completion via the Eclipse Scala plugin
• Complex queries are easier in Scala than in Hive
  – Cascading IF()s in Hive (see the sketch after this list)
• No need to write Scala in the future
  – Shark
  – Java and Python APIs
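For a concrete feel of the cascading-IF point (the thresholds and fields here are invented for illustration, not from the talk): bucketing sessions by buffering ratio needs nested IF()s in HiveQL, while in Scala it is an ordinary guarded match.

// HiveQL needs nested IF()s:
//   SELECT IF(bufferingRatio > 0.1, 'bad',
//          IF(bufferingRatio > 0.02, 'ok', 'good')) FROM summaries;
// The same bucketing in Scala:
def qualityBucket(bufferingRatio: Double): String = bufferingRatio match {
  case r if r > 0.1  => "bad"
  case r if r > 0.02 => "ok"
  case _             => "good"
}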

Additional Hiccups

• Always on the bleeding edge
  – Getting dependencies right
• More maintenance/debugging tools required

Spark has been working great for Conviva for over a year

We are hiring: jobs@conviva.com

http://www.conviva.com/blog/engineering/using-spark-and-hive-to-process-bigdata-at-conviva