Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Spark Job Server, by Evan Chan and Kelvin Chu

Description

This was a talk that Kelvin Chu and I just gave at the SF Bay Area Spark Meetup 5/14 at Palantir Technologies. We discussed the Spark Job Server (http://github.com/ooyala/spark-jobserver), its history, example workflows, architecture, and exciting future plans to provide HA Spark job contexts. We also discussed the use case of the job server at Ooyala to facilitate fast query jobs using shared RDDs and a shared job context, and how we integrate with Apache Cassandra.

Transcript of Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Page 1: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Spark Job Server
Evan Chan and Kelvin Chu

Page 2: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Overview

• REST API for Spark jobs and contexts. Easily operate Spark from any language or environment (see the client sketch after this list).

• Runs jobs in their own contexts, or shares one context among many jobs

• Great for sharing cached RDDs across jobs and low-latency jobs

• Works with Standalone, Mesos, any Spark config

• Jars, job history and config are persisted via a pluggable API

• Async and sync API, JSON job results
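
To make the "any language or environment" point concrete, here is a minimal illustrative client sketch (not from the talk): a plain JDK HTTP call from Scala with no Spark dependency, mirroring the curl examples shown later in the deck (appName=demo, classPath=WordCountExample, input.string config body).

import java.net.{HttpURLConnection, URL}
import scala.io.Source

object SubmitJobSketch extends App {
  // Hypothetical client: POST a job config body to the job server's /jobs route,
  // asking for a synchronous run so the JSON result comes back in the response.
  val url = new URL("http://localhost:8090/jobs?appName=demo&classPath=WordCountExample&sync=true")
  val conn = url.openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setDoOutput(true)
  conn.getOutputStream.write("input.string = A lazy dog jumped mean dog".getBytes("UTF-8"))
  println(Source.fromInputStream(conn.getInputStream).mkString)  // JSON job result
}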

Page 3: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

http://github.com/ooyala/spark-jobserver

Open Source!!

Page 4: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

History

Page 5: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)


Founded in 2007

Commercially launched in 2009

300+ employees in Silicon Valley, LA, NYC, London, Paris, Tokyo, Sydney & Guadalajara

Global footprint, 200M unique users, 110+ countries, and more than 6,000 websites

Over 1 billion videos played per month and 2 billion analytic events per day

25% of U.S. online viewers watch video powered by Ooyala

Ooyala, Inc.

Page 6: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Spark at Ooyala

• Started investing in Spark at the beginning of 2013

• Developers loved it, promise of a unifying platform

• 2 teams of developers building on Spark

• Actively contributing to the Spark community

• Largest Spark cluster has > 100 nodes

• Spark community very active, huge amount of interest

Page 7: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

From raw logs to fast queries

[Diagram: raw log files are processed by Spark jobs into materialized views (View 1, View 2, View 3) stored in a Cassandra (C*) columnar store; Spark serves predefined queries and Shark serves ad-hoc HiveQL against those views]

Page 8: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Our Spark/Shark/Cassandra Stack

[Diagram: each node (Node1, Node2, Node3) co-locates Cassandra, a SerDe, a Spark Worker, and Shark; the Spark Master and the Job Server run alongside the cluster]

Page 9: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Why We Needed a Job Server

• Our vision for Spark is as a multi-team big data service

• What gets repeated by every team:

• Bastion box for running Hadoop/Spark jobs

• Deployment and process monitoring

• Tracking and serializing job status, progress, and job results

• Job validation

• No easy way to kill jobs

• Polyglot technology stack (Ruby scripts run jobs, Go services)

Page 10: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Example Workflow

Page 11: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Creating a Job Server Project

✤ In your build.sbt, add this:

resolvers += "Ooyala Bintray" at "http://dl.bintray.com/ooyala/maven"
libraryDependencies += "ooyala.cnd" % "job-server" % "0.3.1" % "provided"

✤ sbt assembly -> fat jar -> upload to job server

✤ The "provided" scope keeps sbt assembly from bundling the job server jar itself

✤ Java projects should be possible too

Page 12: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Example Job Server Job

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import scala.util.Try
import spark.jobserver._  // SparkJob, SparkJobValid, SparkJobInvalid, SparkJobValidation (package name assumed for the job server of this era)

/**
 * A super-simple Spark job example that implements the SparkJob trait and
 * can be submitted to the job server.
 */
object WordCountExample extends SparkJob {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = {
    Try(config.getString("input.string"))
      .map(x => SparkJobValid)
      .getOrElse(SparkJobInvalid("No input.string"))
  }

  override def runJob(sc: SparkContext, config: Config): Any = {
    val dd = sc.parallelize(config.getString("input.string").split(" ").toSeq)
    dd.map((_, 1)).reduceByKey(_ + _).collect().toMap
  }
}

Page 13: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

What’s Different?

• Job does not create Context, Job Server does

• Decide whether the job runs in its own context or in a pre-created context

• Upload new jobs to diagnose your RDD issues:

• POST /contexts/newContext

• POST /jobs .... context=newContext

• Upload a new diagnostic jar... POST /jars/newDiag

• Run the diagnostic jar to dump info on cached RDDs

Page 14: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Submitting and Running a Job

✦ curl --data-binary @../target/mydemo.jar localhost:8090/jars/demo
OK

✦ curl -d "input.string = A lazy dog jumped mean dog" 'localhost:8090/jobs?appName=demo&classPath=WordCountExample&sync=true'
{
  "status": "OK",
  "RESULT": {
    "lazy": 1,
    "jumped": 1,
    "A": 1,
    "mean": 1,
    "dog": 2
  }
}

Page 15: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Retrieve Job Statuses

curl 'localhost:8090/jobs?limit=2'
[{
  "duration": "77.744 secs",
  "classPath": "ooyala.cnd.CreateMaterializedView",
  "startTime": "2013-11-26T20:13:09.071Z",
  "context": "8b7059dd-ooyala.cnd.CreateMaterializedView",
  "status": "FINISHED",
  "jobId": "9982f961-aaaa-4195-88c2-962eae9b08d9"
}, {
  "duration": "58.067 secs",
  "classPath": "ooyala.cnd.CreateMaterializedView",
  "startTime": "2013-11-26T20:22:03.257Z",
  "context": "d0a5ebdc-ooyala.cnd.CreateMaterializedView",
  "status": "FINISHED",
  "jobId": "e9317383-6a67-41c4-8291-9c140b6d8459"
}]

Page 16: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Use Case: Fast Query Jobs

Page 17: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Spark as a Query Engine

✤ Goal: Spark jobs that run in under a second and answer queries on shared RDD data (a hypothetical query job sketch follows this list)

✤ Query params passed in as job config

✤ Need to minimize context creation overhead

✤ Thus many jobs sharing the same SparkContext

✤ On-heap RDD caching means no serialization loss

✤ Need to consider concurrent jobs (fair scheduling)
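
As a hedged sketch only (the job name, data layout, and cached-RDD lookup are assumptions, not from the talk; imports match the earlier WordCountExample), a query job in this pattern reads its parameters from the job config and answers against data already cached in the shared context, so no context startup or data load is paid per query:

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.util.Try
import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

// Hypothetical query job: an earlier load job cached an RDD of (customerId, value)
// pairs in this shared context.
object CustomerTotalsQuery extends SparkJob {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    Try(config.getString("customerId")).map(_ => SparkJobValid)
      .getOrElse(SparkJobInvalid("No customerId"))

  override def runJob(sc: SparkContext, config: Config): Any = {
    val customerId = config.getString("customerId")
    // Grab a cached RDD from the shared context (assumes one exists); a real job
    // would look it up by name, as on the NamedRdds slide later in the deck.
    val cached = sc.getPersistentRDDs.values.head.asInstanceOf[RDD[(String, Double)]]
    cached.filter(_._1 == customerId).map(_._2).fold(0.0)(_ + _)
  }
}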

Page 18: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

LOW-LATENCY QUERY JOBS

[Diagram: the REST Job Server creates a query context (new SparkContext) on the Spark executors, a load job pulls some data from Cassandra into a cached RDD, and subsequent query jobs answer queries against that RDD and return results]

Page 19: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Sharing Data Between Jobs

✤ RDD Caching

✤ Benefit: no need to serialize data. Especially useful for indexes etc.

✤ Job server provides a NamedRdds trait for thread-safe CRUD of cached RDDs by name (a sketch follows this list)

✤ (Compare to SparkContext's API, which uses an integer ID and is not thread safe)

✤ For example, at Ooyala a number of fields are multiplexed into the RDD name: timestamp:customerID:granularity
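
A minimal sketch of the named-RDD pattern, assuming the NamedRddSupport mixin and a getOrElseCreate method as described in the job server README of that era (the object name, data, and RDD name below are made up):

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

object NamedRddSketch extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    // RDD name multiplexes several fields, Ooyala-style: timestamp:customerID:granularity
    val name = "1400000000:customer42:daily"
    // Creates and caches the RDD the first time; later jobs in the same context reuse it by name.
    val rdd = namedRdds.getOrElseCreate(name, sc.parallelize(1 to 1000))
    rdd.filter(_ % 2 == 0).count()
  }
}

Because creation and lookup both go through the cache, a second job asking for the same name while the first is still building it simply waits and then sees the finished RDD, which is the concurrency behaviour described on the next slide.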

Page 20: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Data Concurrency

✤ Single writer, multiple readers

✤ Managing multiple updates to RDDs

✤ The cache keeps track of which RDDs are being updated

✤ Example: thread A's Spark job creates RDD "A" at t0

✤ Thread B fetches RDD "A" at t1 > t0

✤ Both threads A and B, using NamedRdds, get the RDD at time t2, when thread A finishes creating RDD "A"

Page 21: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Using Tachyon

Pros:
- Off-heap storage: no GC
- Can be shared across multiple processes
- Data can survive process loss
- Backed by HDFS

Cons:
- ByteBuffer API: need to pay deserialization cost
- Does not support random access writes

Page 22: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Architecture

Page 23: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Completely Async Design

✤ http://spray.io - probably the fastest JVM HTTP microframework

✤ Akka Actor based, non-blocking

✤ Futures used to manage individual jobs (note that Spark now uses Scala futures to manage job stages); a toy sketch follows this list

✤ Single JVM for now, but easy to distribute later via remote Actors / Akka Cluster
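
A toy sketch of the futures-per-job idea (illustrative only, not the job server's actual code): each submitted job runs in a Future on its own thread pool, so the Spray/Akka HTTP layer never blocks, and completion is reported asynchronously, much as the status and result actors would be notified.

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

object AsyncJobSketch extends App {
  // Dedicated pool for job execution, separate from the HTTP threads.
  implicit val jobPool: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))

  def runJobAsync(jobId: String)(body: => Any): Future[Any] = {
    val result = Future(body)                // the job runs off-thread
    result.onComplete {                      // a status/result actor would be told here
      case Success(r)  => println(s"$jobId FINISHED: $r")
      case Failure(ex) => println(s"$jobId ERROR: ${ex.getMessage}")
    }
    result
  }

  runJobAsync("job-1") { (1 to 10).sum }     // toy "job"
  Thread.sleep(500)                          // let the callback fire before the app exits
}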

Page 24: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Async Actor Flow

[Diagram: async actor flow from the Spray web API through a Request actor, Local Supervisor, and Job Manager, which runs per-job futures (Job 1 Future, Job 2 Future) reporting to the Job Status Actor and Job Result Actor]

Page 25: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Message flow fully documented

Page 26: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Production Usage

Page 27: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Metadata Store

✤ JarInfo, JobInfo, ConfigInfo

✤ JobSqlDAO: stores metadata in a SQL database via the JDBC interface

✤ Easily configured by spark.sqldao.jdbc.url, e.g. jdbc:mysql://dbserver:3306/jobserverdb (see the snippet after this list)

✤ Multiple job servers can share the same MySQL database

✤ Jars are uploaded once but accessible by all servers

✤ The default will be JobSqlDAO and H2

✤ Single H2 DB file; serialization and deserialization are handled by H2
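
A small illustrative snippet, shown in Typesafe Config form since the job server uses it elsewhere; only the spark.sqldao.jdbc.url key and the MySQL URL come from the slide, the surrounding code is an assumption:

import com.typesafe.config.ConfigFactory

object DaoConfigSketch extends App {
  // Point the SQL DAO at a shared MySQL instance instead of the default H2 file.
  val conf = ConfigFactory.parseString(
    "spark.sqldao.jdbc.url = \"jdbc:mysql://dbserver:3306/jobserverdb\"")
  println(conf.getString("spark.sqldao.jdbc.url"))
}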

Page 28: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Deployment and Metrics

✤ The spark-jobserver repo comes with a full suite of tests and deploy scripts:

✤ server_deploy.sh for regular server pushes

✤ server_package.sh for Mesos and Chronos .tar.gz

✤ /metricz route for codahale-metrics monitoring

✤ /healthz route for health checks

Page 29: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Challenges and Lessons

• Spark is based around contexts - we need a Job Server oriented around logical jobs

• Running multiple SparkContexts in the same process

• Global use of System properties makes it impossible to start multiple contexts at the same time (but see pull request...)

• Have to be careful with SparkEnv

• Dynamic jar and class loading is tricky

• Manage threads carefully - each context uses lots of threads

Page 30: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Future Work

Page 31: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Future Plans

✤ Spark-contrib project list, so this and other projects can gain visibility (SPARK-1283)

✤ HA mode using Akka Cluster or Mesos

✤ HA and Hot Failover for Spark Drivers/Contexts

✤ REST API for job progress

✤ Swagger API documentation

Page 32: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

HA and Hot Failover for Jobs

[Diagram: Job Server 1 runs the active job context, checkpointing to HDFS; Job Server 2 holds a standby job context; the two servers share state via gossip]

✤ If the job context dies:

✤ Job Server 2 notices, spins up the standby context, and restores the checkpoint

Page 33: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Thanks for your contributions!

✤ All of these were community contributed:

✤ index.html main page

✤ saving and retrieving job configuration

✤ Your contributions are very welcome on Github!

Page 34: Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)

Thank you!

And Everybody is Hiring!!