Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Transcript
Page 1: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Apache Spark: Easy and Fast Big Data Analytics

Pat McDonough

Page 2: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Founded by the creators of Apache Spark out of UC Berkeley’s AMPLab

Fully committed to 100% open source Apache Spark

Support and Grow the Spark Community and Ecosystem

Building Databricks Cloud

Page 3: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Databricks & Datastax

Apache Spark is packaged as part of Datastax Enterprise Analytics 4.5

Databricks & Datastax Have Partnered for Apache Spark Engineering and Support

Page 4: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Big Data Analytics: Where We’ve Been

• 2003 & 2004 - Google GFS & MapReduce Papers are Precursors to Hadoop

• 2006 & 2007 - Google BigTable and Amazon Dynamo Papers are Precursors to Cassandra, HBase, and Others

Page 5: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Big Data Analytics: A Zoo of Innovation

Page 6: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Big Data Analytics: A Zoo of Innovation

Page 7: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Big Data Analytics: A Zoo of Innovation

Page 8: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Big Data Analytics: A Zoo of Innovation

Page 9: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

What's Working?

Many Excellent Innovations Have Come From Big Data Analytics:

• Distributed & data-parallel processing is disruptive … because we needed it

• We now have massive throughput … and solved the ETL problem

• The Data Hub/Lake Is Possible

Page 10: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

What Needs to Improve? Go Beyond MapReduce

MapReduce is a Very Powerful and Flexible Engine

Processing Throughput Previously Unobtainable on Commodity Equipment

But MapReduce Isn’t Enough:

• Essentially Batch-only

• Inefficient with respect to memory use, latency

• Too Hard to Program

Page 11: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

What Needs to Improve? Go Beyond (S)QL

SQL Support Has Been A Welcome Interface on Many Platforms

And in many cases, a faster alternative

But SQL Is Often Not Enough:

• Sometimes you want to write real programs (loops, variables, functions, existing libraries) but don’t want to build UDFs

• Machine Learning (see above, plus iterative)

• Multi-step pipelines

• Often requires an additional system

Page 12: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

What Needs to Improve? Ease of Use

Big Data Distributions Provide a Number of Useful Tools and Systems

Choices are Good to Have

But This Is Often Unsatisfactory:

• Each new system has its own configs, APIs, and management; coordinating multiple systems is challenging

• A typical solution requires stringing together disparate systems - we need unification

• Developers want the full power of their programming language

Page 13: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

What Needs to Improve? Latency

Big Data systems are throughput-oriented

Some new SQL Systems provide interactivity

But We Need More:

• Interactivity beyond SQL interfaces

• Repeated access of the same datasets (i.e. caching)

Page 14: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Can Spark Solve These Problems?

Page 15: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Apache Spark

Originally developed in 2009 in UC Berkeley’s AMPLab

Fully open sourced in 2010 – now at Apache Software Foundation

http://spark.apache.org

Page 16: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Project Activity

                         June 2013    June 2014
total contributors              68          255
companies contributing          17           50
total lines of code         63,000      175,000

Page 17: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Project Activity

Page 18: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Compared to Other Projects

[Chart: commits and lines of code changed over the past 6 months, Spark compared with other Hadoop-ecosystem projects]

Page 19: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Compared to Other Projects

[Chart: commits and lines of code changed over the past 6 months, Spark compared with other Hadoop-ecosystem projects]

Spark is now the most active project in the Hadoop ecosystem

Page 20: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Spark on GitHub

So active on GitHub, sometimes we break it

Over 1200 Forks (can’t display Network Graphs)

~80 commits to master each week

So many PRs, we built our own PR UI

Page 21: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Apache Spark - Easy to Use And Very Fast

Fast and general cluster computing system interoperable with Big Data Systems Like Hadoop and Cassandra

Improved Efficiency:
• In-memory computing primitives
• General computation graphs

Improved Usability:
• Rich APIs
• Interactive shell

Page 22: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Apache Spark - Easy to Use And Very Fast

Fast and general cluster computing system interoperable with Big Data Systems Like Hadoop and Cassandra

Improved Efficiency:
• In-memory computing primitives
• General computation graphs

Improved Usability:
• Rich APIs
• Interactive shell

Up to 100× faster (2-10× on disk)

2-5× less code

Page 23: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Apache Spark - A Robust SDK for Big Data Applications

[Diagram: Spark Core with SQL, Machine Learning, Streaming, and Graph libraries layered on top]

Unified System With Libraries to Build a Complete Solution

Full-featured Programming Environment in Scala, Java, Python…

Very developer-friendly, functional API for working with data

Runtimes available on several platforms
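As an illustration of how the libraries share the core API, here is a minimal Spark Streaming sketch in Scala; the host, port, and batch interval are arbitrary choices for the example, not from the talk. It counts words arriving on a socket using the same map/reduceByKey style as batch RDD code.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair DStream operations (Spark 1.x)

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-word-count")
    val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches

    // Lines of text arriving on a socket, e.g. started with `nc -lk 9999`
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print()            // print each batch's word counts
    ssc.start()
    ssc.awaitTermination()
  }
}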

Page 24: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Spark Is A Part Of Most Big Data Platforms

• All Major Hadoop Distributions Include Spark

• Spark Is Also Integrated With Non-Hadoop Big Data Platforms like DSE

• Spark Applications Can Be Written Once and Deployed Anywhere

[Diagram: Spark Core with SQL, Machine Learning, Streaming, and Graph libraries layered on top]

Deploy Spark Apps Anywhere

Page 25: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Get Started Immediately

Interactive Shell
Multi-language support

Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("ERROR"); }
}).count();

Page 26: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Clean API

Resilient Distributed Datasets

• Collections of objects spread across a cluster, stored in RAM or on Disk

• Built through parallel transformations

• Automatically rebuilt on failure

Operations

• Transformations (e.g. map, filter, groupBy)

• Actions (e.g. count, collect, save)

Write programs in terms of transformations on distributed datasets
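A minimal Scala sketch of this model (the file name and contents are hypothetical): transformations such as filter are lazy and only record lineage, while actions such as count and take trigger execution; cache() keeps partitions in RAM for reuse.

import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-basics"))

    val lines  = sc.textFile("events.txt")            // base RDD (lazy)
    val errors = lines.filter(_.contains("ERROR"))    // transformation: nothing runs yet
    errors.cache()                                    // keep partitions in RAM after first use

    val total  = errors.count()                       // action: triggers the job
    val sample = errors.take(5)                       // action: served from the cache

    println(s"$total errors; sample: ${sample.mkString(" | ")}")
    sc.stop()
  }
}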

Page 27: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Expressive API

map reduce

Page 28: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Expressive API

map filter groupBy sort union join leftOuterJoin rightOuterJoin

reduce count fold reduceByKey groupByKey cogroup cross zip

sample take first partitionBy mapWith pipe save ...
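A small Scala sketch combining a few of the operations above (the datasets are made up for illustration): map, reduceByKey, join, and filter over pair RDDs.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair RDD operations (Spark 1.x)

object ExpressiveApiDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("expressive-api"))

    // (userId, page) click events and (userId, country) profiles
    val clicks   = sc.parallelize(Seq((1, "home"), (2, "docs"), (1, "docs")))
    val profiles = sc.parallelize(Seq((1, "US"), (2, "DE")))

    val clicksPerUser = clicks
      .map { case (user, _) => (user, 1) }   // map each click to (user, 1)
      .reduceByKey(_ + _)                    // count clicks per user

    val active = clicksPerUser
      .join(profiles)                        // join on userId
      .filter { case (_, (count, _)) => count > 1 }

    active.collect().foreach(println)        // e.g. (1,(2,US))
    sc.stop()
  }
}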

Page 29: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Example – Word Count

Hadoop MapReduce:

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Spark:

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Page 30: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Example – Word Count


Page 31: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Works Well With Hadoop

Data Compatibility

• Access your existing Hadoop Data

• Use the same data formats

• Adheres to data locality for efficient processing


Deployment Models

• “Standalone” deployment

• YARN-based deployment

• Mesos-based deployment

• Deploy on existing Hadoop cluster or side-by-side
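As a hedged sketch of the data-compatibility point (the HDFS paths and cluster name are hypothetical), Spark reads and writes the same HDFS data and Hadoop file formats directly, with data locality:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // Writable converters for sequenceFile (Spark 1.x)

object HadoopCompat {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hadoop-compat"))

    // Plain text already in HDFS, read with data locality
    val logs = sc.textFile("hdfs://namenode:8020/logs/2014/part-*")

    // Existing SequenceFiles can be read with their key/value types
    val counts = sc.sequenceFile[String, Int]("hdfs://namenode:8020/counts/")
    println(counts.count())

    // Results go back out in Hadoop-friendly formats
    logs.filter(_.contains("ERROR"))
        .saveAsTextFile("hdfs://namenode:8020/errors/")

    sc.stop()
  }
}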

Page 32: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Example: Logistic Regression

data = spark.textFile(...).map(readPoint).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data \
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x) \
        .reduce(lambda x, y: x + y)
    w -= gradient

print "Final w: %s" % w

Page 33: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Fast: Using RAM, Operator Graphs

In-memory Caching

• Data Partitions read from RAM instead of disk

Operator Graphs

• Scheduling Optimizations

• Fault Tolerance

[Diagram: RDD operator graph with RDDs A-F connected by map, filter, groupBy, and join edges, split into Stages 1-3; the legend marks RDDs and cached partitions]

Page 34: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Fast: Logistic Regression Performance

[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark. Hadoop: ~110 s per iteration. Spark: ~80 s for the first iteration, ~1 s for each further iteration.]

Page 35: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Fast: Scales Down Seamlessly

[Chart: execution time (s) vs. % of the working set in cache (cache disabled, 25%, 50%, 75%, fully cached). Execution time falls from roughly 69 s with the cache disabled to roughly 12 s when fully cached.]

Page 36: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Fault Recovery

RDDs track lineage information that can be used to efficiently recompute lost data

msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])

[Diagram: HDFS File → filter(func = startswith(...)) → Filtered RDD → map(func = split(...)) → Mapped RDD]

Page 37: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

How Spark Works

Page 38: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Working With RDDs

Page 39: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Working With RDDs

RDD

textFile = sc.textFile("SomeFile.txt")

Page 40: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Working With RDDs

[Diagram: a chain of RDDs linked by transformations]

linesWithSpark = textFile.filter(lambda line: "Spark" in line)

textFile = sc.textFile("SomeFile.txt")

Page 41: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Working With RDDs

[Diagram: transformations build a chain of RDDs; an action returns a value]

linesWithSpark = textFile.filter(lambda line: "Spark" in line)

linesWithSpark.count()   # returns 74
linesWithSpark.first()   # returns "# Apache Spark"

textFile = sc.textFile("SomeFile.txt")

Page 42: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

Page 43: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

[Diagram: a Driver coordinating three Workers]

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

Page 44: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")

[Diagram: a Driver coordinating three Workers]

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

Page 45: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))

[Diagram: a Driver coordinating three Workers]

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

Page 46: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

Page 47: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()

[Diagram: a Driver coordinating three Workers]

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

Page 48: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()   # Action

[Diagram: a Driver coordinating three Workers]

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

Page 49: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()

[Diagram: a Driver coordinating three Workers, each holding an HDFS block (Block 1, Block 2, Block 3)]

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

Page 50: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()

[Diagram: the Driver sends tasks to the three Workers holding Blocks 1-3]

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

Page 51: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()

[Diagram: each Worker reads its HDFS block (Blocks 1-3)]

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

Page 52: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()

[Diagram: each Worker processes its block and caches the data (Cache 1, Cache 2, Cache 3)]

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

Page 53: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()

[Diagram: each Worker returns its results to the Driver]

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

Page 54: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

[Diagram: a second query runs against the Workers’ caches (Cache 1-3)]

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

Page 55: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

[Diagram: the Driver sends tasks for the second query to the three Workers]

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

Page 56: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

[Diagram: each Worker processes the second query directly from its cache]

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

Page 57: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

[Diagram: each Worker returns the second query’s results to the Driver]

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

Page 58: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Example: Log Mining

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

[Diagram: the Driver and three Workers with cached partitions]

Cache your data ➔ Faster Results

Full-text search of Wikipedia:
• 60 GB on 20 EC2 machines
• 0.5 sec from cache vs. 20 s on-disk

Load error messages from a log into memory, then interactively search for various patterns

Page 59: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Cassandra + Spark: A Great Combination

Both are Easy to Use

Spark Can Help You Bridge Your Hadoop and Cassandra Systems

Use Spark Libraries and Caching on top of Cassandra-stored Data

Combine Spark Streaming with Cassandra Storage

Datastax spark-cassandra-connector: https://github.com/datastax/spark-cassandra-connector
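A hedged Scala sketch against the spark-cassandra-connector’s 1.x API (the keyspace, table, and column names are made up for illustration): read a Cassandra table as an RDD, aggregate it with ordinary Spark operations, and write the results back to Cassandra.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._            // pair RDD operations (Spark 1.x)
import com.datastax.spark.connector._             // cassandraTable, saveToCassandra

object CassandraSparkExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-spark")
      .set("spark.cassandra.connection.host", "127.0.0.1")   // hypothetical Cassandra node
    val sc = new SparkContext(conf)

    // Read a Cassandra table as an RDD of CassandraRow
    val users = sc.cassandraTable("demo_ks", "users")

    // Aggregate with ordinary Spark operations
    val usersByCountry = users
      .map(row => (row.getString("country"), 1))
      .reduceByKey(_ + _)

    // Write the aggregated results back to Cassandra
    usersByCountry.saveToCassandra("demo_ks", "users_by_country",
      SomeColumns("country", "user_count"))

    sc.stop()
  }
}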

Page 60: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Schema RDDs (Spark SQL)

• Built-in mechanism for recognizing structured data in Spark

• Allows the system to apply data access and relational optimizations (e.g. predicate push-down, partition pruning, broadcast joins)

• Columnar in-memory representation when cached

• Native support for structured formats like Parquet and JSON

• Great compatibility with the rest of the stack (Python, libraries, etc.)
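A minimal Scala sketch against the Spark 1.1-era SchemaRDD API (the input file and schema are hypothetical): a case class gives an RDD a schema, the table is cached in the columnar in-memory format, and SQL runs alongside regular RDD code.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object SchemaRddExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("schema-rdd"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD          // implicit RDD -> SchemaRDD conversion

    // Lines like "Alice,29" turned into an RDD of case classes
    val people = sc.textFile("people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))

    people.registerTempTable("people")         // expose the schema to SQL
    sqlContext.cacheTable("people")            // columnar in-memory cache

    val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
    adults.map(row => row(0)).collect().foreach(println)

    // Structured formats are supported natively, e.g. JSON:
    // val events = sqlContext.jsonFile("events.json")

    sc.stop()
  }
}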

Page 61: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Thank You!

Visit http://databricks.com: blogs, tutorials, and more


Questions?