Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Apache Spark: Easy and Fast Big Data Analytics
Pat McDonough

Description

Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high performance big data applications today.

Transcript of Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Page 1: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Apache Spark: Easy and Fast Big Data Analytics

Pat McDonough

Page 2: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Founded by the creators of Apache Spark out of UC Berkeley’s AMPLab

Fully committed to 100% open source Apache Spark

Support and Grow the Spark Community and Ecosystem

Building Databricks Cloud

Page 3: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Databricks & Datastax: Apache Spark is packaged as part of Datastax Enterprise Analytics 4.5

Databricks & Datastax Have Partnered for Apache Spark Engineering and Support

Page 4: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Big Data Analytics: Where We’ve Been

• 2003 & 2004 - Google GFS & MapReduce Papers are Precursors to Hadoop

• 2006 & 2007 - Google BigTable and Amazon Dynamo Papers are Precursors to Cassandra, HBase, and Others

Page 5: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Big Data Analytics: A Zoo of Innovation

Page 6: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Big Data Analytics: A Zoo of Innovation

Page 7: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Big Data Analytics: A Zoo of Innovation

Page 8: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Big Data Analytics: A Zoo of Innovation

Page 9: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

What's Working?

Many Excellent Innovations Have Come From Big Data Analytics:

• Distributed & Data Parallel is disruptive ... because we needed it

• We Now Have Massive throughput… Solved the ETL Problem

• The Data Hub/Lake Is Possible

Page 10: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

What Needs to Improve? Go Beyond MapReduce

MapReduce is a Very Powerful and Flexible Engine, with Processing Throughput Previously Unobtainable on Commodity Equipment

But MapReduce Isn’t Enough:

• Essentially Batch-only

• Inefficient with respect to memory use, latency

• Too Hard to Program

Page 11: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

What Needs to Improve? Go Beyond (S)QL

SQL Support Has Been a Welcome Interface on Many Platforms, and in Many Cases a Faster Alternative

But SQL Is Often Not Enough:

• Sometimes you want to write real programs (Loops, variables, functions, existing libraries) but don’t want to build UDFs.

• Machine Learning (see above, plus iterative)

• Multi-step pipelines

• Often an Additional System

Page 12: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

What Needs to Improve? Ease of Use

Big Data Distributions Provide a Number of Useful Tools and Systems, and Choices Are Good to Have

But This Is Often Unsatisfactory:

• Each new system has its own configs, APIs, and management; coordinating multiple systems is challenging

• A typical solution requires stringing together disparate systems - we need unification

• Developers want the full power of their programming language

Page 13: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

What Needs to Improve? Latency

Big Data systems are throughput-oriented

Some new SQL Systems provide interactivity

But We Need More:

• Interactivity beyond SQL interfaces

• Repeated access of the same datasets (i.e. caching)

Page 14: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Can Spark Solve These Problems?

Page 15: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Apache Spark: Originally developed in 2009 in UC Berkeley’s AMPLab

Fully open sourced in 2010 – now at Apache Software Foundation

http://spark.apache.org

Page 16: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Project Activity             June 2013    June 2014
Total contributors           68           255
Companies contributing       17           50
Total lines of code          63,000       175,000

Page 17: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Project Activity             June 2013    June 2014
Total contributors           68           255
Companies contributing       17           50
Total lines of code          63,000       175,000

Page 18: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Compared to Other Projects

[Chart: Commits and Lines of Code Changed over the past 6 months, Spark vs. other projects]

Page 19: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Compared to Other Projects

[Chart: Commits and Lines of Code Changed over the past 6 months, Spark vs. other projects]

Spark is now the most active project in the Hadoop ecosystem

Page 20: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Spark on GitHub: So active on GitHub, sometimes we break it

Over 1200 Forks (can’t display Network Graphs)

~80 commits to master each week

So Many PRs, We Built Our Own PR UI

Page 21: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Apache Spark - Easy to Use And Very Fast

Fast and general cluster computing system interoperable with Big Data Systems Like Hadoop and Cassandra

Improved Efficiency:
• In-memory computing primitives
• General computation graphs

Improved Usability:
• Rich APIs
• Interactive shell

Page 22: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Apache Spark - Easy to Use And Very Fast

Fast and general cluster computing system interoperable with Big Data Systems Like Hadoop and Cassandra

Improved Efficiency:
• In-memory computing primitives
• General computation graphs

Improved Usability:
• Rich APIs
• Interactive shell

Up to 100× faster (2-10× on disk)

2-5× less code

Page 23: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Apache Spark - A Robust SDK for Big Data Applications

[Diagram: SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core]

Unified System With Libraries to Build a Complete Solution

Full-featured Programming Environment in Scala, Java, Python…

Very developer-friendly, functional API for working with Data

Runtimes available on several platforms

Page 24: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Spark Is A Part Of Most Big Data Platforms

• All Major Hadoop Distributions Include Spark

• Spark Is Also Integrated With Non-Hadoop Big Data Platforms like DSE

• Spark Applications Can Be Written Once and Deployed Anywhere

[Diagram: SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core]

Deploy Spark Apps Anywhere

Page 25: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Get Started Immediately

Interactive Shell with Multi-language Support

Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) { return s.contains("ERROR"); }
}).count();

Page 26: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Clean API

Resilient Distributed Datasets

• Collections of objects spread across a cluster, stored in RAM or on Disk

• Built through parallel transformations

• Automatically rebuilt on failure

Operations

• Transformations (e.g. map, filter, groupBy)

• Actions (e.g. count, collect, save)

Write programs in terms of transformations on distributed datasets
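For example, a minimal Scala sketch of this model (the application name and file path below are illustrative, not from the talk): transformations stay lazy until an action runs.

import org.apache.spark.{SparkConf, SparkContext}

object RDDBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-basics"))

    // Transformations build new RDDs lazily; nothing executes yet.
    val lines  = sc.textFile("hdfs://.../app.log")   // illustrative path
    val errors = lines.filter(_.contains("ERROR"))

    // Mark the RDD to be kept in RAM once it has been computed.
    errors.cache()

    // Actions trigger the computation and return results to the driver.
    println("error lines: " + errors.count())
    errors.take(3).foreach(println)

    sc.stop()
  }
}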

Page 27: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Expressive API: map, reduce

Page 28: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Expressive API: map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
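As a small illustration (assuming an existing SparkContext sc and made-up data), a few of these operators combined on key-value pairs:

import org.apache.spark.SparkContext._   // pair-RDD operations (needed on Spark 1.x)

val clicks = sc.parallelize(Seq(("alice", 1), ("bob", 1), ("alice", 1)))
val names  = sc.parallelize(Seq(("alice", "Alice A."), ("bob", "Bob B.")))

val counts = clicks.reduceByKey(_ + _)   // ("alice", 2), ("bob", 1)
val joined = counts.join(names)          // ("alice", (2, "Alice A.")), ("bob", (1, "Bob B."))

joined.sortByKey().take(10).foreach(println)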

Page 29: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Example – Word Count

Hadoop MapReduce:

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Spark:

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Page 30: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Example – Word Count


Page 31: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Works Well With Hadoop

Data Compatibility

• Access your existing Hadoop Data

• Use the same data formats

• Adheres to data locality for efficient processing


Deployment Models

• “Standalone” deployment

• YARN-based deployment

• Mesos-based deployment

• Deploy on existing Hadoop cluster or side-by-side
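As a rough sketch of that portability (host names and ports below are placeholders), the same application code can target each of these deployment models just by changing the master setting:

import org.apache.spark.{SparkConf, SparkContext}

// Standalone cluster manager
val standalone = new SparkConf().setAppName("my-app").setMaster("spark://master-host:7077")

// YARN on an existing Hadoop cluster (client mode)
val onYarn     = new SparkConf().setAppName("my-app").setMaster("yarn-client")

// Mesos
val onMesos    = new SparkConf().setAppName("my-app").setMaster("mesos://mesos-host:5050")

val sc = new SparkContext(standalone)   // the application code itself does not change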

Page 32: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Example: Logistic Regression

# Sketch from the slide; assumes spark is a SparkContext, readPoint parses a
# text line into a point with fields x (numpy array) and y (label in {-1, +1}),
# D is the number of features, and iterations is the number of gradient steps.
from math import exp
import numpy

data = spark.textFile(...).map(readPoint).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data.map(
        lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient

print "Final w: %s" % w

Page 33: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Fast: Using RAM, Operator Graphs

In-memory Caching

• Data Partitions read from RAM instead of disk

Operator Graphs

• Scheduling Optimizations

• Fault Tolerance

[Diagram: An operator graph of RDDs (map, filter, groupBy, join) split into Stages 1-3, with cached partitions kept in RAM]
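A rough Scala sketch of the idea (assuming the Spark shell, where sc is predefined; the input path and field positions are invented): one cached RDD feeds two shuffles, so the scheduler builds a multi-stage operator graph and reuses the in-memory partitions.

val events = sc.textFile("hdfs://.../events.tsv")
val fields = events.map(_.split("\t")).cache()   // partitions stay in RAM after first use

// Two derived datasets; each reduceByKey introduces a shuffle (a new stage).
val clicksByUser = fields.filter(_(2) == "click").map(f => (f(0), 1)).reduceByKey(_ + _)
val viewsByUser  = fields.filter(_(2) == "view").map(f => (f(0), 1)).reduceByKey(_ + _)

val joined = clicksByUser.join(viewsByUser)

// Print the lineage/stage structure Spark uses for scheduling and fault recovery.
println(joined.toDebugString)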

Page 34: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Fast: Logistic Regression Performance

[Chart: Running time (s) vs. number of iterations (1, 5, 10, 20, 30), Hadoop vs. Spark]

Hadoop: 110 s per iteration
Spark: first iteration 80 s, further iterations 1 s

Page 35: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Fast: Scales Down Seamlessly

[Chart: Execution time (s) vs. % of working set in cache]

Cache disabled    68.8 s
25% cached        58.1 s
50% cached        40.7 s
75% cached        29.7 s
Fully cached      11.5 s

Page 36: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Fault Recovery: RDDs track lineage information that can be used to efficiently recompute lost data

msgs = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])

[Diagram: HDFS File -> filter(func = startswith(...)) -> Filtered RDD -> map(func = split(...)) -> Mapped RDD]

Page 37: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

How Spark Works

Page 38: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Working With RDDs

Page 39: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Working With RDDs

RDD

textFile = sc.textFile("SomeFile.txt")

Page 40: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Working With RDDs

[Diagram: RDDs produced by successive transformations]

Transformations

linesWithSpark = textFile.filter(lambda line: "Spark" in line)

textFile = sc.textFile("SomeFile.txt")

Page 41: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Working With RDDs

[Diagram: RDDs produced by transformations; an action returns a value to the driver]

Transformations

Action → Value

linesWithSpark = textFile.filter(lambda line: "Spark" in line)

linesWithSpark.count()  # 74
linesWithSpark.first()  # "# Apache Spark"

textFile = sc.textFile("SomeFile.txt")

Page 42: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 43: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

[Diagram: Driver and three Workers]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 44: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")

[Diagram: Driver and three Workers]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 45: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))

[Diagram: Driver and three Workers]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 46: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))

[Diagram: Driver and three Workers]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 47: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()

[Diagram: Driver and three Workers]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 48: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()  # Action

[Diagram: Driver and three Workers]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 49: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()

[Diagram: Driver and three Workers, each holding an HDFS block (Blocks 1-3)]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 50: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()

[Diagram: Driver sends tasks to the three Workers holding Blocks 1-3]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 51: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()

[Diagram: Each Worker reads its HDFS block (Blocks 1-3)]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 52: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()

[Diagram: Each Worker processes and caches its data (Caches 1-3)]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 53: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()

[Diagram: Each Worker returns its results to the Driver; the data stays cached (Caches 1-3)]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 54: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

[Diagram: A second query is issued to the Workers holding Blocks 1-3 and Caches 1-3]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 55: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

[Diagram: Driver sends tasks for the second query to the Workers (Caches 1-3)]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 56: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

[Diagram: Each Worker processes the second query from its cache]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 57: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

[Diagram: Each Worker returns results for the second query to the Driver]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 58: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Example: Log Mining

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

[Diagram: Driver and Workers with Blocks 1-3 and Caches 1-3]

Cache your data ➔ Faster Results
Full-text search of Wikipedia:
• 60 GB on 20 EC2 machines
• 0.5 sec from cache vs. 20 s on-disk

Load error messages from a log into memory, then interactively search for various patterns

Page 59: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Cassandra + Spark: A Great Combination

Both are Easy to Use

Spark Can Help You Bridge Your Hadoop and Cassandra Systems

Use Spark Libraries, Caching on-top of Cassandra-stored Data

Combine Spark Streaming with Cassandra Storage

Datastax spark-cassandra-connector: https://github.com/datastax/spark-cassandra-connector
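A hedged sketch using the connector (the keyspace, table, and column names are made up, and exact API details depend on the connector version):

import com.datastax.spark.connector._
import org.apache.spark.SparkContext._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("spark-cassandra-demo")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Read a Cassandra table as an RDD and aggregate it with ordinary Spark operators.
val purchases = sc.cassandraTable("shop", "purchases")   // hypothetical keyspace/table
val totals = purchases
  .map(row => (row.getString("user_id"), row.getDouble("amount")))
  .reduceByKey(_ + _)

// Write the aggregate back out to another (hypothetical) table.
totals.saveToCassandra("shop", "totals_by_user", SomeColumns("user_id", "total"))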

Page 60: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Schema RDDs (Spark SQL)

• Built-in Mechanism for recognizing Structured data in Spark

• Allow for systems to apply several data access and relational optimizations (e.g. predicate push-down, partition pruning, broadcast joins)

• Columnar in-memory representation when cached

• Native Support for structured formats like parquet, JSON

• Great Compatibility with the Rest of the Stack (python, libraries, etc.)
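A minimal sketch against the Spark-1.x-era SQL API (the Person class and rows are invented; in the earliest 1.0 releases registerTempTable was named registerAsTable):

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicitly turns an RDD of case classes into a SchemaRDD

val people = sc.parallelize(Seq(Person("Ann", 34), Person("Bo", 19)))
people.registerTempTable("people")

// Relational queries over the structured data, with a columnar in-memory cache.
sqlContext.cacheTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 21")
adults.collect().foreach(println)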

Page 61: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Thank You!

Visit http://databricks.com: Blogs, Tutorials and more


Questions?