Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Apache Spark: Easy and Fast Big Data Analytics
Pat McDonough

Description

Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high performance big data applications today.

Transcript of Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Page 1: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Apache Spark: Easy and Fast Big Data Analytics

Pat McDonough

Page 2: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Founded by the creators of Apache Spark out of UC Berkeley’s AMPLab

Fully committed to 100% open source Apache Spark

Support and Grow the Spark Community and Ecosystem

Building Databricks Cloud

Page 3: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Databricks & Datastax: Apache Spark is packaged as part of Datastax Enterprise Analytics 4.5

Databricks & Datastax Have Partnered for Apache Spark Engineering and Support

Page 4: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Big Data Analytics: Where We’ve Been

• 2003 & 2004 - Google GFS & MapReduce Papers are Precursors to Hadoop

• 2006 & 2007 - Google BigTable and Amazon Dynamo Papers are Precursors to Cassandra, HBase, and Others

Page 5: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Big Data Analytics: A Zoo of Innovation

Page 6: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Big Data Analytics: A Zoo of Innovation

Page 7: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Big Data Analytics: A Zoo of Innovation

Page 8: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Big Data Analytics: A Zoo of Innovation

Page 9: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

What's Working?

Many Excellent Innovations Have Come From Big Data Analytics:

• Distributed & Data Parallel is disruptive ... because we needed it

• We Now Have Massive throughput… Solved the ETL Problem

• The Data Hub/Lake Is Possible

Page 10: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

What Needs to Improve? Go Beyond MapReduce

MapReduce is a Very Powerful and Flexible Engine, with Processing Throughput Previously Unobtainable on Commodity Equipment

But MapReduce Isn’t Enough:

• Essentially Batch-only

• Inefficient with respect to memory use, latency

• Too Hard to Program

Page 11: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

What Needs to Improve? Go Beyond (S)QL

SQL Support Has Been a Welcome Interface on Many Platforms, and in Many Cases a Faster Alternative

But SQL Is Often Not Enough:

• Sometimes you want to write real programs (Loops, variables, functions, existing libraries) but don’t want to build UDFs.

• Machine Learning (see above, plus iterative)

• Multi-step pipelines

• Often an Additional System

Page 12: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

What Needs to Improve? Ease of Use

Big Data Distributions Provide a Number of Useful Tools and Systems, and Choices Are Good to Have

But This Is Often Unsatisfactory:

• Each new system has its own configs, APIs, and management; coordinating multiple systems is challenging

• A typical solution requires stringing together disparate systems - we need unification

• Developers want the full power of their programming language

Page 13: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

What Needs to Improve? Latency

Big Data systems are throughput-oriented

Some new SQL Systems provide interactivity

But We Need More:

• Interactivity beyond SQL interfaces

• Repeated access of the same datasets (i.e. caching)

Page 14: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Can Spark Solve These Problems?

Page 15: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Apache Spark: Originally developed in 2009 in UC Berkeley’s AMPLab

Fully open sourced in 2010 – now at Apache Software Foundation

http://spark.apache.org

Page 16: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Project Activity             June 2013    June 2014
Total contributors           68           255
Companies contributing       17           50
Total lines of code          63,000       175,000

Page 17: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Project Activity             June 2013    June 2014
Total contributors           68           255
Companies contributing       17           50
Total lines of code          63,000       175,000

Page 18: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Compared to Other Projects

[Chart: Commits and Lines of Code Changed over the past 6 months, Spark vs. other projects]

Page 19: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Compared to Other Projects

[Chart: Commits and Lines of Code Changed over the past 6 months, Spark vs. other projects]

Spark is now the most active project in the Hadoop ecosystem

Page 20: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Spark on GitHub: So active on GitHub, sometimes we break it

Over 1200 Forks (can’t display Network Graphs)

~80 commits to master each week

So Many PRs, We Built Our Own PR UI

Page 21: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Apache Spark - Easy to Use And Very Fast

Fast and general cluster computing system interoperable with Big Data Systems Like Hadoop and Cassandra

Improved Efficiency:
• In-memory computing primitives
• General computation graphs

Improved Usability:
• Rich APIs
• Interactive shell

Page 22: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Apache Spark - Easy to Use And Very Fast

Fast and general cluster computing system interoperable with Big Data Systems Like Hadoop and Cassandra

Improved Efficiency:
• In-memory computing primitives
• General computation graphs

Improved Usability:
• Rich APIs
• Interactive shell

Up to 100× faster (2-10× on disk)

2-5× less code

Page 23: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Apache Spark - A Robust SDK for Big Data Applications

[Diagram: SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core]

Unified System With Libraries to Build a Complete Solution

Full-featured Programming Environment in Scala, Java, Python…

Very developer-friendly, functional API for working with Data

Runtimes available on several platforms

Page 24: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Spark Is A Part Of Most Big Data Platforms

• All Major Hadoop Distributions Include Spark

• Spark Is Also Integrated With Non-Hadoop Big Data Platforms like DSE

• Spark Applications Can Be Written Once and Deployed Anywhere

[Diagram: SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core]

Deploy Spark Apps Anywhere

Page 25: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Get Started Immediately

Interactive Shell with Multi-language Support

Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) { return s.contains("ERROR"); }
}).count();

Page 26: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Clean API

Resilient Distributed Datasets

• Collections of objects spread across a cluster, stored in RAM or on Disk

• Built through parallel transformations

• Automatically rebuilt on failure

Operations

• Transformations (e.g. map, filter, groupBy)

• Actions (e.g. count, collect, save)

Write programs in terms of transformations on distributed datasets
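For example, a minimal Scala sketch of this model (the application name and file path below are illustrative, not from the talk): transformations stay lazy until an action runs.

import org.apache.spark.{SparkConf, SparkContext}

object RDDBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-basics"))

    // Transformations build new RDDs lazily; nothing executes yet.
    val lines  = sc.textFile("hdfs://.../app.log")   // illustrative path
    val errors = lines.filter(_.contains("ERROR"))

    // Mark the RDD to be kept in RAM once it has been computed.
    errors.cache()

    // Actions trigger the computation and return results to the driver.
    println("error lines: " + errors.count())
    errors.take(3).foreach(println)

    sc.stop()
  }
}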

Page 27: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Expressive API: map, reduce

Page 28: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Expressive API: map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
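As a small illustration (assuming an existing SparkContext sc and made-up data), a few of these operators combined on key-value pairs:

import org.apache.spark.SparkContext._   // pair-RDD operations (needed on Spark 1.x)

val clicks = sc.parallelize(Seq(("alice", 1), ("bob", 1), ("alice", 1)))
val names  = sc.parallelize(Seq(("alice", "Alice A."), ("bob", "Bob B.")))

val counts = clicks.reduceByKey(_ + _)   // ("alice", 2), ("bob", 1)
val joined = counts.join(names)          // ("alice", (2, "Alice A.")), ("bob", (1, "Bob B."))

joined.sortByKey().take(10).foreach(println)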

Page 29: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Example – Word Count

Hadoop MapReduce:

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Spark:

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Page 30: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Example – Word Count


Page 31: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Works Well With Hadoop

Data Compatibility

• Access your existing Hadoop Data

• Use the same data formats

• Adheres to data locality for efficient processing


Deployment Models

• “Standalone” deployment

• YARN-based deployment

• Mesos-based deployment

• Deploy on existing Hadoop cluster or side-by-side
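As a rough sketch of that portability (host names and ports below are placeholders), the same application code can target each of these deployment models just by changing the master setting:

import org.apache.spark.{SparkConf, SparkContext}

// Standalone cluster manager
val standalone = new SparkConf().setAppName("my-app").setMaster("spark://master-host:7077")

// YARN on an existing Hadoop cluster (client mode)
val onYarn     = new SparkConf().setAppName("my-app").setMaster("yarn-client")

// Mesos
val onMesos    = new SparkConf().setAppName("my-app").setMaster("mesos://mesos-host:5050")

val sc = new SparkContext(standalone)   // the application code itself does not change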

Page 32: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Example: Logistic Regression

# Sketch from the slide; assumes spark is a SparkContext, readPoint parses a
# text line into a point with fields x (numpy array) and y (label in {-1, +1}),
# D is the number of features, and iterations is the number of gradient steps.
from math import exp
import numpy

data = spark.textFile(...).map(readPoint).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data.map(
        lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient

print "Final w: %s" % w

Page 33: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Fast: Using RAM, Operator Graphs

In-memory Caching

• Data Partitions read from RAM instead of disk

Operator Graphs

• Scheduling Optimizations

• Fault Tolerance

[Diagram: An operator graph of RDDs (map, filter, groupBy, join) split into Stages 1-3, with cached partitions kept in RAM]
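A rough Scala sketch of the idea (assuming the Spark shell, where sc is predefined; the input path and field positions are invented): one cached RDD feeds two shuffles, so the scheduler builds a multi-stage operator graph and reuses the in-memory partitions.

val events = sc.textFile("hdfs://.../events.tsv")
val fields = events.map(_.split("\t")).cache()   // partitions stay in RAM after first use

// Two derived datasets; each reduceByKey introduces a shuffle (a new stage).
val clicksByUser = fields.filter(_(2) == "click").map(f => (f(0), 1)).reduceByKey(_ + _)
val viewsByUser  = fields.filter(_(2) == "view").map(f => (f(0), 1)).reduceByKey(_ + _)

val joined = clicksByUser.join(viewsByUser)

// Print the lineage/stage structure Spark uses for scheduling and fault recovery.
println(joined.toDebugString)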

Page 34: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Fast: Logistic Regression Performance

[Chart: Running time (s) vs. number of iterations (1, 5, 10, 20, 30), Hadoop vs. Spark]

Hadoop: 110 s per iteration
Spark: first iteration 80 s, further iterations 1 s

Page 35: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Fast: Scales Down Seamlessly

[Chart: Execution time (s) vs. % of working set in cache]

Cache disabled    68.8 s
25% cached        58.1 s
50% cached        40.7 s
75% cached        29.7 s
Fully cached      11.5 s

Page 36: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Easy: Fault Recovery: RDDs track lineage information that can be used to efficiently recompute lost data

msgs = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])

[Diagram: HDFS File -> filter(func = startswith(...)) -> Filtered RDD -> map(func = split(...)) -> Mapped RDD]

Page 37: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

How Spark Works

Page 38: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Working With RDDs

Page 39: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Working With RDDs

RDD

textFile = sc.textFile("SomeFile.txt")

Page 40: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Working With RDDs

[Diagram: RDDs produced by successive transformations]

Transformations

linesWithSpark = textFile.filter(lambda line: "Spark" in line)

textFile = sc.textFile("SomeFile.txt")

Page 41: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Working With RDDs

[Diagram: RDDs produced by transformations; an action returns a value to the driver]

Transformations

Action → Value

linesWithSpark = textFile.filter(lambda line: "Spark" in line)

linesWithSpark.count()  # 74
linesWithSpark.first()  # "# Apache Spark"

textFile = sc.textFile("SomeFile.txt")

Page 42: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 43: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

[Diagram: Driver and three Workers]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 44: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")

[Diagram: Driver and three Workers]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 45: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))

[Diagram: Driver and three Workers]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 46: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))

[Diagram: Driver and three Workers]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 47: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()

[Diagram: Driver and three Workers]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 48: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()  # Action

[Diagram: Driver and three Workers]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 49: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()

[Diagram: Driver and three Workers, each holding an HDFS block (Blocks 1-3)]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 50: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()

[Diagram: Driver sends tasks to the three Workers holding Blocks 1-3]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 51: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()

[Diagram: Each Worker reads its HDFS block (Blocks 1-3)]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 52: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()

[Diagram: Each Worker processes and caches its data (Caches 1-3)]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 53: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()

[Diagram: Each Worker returns its results to the Driver; the data stays cached (Caches 1-3)]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 54: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

[Diagram: A second query is issued to the Workers holding Blocks 1-3 and Caches 1-3]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 55: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

[Diagram: Driver sends tasks for the second query to the Workers (Caches 1-3)]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 56: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

[Diagram: Each Worker processes the second query from its cache]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 57: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

[Diagram: Each Worker returns results for the second query to the Driver]

Example: Log Mining: Load error messages from a log into memory, then interactively search for various patterns

Page 58: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Example: Log Mining

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

[Diagram: Driver and Workers with Blocks 1-3 and Caches 1-3]

Cache your data ➔ Faster Results
Full-text search of Wikipedia:
• 60 GB on 20 EC2 machines
• 0.5 sec from cache vs. 20 s on-disk

Load error messages from a log into memory, then interactively search for various patterns

Page 59: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Cassandra + Spark: A Great Combination

Both are Easy to Use

Spark Can Help You Bridge Your Hadoop and Cassandra Systems

Use Spark Libraries, Caching on-top of Cassandra-stored Data

Combine Spark Streaming with Cassandra Storage

Datastax spark-cassandra-connector: https://github.com/datastax/spark-cassandra-connector
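A hedged sketch using the connector (the keyspace, table, and column names are made up, and exact API details depend on the connector version):

import com.datastax.spark.connector._
import org.apache.spark.SparkContext._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("spark-cassandra-demo")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Read a Cassandra table as an RDD and aggregate it with ordinary Spark operators.
val purchases = sc.cassandraTable("shop", "purchases")   // hypothetical keyspace/table
val totals = purchases
  .map(row => (row.getString("user_id"), row.getDouble("amount")))
  .reduceByKey(_ + _)

// Write the aggregate back out to another (hypothetical) table.
totals.saveToCassandra("shop", "totals_by_user", SomeColumns("user_id", "total"))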

Page 60: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Schema RDDs (Spark SQL)

• Built-in Mechanism for recognizing Structured data in Spark

• Allow for systems to apply several data access and relational optimizations (e.g. predicate push-down, partition pruning, broadcast joins)

• Columnar in-memory representation when cached

• Native Support for structured formats like parquet, JSON

• Great Compatibility with the Rest of the Stack (python, libraries, etc.)
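A minimal sketch against the Spark-1.x-era SQL API (the Person class and rows are invented; in the earliest 1.0 releases registerTempTable was named registerAsTable):

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicitly turns an RDD of case classes into a SchemaRDD

val people = sc.parallelize(Seq(Person("Ann", 34), Person("Bo", 19)))
people.registerTempTable("people")

// Relational queries over the structured data, with a columnar in-memory cache.
sqlContext.cacheTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 21")
adults.collect().foreach(println)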

Page 61: Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Thank You!

Visit http://databricks.com: Blogs, Tutorials and more


Questions?