Apache Spark: Easy and Fast Big Data Analytics
Pat McDonough
Founded by the creators of Apache Spark out of UC Berkeley’s AMPLab
Fully committed to 100% open source Apache Spark
Support and Grow the Spark Community and Ecosystem
Building Databricks Cloud
Databricks & Datastax
Apache Spark is packaged as part of Datastax Enterprise Analytics 4.5
Databricks & Datastax Have Partnered for Apache Spark Engineering and Support
Big Data Analytics: Where We’ve Been
• 2003 & 2004 - Google GFS & MapReduce Papers are Precursors to Hadoop
• 2006 & 2007 - Google BigTable and Amazon Dynamo Papers are Precursors to Cassandra, HBase, and Others
Big Data Analytics: A Zoo of Innovation
What's Working?
Many Excellent Innovations Have Come From Big Data Analytics:
• Distributed & Data Parallel is disruptive ... because we needed it
• We Now Have Massive Throughput… Solved the ETL Problem
• The Data Hub/Lake Is Possible
What Needs to Improve? Go Beyond MapReduce
MapReduce is a Very Powerful and Flexible Engine
Processing Throughput Previously Unobtainable on
Commodity Equipment
But MapReduce Isn’t Enough:
• Essentially Batch-only
• Inefficient with respect to memory use, latency
• Too Hard to Program
What Needs to Improve? Go Beyond (S)QL
SQL Support Has Been A Welcome Interface on Many
Platforms
And in many cases, a faster alternative
But SQL Is Often Not Enough:
• Sometimes you want to write real programs (Loops, variables, functions, existing libraries) but don’t want to build UDFs.
• Machine Learning (see above, plus iterative)
• Multi-step pipelines
• Often an Additional System
What Needs to Improve? Ease of Use
Big Data Distributions Provide a number of Useful Tools and
Systems
Choices are Good to Have
But This Is Often Unsatisfactory:
• Each new system has its own configs, APIs, and management; coordinating multiple systems is challenging
• A typical solution requires stringing together disparate systems - we need unification
• Developers want the full power of their programming language
What Needs to Improve? Latency
Big Data systems are throughput-oriented
Some new SQL Systems provide interactivity
But We Need More:
• Interactivity beyond SQL interfaces
• Repeated access of the same datasets (i.e. caching)
Can Spark Solve These Problems?
Apache Spark
Originally developed in 2009 in UC Berkeley’s AMPLab
Fully open sourced in 2010 – now at Apache Software Foundation
http://spark.apache.org
Project Activity: June 2013 → June 2014
• total contributors: 68 → 255
• companies contributing: 17 → 50
• total lines of code: 63,000 → 175,000
Compared to Other Projects
[Chart: commits and lines of code changed, activity in past 6 months, Spark vs. other projects]
Spark is now the most active project in the Hadoop ecosystem
Spark on Github: So active on Github, sometimes we break it
Over 1200 Forks (can’t display Network Graphs)
~80 commits to master each week
So Many PRs, We Built Our Own PR UI
Apache Spark - Easy to Use And Very Fast
Fast and general cluster computing system interoperable with Big Data Systems Like Hadoop and Cassandra
Improved Efficiency: • In-memory computing primitives
• General computation graphs
Improved Usability: • Rich APIs
• Interactive shell
Up to 100× faster (2-10× on disk)
2-5× less code
Apache Spark - A Robust SDK for Big Data Applications
SQL Machine Learning Streaming Graph
Core
Unified System With Libraries to Build a Complete Solution
Full-featured Programming Environment in Scala, Java, Python…
Very developer-friendly, Functional API for working with Data
Runtimes available on several platforms
Spark Is A Part Of Most Big Data Platforms
• All Major Hadoop Distributions Include Spark
• Spark Is Also Integrated With Non-Hadoop Big Data Platforms like DSE
• Spark Applications Can Be Written Once and Deployed Anywhere
[Diagram: SQL, Machine Learning, Streaming, and Graph libraries over the Spark Core]
Deploy Spark Apps Anywhere
Easy: Get Started Immediately
Interactive Shell • Multi-language support
Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("ERROR"); }
}).count();
Easy: Clean API
Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on Disk
• Built through parallel transformations
• Automatically rebuilt on failure
Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
Write programs in terms of transformations on distributed datasets
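The transformation/action split can be sketched in plain Python with lazy generators — a conceptual stand-in, not Spark's API: transformations only describe the pipeline, and nothing runs until an action consumes it.

```python
# "Transformations" as lazy generators: building these runs no computation
lines = ["ERROR db down", "INFO all good", "ERROR timeout"]
errors = (s for s in lines if s.startswith("ERROR"))   # like filter()
fields = (s.split(" ")[1] for s in errors)             # like map()

# "Action": forces the whole pipeline to run, like count() or collect()
collected = list(fields)
print(collected)        # ['db', 'timeout']
print(len(collected))   # 2
```

In Spark the same laziness lets the scheduler see the whole graph of transformations before running anything, which is what enables the optimizations on later slides.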
Easy: Expressive API
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin,
reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip,
sample, take, first, partitionBy, mapWith, pipe, save, ...
Easy: Example – Word Count

Hadoop MapReduce:

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Spark:

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
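To see what each step of the Spark version produces, the flatMap → map → reduceByKey chain can be mimicked in plain Python (an illustrative sketch, not Spark code):

```python
from collections import defaultdict

lines = ["to be or", "not to be"]

# flatMap(line => line.split(" ")): split each line and flatten the result
words = [w for line in lines for w in line.split(" ")]

# map(word => (word, 1)): pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey(_ + _): sum the counts for each distinct word
counts = defaultdict(int)
for w, n in pairs:
    counts[w] += n

print(dict(counts))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```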
Easy: Works Well With Hadoop
Data Compatibility
• Access your existing Hadoop Data
• Use the same data formats
• Adheres to data locality for efficient processing
Deployment Models
• “Standalone” deployment
• YARN-based deployment
• Mesos-based deployment
• Deploy on existing Hadoop cluster or side-by-side
Example: Logistic Regression
data = spark.textFile(...).map(readPoint).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data \
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x) \
        .reduce(lambda x, y: x + y)
    w -= gradient

print "Final w: %s" % w
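The same update can be checked locally without a cluster — toy 1-D data and a hand-rolled dot product stand in for Spark and NumPy (all names here are illustrative):

```python
import math

# Toy points: (x, y) with a single feature and label y in {-1, +1}
data = [(2.0, 1), (3.0, 1), (-1.5, -1), (-2.0, -1)]

w = 0.0
for _ in range(50):
    # Per-point term mirrors the slide: (1/(1+exp(-y*w*x)) - 1) * y * x
    gradient = sum((1.0 / (1.0 + math.exp(-y * w * x)) - 1.0) * y * x
                   for x, y in data)
    w -= gradient

# The learned weight separates the data: sign(w*x) matches y
assert all((w * x > 0) == (y > 0) for x, y in data)
```

The point of the Spark version is that cache() keeps the points in RAM across iterations, so only the first pass pays the cost of reading from disk.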
Fast: Using RAM, Operator Graphs
In-memory Caching
• Data Partitions read from RAM instead of disk
Operator Graphs
• Scheduling Optimizations
• Fault Tolerance
[Diagram: RDD operator graph — map, filter, groupBy, and join over RDDs and cached partitions, scheduled as Stages 1–3]
Fast: Logistic Regression Performance
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30), Hadoop vs. Spark]
Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 1 s
Fast: Scales Down Seamlessly
[Chart: execution time (s) vs. % of working set in cache]
Cache disabled: 68.8 s • 25%: 58.1 s • 50%: 40.7 s • 75%: 29.7 s • Fully cached: 11.5 s
Easy: Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data

msgs = textFile.filter(lambda s: s.startswith("ERROR"))
               .map(lambda s: s.split("\t")[2])

[Diagram: HDFS File → filter(func = startswith(…)) → Filtered RDD → map(func = split(...)) → Mapped RDD]
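Lineage-based recovery can be sketched with a toy class — each dataset remembers only its parent and the function that produced it, so lost data is rebuilt by replaying the chain from the source (`ToyRDD` is hypothetical, not Spark's implementation):

```python
class ToyRDD:
    """Toy lineage: store parent + transform instead of materialized data."""
    def __init__(self, source=None, parent=None, fn=None):
        self.source, self.parent, self.fn = source, parent, fn

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda data: [f(x) for x in data])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda data: [x for x in data if pred(x)])

    def compute(self):
        # Recovery = replay the lineage chain from the original source
        if self.parent is None:
            return list(self.source)
        return self.fn(self.parent.compute())

textFile = ToyRDD(source=["ERROR\t10:01\tdb down", "INFO\t10:02\tok"])
msgs = (textFile.filter(lambda s: s.startswith("ERROR"))
                .map(lambda s: s.split("\t")[2]))
print(msgs.compute())  # ['db down']
```

Because only the recipe is stored, a lost partition costs one recomputation rather than a replica's worth of storage.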
How Spark Works
Working With RDDs

textFile = sc.textFile("SomeFile.txt")

Transformations:
linesWithSpark = textFile.filter(lambda line: "Spark" in line)

Actions return a value:
linesWithSpark.count()   # 74
linesWithSpark.first()   # Apache Spark
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()   # Action: triggers the computation
messages.filter(lambda s: "php" in s).count()     # Answered entirely from cache

[Diagram: the Driver sends tasks to three Workers; each Worker reads its HDFS block, processes and caches its partition of messages, and returns results. For the second query, the Workers process straight from their caches.]

Cache your data ➔ Faster Results
Full-text search of Wikipedia:
• 60GB on 20 EC2 machines
• 0.5 sec from cache vs. 20 s on disk
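The effect of cache() on the second query can be mimicked locally — a hypothetical `CachedDataset` runs the expensive scan once and serves later actions from memory (illustrative only):

```python
class CachedDataset:
    """Toy cache(): run the expensive scan once, reuse the result after."""
    def __init__(self, scan_fn):
        self.scan_fn = scan_fn
        self.cached = None
        self.scans = 0   # counts how often the slow source scan actually ran

    def _data(self):
        if self.cached is None:
            self.scans += 1
            self.cached = self.scan_fn()
        return self.cached

    def count(self, pred):
        return sum(1 for x in self._data() if pred(x))

messages = CachedDataset(lambda: ["mysql down", "php warning", "mysql slow"])
print(messages.count(lambda s: "mysql" in s))  # 2  (scans the source)
print(messages.count(lambda s: "php" in s))    # 1  (served from cache)
print(messages.scans)                          # 1
```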
Cassandra + Spark: A Great Combination
Both are Easy to Use
Spark Can Help You Bridge Your Hadoop and Cassandra Systems
Use Spark Libraries and Caching on top of Cassandra-stored Data
Combine Spark Streaming with Cassandra Storage
Datastax spark-cassandra-connector: https://github.com/datastax/spark-cassandra-connector
Schema RDDs (Spark SQL)
• Built-in Mechanism for recognizing Structured data in Spark
• Allow for systems to apply several data access and relational optimizations (e.g. predicate push-down, partition pruning, broadcast joins)
• Columnar in-memory representation when cached
• Native Support for structured formats like Parquet and JSON
• Great Compatibility with the Rest of the Stack (python, libraries, etc.)
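Predicate push-down and partition pruning can be illustrated with per-chunk min/max statistics of the kind columnar formats keep — chunks that provably cannot match are skipped without being read (a simplified sketch, not Spark SQL's actual code):

```python
# Each chunk carries min/max statistics, as columnar formats like Parquet do
chunks = [
    {"min": 0,  "max": 9,  "values": [1, 5, 9]},
    {"min": 10, "max": 19, "values": [10, 15]},
    {"min": 20, "max": 29, "values": [25]},
]

def scan_greater_than(chunks, threshold):
    """Push the predicate into the scan: prune chunks via statistics."""
    pruned, out = 0, []
    for c in chunks:
        if c["max"] <= threshold:   # statistics prove no row can match
            pruned += 1
            continue                # chunk skipped without reading its values
        out.extend(v for v in c["values"] if v > threshold)
    return out, pruned

rows, pruned = scan_greater_than(chunks, 12)
print(rows)    # [15, 25]
print(pruned)  # 1
```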