Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms


Description

Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high performance big data applications today.

Transcript of Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms

Apache Spark: Easy and Fast Big Data Analytics

Pat McDonough

Founded by the creators of Apache Spark out of UC Berkeley’s AMPLab

Fully committed to 100% open source Apache Spark

Support and Grow the Spark Community and Ecosystem

Building Databricks Cloud

Databricks & Datastax

Apache Spark is packaged as part of Datastax Enterprise Analytics 4.5

Databricks & Datastax Have Partnered for Apache Spark Engineering and Support

Big Data Analytics: Where We’ve Been

• 2003 & 2004 - Google GFS & MapReduce Papers are Precursors to Hadoop

• 2006 & 2007 - Google BigTable and Amazon Dynamo Papers are Precursors to Cassandra, HBase, and Others

Big Data Analytics: A Zoo of Innovation

What's Working?

Many Excellent Innovations Have Come From Big Data Analytics:

• Distributed, Data-Parallel Computing is disruptive ... because we needed it

• We Now Have Massive throughput… Solved the ETL Problem

• The Data Hub/Lake Is Possible

What Needs to Improve? Go Beyond MapReduce

MapReduce is a Very Powerful and Flexible Engine

Processing Throughput Previously Unobtainable on Commodity Equipment

But MapReduce Isn’t Enough:

• Essentially Batch-only

• Inefficient with respect to memory use, latency

• Too Hard to Program

What Needs to Improve? Go Beyond (S)QL

SQL Support Has Been A Welcome Interface on Many Platforms

And in many cases, a faster alternative

But SQL Is Often Not Enough:

• Sometimes you want to write real programs (loops, variables, functions, existing libraries) but don’t want to build UDFs.

• Machine Learning (see above, plus iterative)

• Multi-step pipelines

• Often an Additional System

What Needs to Improve? Ease of Use

Big Data Distributions Provide a Number of Useful Tools and Systems

Choices are Good to Have

But This Is Often Unsatisfactory:

• Each new system has its own configs, APIs, and management; coordinating multiple systems is challenging

• A typical solution requires stringing together disparate systems - we need unification

• Developers want the full power of their programming language

What Needs to Improve? Latency

Big Data systems are throughput-oriented

Some new SQL Systems provide interactivity

But We Need More:

• Interactivity beyond SQL interfaces

• Repeated access of the same datasets (i.e. caching)

Can Spark Solve These Problems?

Apache Spark

Originally developed in 2009 in UC Berkeley’s AMPLab

Fully open sourced in 2010 – now at Apache Software Foundation

http://spark.apache.org

Project Activity

                          June 2013    June 2014
total contributors               68          255
companies contributing           17           50
total lines of code          63,000      175,000

Compared to Other Projects

[Chart: commits and lines of code changed over the past 6 months, Spark vs. other projects]

Spark is now the most active project in the Hadoop ecosystem

Spark on Github

So active on Github, sometimes we break it

Over 1200 Forks (can’t display Network Graphs)

~80 commits to master each week

So many PRs, we built our own PR UI

Apache Spark - Easy to Use And Very Fast

Fast and general cluster computing system interoperable with Big Data Systems Like Hadoop and Cassandra

Improved Efficiency:
• In-memory computing primitives
• General computation graphs

Improved Usability:
• Rich APIs
• Interactive shell


Up to 100× faster (2-10× on disk)

2-5× less code

Apache Spark - A Robust SDK for Big Data Applications

[Diagram: SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core]

Unified System With Libraries to Build a Complete Solution

Full-featured Programming Environment in Scala, Java, Python…

Very developer-friendly, Functional API for working with Data

Runtimes available on several platforms

Spark Is A Part Of Most Big Data Platforms

• All Major Hadoop Distributions Include Spark

• Spark Is Also Integrated With Non-Hadoop Big Data Platforms like DSE

• Spark Applications Can Be Written Once and Deployed Anywhere

[Diagram: SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core]

Deploy Spark Apps Anywhere

Easy: Get Started Immediately

Interactive Shell, Multi-language Support

Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) { return s.contains("ERROR"); }
}).count();

Easy: Clean API

Resilient Distributed Datasets

• Collections of objects spread across a cluster, stored in RAM or on Disk

• Built through parallel transformations

• Automatically rebuilt on failure

Operations

• Transformations (e.g. map, filter, groupBy)

• Actions (e.g. count, collect, save)

Write programs in terms of transformations on distributed datasets
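A minimal sketch of this model in the Spark shell (Scala); the data and the predicates are illustrative, not from the talk:

// Build an RDD from a local collection (could also come from textFile, etc.)
val nums = sc.parallelize(1 to 1000000)

// Transformations are lazy: nothing runs yet
val evens   = nums.filter(_ % 2 == 0)
val squared = evens.map(n => n.toLong * n)

// Actions trigger the distributed computation and return a value to the driver
val total  = squared.count()
val first5 = squared.take(5)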

Easy: Expressive API

map filter groupBy sort union join leftOuterJoin rightOuterJoin
reduce count fold reduceByKey groupByKey cogroup cross zip
sample take first partitionBy mapWith pipe save ...
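A short sketch of a few of the key/value operations named above (reduceByKey and join, plus mapValues), as run in the Spark shell where the pair-RDD implicits are already imported; the sample data is made up for illustration:

val sales  = sc.parallelize(Seq(("apple", 3), ("pear", 2), ("apple", 5)))
val prices = sc.parallelize(Seq(("apple", 0.5), ("pear", 0.8)))

// reduceByKey: combine values per key, pre-aggregating before the shuffle
val unitsPerItem = sales.reduceByKey(_ + _)        // ("apple", 8), ("pear", 2)

// join: pair up values that share a key
val revenue = unitsPerItem.join(prices)
                          .mapValues { case (units, price) => units * price }

revenue.collect().foreach(println)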

Easy: Example – Word Count

Hadoop MapReduce:

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Spark:

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Easy: Works Well With Hadoop

Data Compatibility

• Access your existing Hadoop Data

• Use the same data formats

• Adheres to data locality for efficient processing


Deployment Models

• “Standalone” deployment

• YARN-based deployment

• Mesos-based deployment

• Deploy on existing Hadoop cluster or side-by-side
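A rough sketch of the "write once, deploy anywhere" point, assuming a standard Spark 1.x setup: only the master URL (typically supplied at launch time) changes between standalone, YARN, and Mesos, while the application code and HDFS access stay the same. The class name and paths below are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object LineCount {
  def main(args: Array[String]): Unit = {
    // The master is usually passed in at submit time rather than hard-coded:
    // "local[*]", "spark://host:7077" (standalone), "yarn-cluster", "mesos://host:5050"
    val conf = new SparkConf().setAppName("LineCount")
    val sc = new SparkContext(conf)

    // Reads existing Hadoop data, using the same formats and data locality
    val lines = sc.textFile("hdfs:///data/logs/*.log")
    println(lines.count())

    sc.stop()
  }
}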

Example: Logistic Regression

data = spark.textFile(...).map(readPoint).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data.map(
        lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x
    ).reduce(lambda x, y: x + y)
    w -= gradient

print "Final w: %s" % w

Fast: Using RAM, Operator Graphs

In-memory Caching

• Data Partitions read from RAM instead of disk

Operator Graphs

• Scheduling Optimizations

• Fault Tolerance

[Diagram: RDD operator graph with map, join, filter, and groupBy operations split into Stages 1-3; the legend marks RDDs and cached partitions, with cached partitions read from RAM]
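A small sketch of the caching primitive described above; the dataset, filter, and storage level are illustrative:

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///data/logs/*.log")

// cache() keeps partitions in RAM once the first action has computed them
val errors = logs.filter(_.contains("ERROR")).cache()

// Equivalent, with an explicit storage level (spill to disk if RAM is short)
// errors.persist(StorageLevel.MEMORY_AND_DISK)

errors.count()   // first action: reads from HDFS and populates the cache
errors.count()   // later actions: read the cached partitions from RAM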

Fast: Logistic Regression Performance

[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30), Hadoop vs. Spark. Hadoop: 110 s per iteration. Spark: 80 s for the first iteration, 1 s for further iterations.]

Fast: Scales Down Seamlessly

[Chart: execution time (s) vs. % of working set in cache: 68.8 s with cache disabled, 58.1 s at 25%, 40.7 s at 50%, 29.7 s at 75%, 11.5 s fully cached]

Easy: Fault Recovery

RDDs track lineage information that can be used to efficiently recompute lost data

msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])

[Diagram: HDFS File -> filter(func = startswith(...)) -> Filtered RDD -> map(func = split(...)) -> Mapped RDD]
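One way to inspect the lineage an RDD carries is RDD.toDebugString, which prints the chain of parent RDDs Spark can use to recompute lost partitions. A minimal sketch (Scala, with an assumed log path) that rebuilds the pipeline above:

val textFile = sc.textFile("hdfs:///data/logs/app.log")
val msgs = textFile.filter(_.startsWith("ERROR"))
                   .map(_.split("\t")(2))

// Prints the lineage: mapped RDD <- filtered RDD <- ... <- HadoopRDD
println(msgs.toDebugString)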

How Spark Works

Working With RDDs

textFile = sc.textFile("SomeFile.txt")

RDD → Transformations → RDD → Action → Value

linesWithSpark = textFile.filter(lambda line: "Spark" in line)

linesWithSpark.count()   # 74
linesWithSpark.first()   # Apache Spark

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

[Diagram: a Driver and three Workers, each Worker holding one HDFS block of the log. On the first action (the "mysql" count) the Driver ships tasks to the Workers; each Worker reads its HDFS block, processes and caches its partition of messages, and returns results to the Driver. The second query (the "php" count) sends new tasks, but the Workers now process directly from their in-memory caches.]

Cache your data ➔ Faster Results

Full-text search of Wikipedia:
• 60GB on 20 EC2 machines
• 0.5 sec from cache vs. 20s for on-disk

Cassandra + Spark: A Great Combination

Both are Easy to Use

Spark Can Help You Bridge Your Hadoop and Cassandra Systems

Use Spark Libraries and Caching on top of Cassandra-stored Data

Combine Spark Streaming with Cassandra Storage

Datastax spark-cassandra-connector: https://github.com/datastax/spark-cassandra-connector
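A minimal sketch of reading and writing Cassandra tables through the spark-cassandra-connector; the keyspace, table, and column names are made up, and exact APIs may vary by connector version:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val conf = new SparkConf()
  .setAppName("CassandraSparkExample")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Read a Cassandra table as an RDD of CassandraRow
val users = sc.cassandraTable("my_keyspace", "users")
println(users.count())

// Write an RDD of tuples back to a Cassandra table
val scores = sc.parallelize(Seq(("alice", 10), ("bob", 7)))
scores.saveToCassandra("my_keyspace", "scores", SomeColumns("name", "score"))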

Schema RDDs (Spark SQL)

• Built-in Mechanism for recognizing Structured data in Spark

• Allow for systems to apply several data access and relational optimizations (e.g. predicate push-down, partition pruning, broadcast joins)

• Columnar in-memory representation when cached

• Native Support for structured formats like parquet, JSON

• Great Compatibility with the Rest of the Stack (python, libraries, etc.)
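A hedged sketch of the Spark SQL API from roughly this era (Spark 1.1): load a structured file into a SchemaRDD, register it as a table, cache it in columnar form, and query it with SQL. The path, table name, and query are illustrative:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Load a structured format (Parquet) directly into a SchemaRDD
val logs = sqlContext.parquetFile("hdfs:///data/logs.parquet")
logs.registerTempTable("logs")

// Cached tables use the columnar in-memory representation
sqlContext.cacheTable("logs")

val errors = sqlContext.sql("SELECT message FROM logs WHERE level = 'ERROR'")
errors.collect().foreach(println)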

Thank You!

Visit http://databricks.com: Blogs, Tutorials and more

Questions?