Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15 July 2014

Boston Apache Spark User Group (the Spahk group) Microsoft NERD Center - Horace Mann Tuesday, 15 July 2014

Description

An introduction to Apache Spark presented at the Boston Apache Spark User Group in July 2014

Transcript of Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15 July 2014

Page 1

Boston Apache Spark User Group

(the Spahk group)

Microsoft NERD Center - Horace Mann

Tuesday, 15 July 2014

Page 2

Intro to Apache Spark

Matthew Farrellee, @spinningmatt

Updated: July 2014

Page 3

Background - MapReduce / Hadoop

● Map & reduce around for 5+ decades (McCarthy, 1960)

● Dean and Ghemawat demonstrate map and reduce for distributed data processing (Google, 2004)

● MapReduce paper timed well with commodity hardware capabilities of the early 2000s

● Open source implementation in 2006
● Years of innovation improving, simplifying, expanding

Page 4

MapReduce / Hadoop difficulties

● Hardware evolved
  ○ Networks became fast
  ○ Memory became cheap
● Programming model proved non-trivial
  ○ Gave birth to multiple attempts to simplify, e.g. Pig, Hive, ...
● Primarily batch execution mode
  ○ Begat specialized (non-batch) modes, e.g. Storm, Drill, Giraph, ...

Page 5

Some history - Spark

● Started in UC Berkeley AMPLab by Matei Zaharia, 2009
  ○ AMP = Algorithms Machines People
  ○ AMPLab is integrating Algorithms, Machines, and People to make sense of Big Data
● Open sourced, 2010
● Donated to Apache Software Foundation, 2013
● Graduated to top level project, 2014
● 1.0 release, May 2014

Page 6

What is Apache Spark?

An open source, efficient and productive cluster computing system that is interoperable with Hadoop

Page 7

Open source

● Top level Apache project
● http://www.ohloh.net/p/apache-spark
  ○ In a Nutshell, Apache Spark…
  ○ has had 7,366 commits made by 299 contributors representing 117,823 lines of code
  ○ is mostly written in Scala with a well-commented source code
  ○ has a codebase with a long source history maintained by a very large development team with increasing Y-O-Y commits
  ○ took an estimated 30 years of effort (COCOMO model) starting with its first commit in March, 2010 ending with its most recent commit 5 days ago

Page 8

Efficient

● In-memory primitives
  ○ Use cluster memory and spill to disk only when necessary
● High performance
  ○ https://amplab.cs.berkeley.edu/benchmark/
● General compute graphs, DAGs
  ○ Not just: Load -> Map -> Reduce -> Store -> Load -> Map -> Reduce -> Store
  ○ Rich and pipelined: Load -> Map -> Union -> Reduce -> Filter -> Group -> Sample -> Store (see the sketch below)
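A hedged sketch of such a pipelined graph in PySpark (the data and chain of operations are illustrative, not from the talk):

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("pipeline sketch"))

# A richer DAG than load -> map -> reduce: two sources, a union, a filter, a sample
evens = sc.parallelize(range(0, 100)).map(lambda x: x * 2)
more = sc.parallelize(range(200, 300))
pipeline = evens.union(more) \
                .filter(lambda x: x % 3 == 0) \
                .sample(False, 0.5, 42)

# Nothing has executed yet; this single action runs the whole pipelined graph
print(pipeline.count())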

Page 9

Interoperable

● Read and write data from HDFS (or any storage system with an HDFS-like API)

● Read and write Hadoop file formats

● Run on YARN

● Interact with Hive, HBase, etc.

Page 10

Productive

● Unified data model, the RDD
● Multiple execution modes
  ○ Batch, interactive, streaming
● Multiple languages
  ○ Scala, Java, Python, R, SQL
● Rich standard library
  ○ Machine learning, streaming, graph processing, ETL
● Consistent API across languages
● Significant code reduction compared to MapReduce

Page 11

Consistent API, less code...

MapReduce (Java) -

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Spark (Scala) -

import org.apache.spark._

val sc = new SparkContext(new SparkConf().setAppName("word count"))
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Spark (Python) -

from operator import add
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("word count"))
file = sc.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(add)
counts.saveAsTextFile("hdfs://...")

Page 12

Spark workflow

[Diagram: Spark workflow - Load creates an RDD, Transforms produce new RDDs, an Action produces a Value, and Save writes results out; the example data is "mattf"]

Page 13

The RDD

The resilient distributed dataset

A lazily evaluated, fault-tolerant collection of elements that can be operated on in parallel

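A minimal PySpark illustration of that laziness (assuming the SparkContext sc from earlier; data is made up):

words = sc.parallelize(["spark", "hadoop", "spark"])
pairs = words.map(lambda w: (w, 1))             # transformation: nothing runs yet
counts = pairs.reduceByKey(lambda a, b: a + b)  # still lazy; only the DAG grows
print(counts.collect())                         # action: triggers the computation
# e.g. [('spark', 2), ('hadoop', 1)]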

Page 14

RDDs technically

1. a set of partitions ("splits" in Hadoop terms)
2. list of dependencies on parent RDDs
3. function to compute a partition given its parents
4. optional partitioner (hash, range)
5. optional preferred location(s) for each partition

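Some of these pieces can be glimpsed from the API; a small sketch (assuming the existing SparkContext sc; getNumPartitions and toDebugString as in recent PySpark releases):

rdd = sc.parallelize(range(1000), 8)   # ask for 8 partitions ("splits")
child = rdd.map(lambda x: x * 2)       # child depends on rdd, its parent

print(child.getNumPartitions())        # 8 - partitioning carries through map
print(child.toDebugString())           # prints the lineage back to the parent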

Page 15

Load

Create an RDD.

● parallelize - convert a collection
● textFile - load a text file
● wholeTextFiles - load a dir of text files
● sequenceFile / hadoopFile - load using Hadoop file formats
● More, http://spark.apache.org/docs/latest/programming-guide.html#external-datasets
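A few of these in PySpark (paths are placeholders, as on the slide):

nums = sc.parallelize([1, 2, 3, 4])       # from an in-memory collection
lines = sc.textFile("hdfs://...")         # one element per line of text
files = sc.wholeTextFiles("hdfs://...")   # (filename, contents) pairs for a directory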

Page 16

Transform

Lazy operations. Build the compute DAG. Don’t trigger computation.

● map(func) - elements passed through func
● flatMap(func) - func can return >=0 elements
● filter(func) - subset of elements
● sample(..., fraction, …) - select fraction
● union(other) - union of two RDDs
● distinct - new RDD w/ distinct elements
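A hedged sketch exercising a few of these (made-up data):

a = sc.parallelize([1, 2, 2, 3, 4, 5])
b = sc.parallelize([4, 5, 6])

doubled = a.map(lambda x: x * 2)         # 2, 4, 4, 6, 8, 10
evens = a.filter(lambda x: x % 2 == 0)   # 2, 2, 4
both = a.union(b).distinct()             # 1 through 6, deduplicated

# All of the above are lazy; an action such as collect() runs them
print(sorted(both.collect()))            # [1, 2, 3, 4, 5, 6]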

Page 17

Transform (cont)

● groupByKey - (K, V) -> (K, Seq[V])
● reduceByKey(func) - (K, V) -> (K, func(V...))
● sortByKey - order (K, V) by K
● join(other) - (K, V) + (K, W) -> (K, (V, W))
● cogroup/groupWith(other) - (K, V) + (K, W) -> (K, (Seq[V], Seq[W]))
● cartesian(other) - cartesian product (all pairs)
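And a sketch of the (K, V) transformations (again, illustrative data):

sales = sc.parallelize([("apples", 3), ("pears", 2), ("apples", 4)])
prices = sc.parallelize([("apples", 0.5), ("pears", 0.75)])

totals = sales.reduceByKey(lambda a, b: a + b)   # ("apples", 7), ("pears", 2)
joined = totals.join(prices)                     # ("apples", (7, 0.5)), ...
print(joined.sortByKey().collect())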

Page 19

Action

Active operations. Trigger execution of the DAG. Result in a value.

● reduce(func) - reduce elements w/ func
● collect - convert to native collection
● count - count elements
● foreach(func) - apply func to elements
● take(n) - return some elements
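For example (illustrative):

rdd = sc.parallelize(range(10))

print(rdd.count())                     # 10
print(rdd.take(3))                     # [0, 1, 2]
print(rdd.reduce(lambda a, b: a + b))  # 45
print(rdd.collect())                   # the full list, back in the driver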

Page 21

Save

An action that results in data stored to a file system.

● saveAsTextFile
● saveAsSequenceFile
● saveAsObjectFile/saveAsPickleFile
● More, http://spark.apache.org/docs/latest/programming-guide.html#actions

Page 22

Spark workflow


file = sc.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(add)
counts.saveAsTextFile("hdfs://...")

Page 23

Multiple modes, rich standard library

[Diagram: the Apache Spark stack - SQL, Streaming, MLlib, GraphX, ... running on top of Spark Core]

Page 24

Spark SQL

● Components
  ○ Catalyst - generic optimization for relational algebra
  ○ Core - RDD execution; formats: Parquet, JSON
  ○ Hive support - run HiveQL and use Hive warehouse
● SchemaRDDs and SQL

[Diagram: a plain RDD holds opaque User objects, while a SchemaRDD holds rows with Name, Age, and Height columns]

sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
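A hedged end-to-end sketch of that query using the 1.0-era Python API (data is made up; some of these calls were renamed in later releases):

from pyspark.sql import SQLContext

sqlCtx = SQLContext(sc)

# An RDD of dicts; inferSchema turns it into a SchemaRDD (Spark 1.0 API)
people = sc.parallelize([{"name": "Alice", "age": 15},
                         {"name": "Bob", "age": 30}])
schemaPeople = sqlCtx.inferSchema(people)
schemaPeople.registerAsTable("people")

teens = sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
print(teens.collect())   # rows whose name is 'Alice'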

Page 25

Spark Streaming

● Run a streaming computation as a series of time-bound, deterministic batch jobs
● The time bound is used to break the stream into RDDs

[Diagram: the incoming stream is broken into RDDs, each X seconds wide; Spark Streaming hands the RDD batches to Spark, which produces results]
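As a sketch of the model in Python (note the Python streaming API actually landed after this talk, in Spark 1.2; host and port are illustrative):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)                    # 10-second-wide batches
lines = ssc.socketTextStream("localhost", 9999)   # each batch arrives as an RDD

counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()                                   # print each batch's counts

ssc.start()
ssc.awaitTermination()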

Page 26

MLlib

● Machine learning algorithms over RDDs

● Classification - logistic regression, linear support vector machines, naive Bayes, decision trees

● Regression - linear regression, regression trees
● Collaborative filtering - alternating least squares
● Clustering - K-Means
● Optimization - stochastic gradient descent, limited-memory BFGS
● Dimensionality reduction - singular value decomposition, principal component analysis
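For instance, clustering toy data with K-Means (assuming the existing SparkContext sc):

from pyspark.mllib.clustering import KMeans

# Two obvious clusters of 2-D points
points = sc.parallelize([[0.0, 0.0], [1.0, 1.0],
                         [9.0, 8.0], [8.0, 9.0]])
model = KMeans.train(points, k=2, maxIterations=10)

print(model.clusterCenters)       # centers near (0.5, 0.5) and (8.5, 8.5)
print(model.predict([0.5, 0.5]))  # index of the cluster containing that point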

Page 27

Deploying Spark

Source: http://spark.apache.org/docs/latest/cluster-overview.html

Page 28

Deploying Spark

● Driver program - shell or standalone program that creates a SparkContext and works with RDDs
● Cluster Manager - standalone, Mesos or YARN
  ○ Standalone - the default; simple setup, master + worker processes on nodes
  ○ Mesos - a general purpose manager that runs Hadoop and other services; two modes of operation, fine & coarse
  ○ YARN - Hadoop 2’s resource manager
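The choice of cluster manager shows up in the master URL the driver uses; a minimal sketch (host and port are illustrative):

from pyspark import SparkConf, SparkContext

# Standalone cluster: point the driver at the master process
conf = SparkConf().setAppName("my app").setMaster("spark://master-host:7077")
# For local testing instead: .setMaster("local[*]")
sc = SparkContext(conf=conf)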

Page 29

Highlights from Spark Summit 2014

http://spark-summit.org/east/2015

New York in early 2015

Page 30

Thanks to...

@pacoid, @rxin and @michaelarmbrust for letting me crib slides for this introduction