Data processing in Apache Spark
Pelle Jakovits
5 October, 2015, Tartu
Outline
• Introduction to Spark
• Resilient Distributed Datasets (RDD)
– Data operations
– RDD transformations
– Examples
• Fault tolerance
• Frameworks powered by Spark
Pelle Jakovits 2/34
Spark
• Directed acyclic graph (DAG) task execution engine
• Supports cyclic data flow and in-memory computing
• Spark works with Scala, Java, Python and R
• Integrated with Hadoop YARN and HDFS
• Extended with tools for SQL-like queries, stream processing and graph processing
• Uses Resilient Distributed Datasets (RDDs) to abstract the data that is to be processed
Hadoop YARN
Performance vs Hadoop
[Chart] Time per iteration (s) – Logistic Regression: Hadoop 110 s, Spark 0.96 s; K-Means Clustering: Hadoop 155 s, Spark 4.1 s
Source: Introduction to Spark – Patrick Wendell, Databricks
Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on Disk
• Built through parallel transformations
• Automatically rebuilt on failure
Working in Java
• Tuples

Tuple2<String, Integer> pair = new Tuple2<>(a, b);
pair._1 // => a
pair._2 // => b

• Functions
– In Java 8 you can use lambda functions:

JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);

– But in older Java you have to use the predefined function interfaces:
• Function, Function2, Function3
• FlatMapFunction
• PairFunction
Java Spark Function types
class GetLength implements Function<String, Integer> {
public Integer call(String s) {
return s.length();
}
}
class Sum implements Function2<Integer, Integer, Integer> {
public Integer call(Integer a, Integer b) {
return a + b;
}
}
Java Example - MapReduce
JavaRDD<Integer> dataSet = jsc.parallelize(l, slices);

int count = dataSet.map(new Function<Integer, Integer>() {
    public Integer call(Integer integer) {
        double x = Math.random() * 2 - 1;
        double y = Math.random() * 2 - 1;
        return (x * x + y * y < 1) ? 1 : 0;
    }
}).reduce(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer integer, Integer integer2) {
        return integer + integer2;
    }
});
Python example
• Word count in Spark's Python API
file = spark.textFile("hdfs://...")
(file.flatMap(lambda line: line.split())
     .map(lambda word: (word, 1))
     .reduceByKey(lambda a, b: a + b))
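For readers without a Spark setup, the semantics of this flatMap/map/reduceByKey pipeline can be sketched in plain Python. The helper names mirror the RDD API, but this is only an illustration of the data flow, not PySpark itself:

```python
from collections import defaultdict
from functools import reduce

def flat_map(f, xs):
    # flatMap: apply f to each element and concatenate the results
    return [y for x in xs for y in f(x)]

def reduce_by_key(f, pairs):
    # reduceByKey: merge all values that share a key using f
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return {k: reduce(f, vs) for k, vs in groups.items()}

lines = ["to be or not", "to be"]          # stand-in for the HDFS file
words = flat_map(lambda line: line.split(), lines)
pairs = [(w, 1) for w in words]
counts = reduce_by_key(lambda a, b: a + b, pairs)
# counts == {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```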
Persisting data
• Spark is lazy – transformations are only computed when an action needs their result
• To force Spark to keep any intermediate data in memory, we can use:
– lineLengths.persist(StorageLevel.MEMORY_ONLY());
– which causes the lineLengths RDD to be kept in memory after the first time it is computed
• Should be used when we want to process the same RDD multiple times
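The interplay of laziness and persist() can be sketched with a toy class. This is an illustration only; real RDDs are partitioned, distributed, and can spill to disk:

```python
class LazyRDD:
    """Toy sketch: nothing is computed until an action (collect) runs,
    and persist() caches the result so later actions reuse it."""
    def __init__(self, compute):
        self.compute = compute      # zero-arg function producing the data
        self.cached = None
        self.persisted = False

    def map(self, f):
        # a transformation: just records what to do, computes nothing
        return LazyRDD(lambda: [f(x) for x in self.collect()])

    def persist(self):
        self.persisted = True
        return self

    def collect(self):              # an action: triggers the computation
        if self.cached is not None:
            return self.cached
        data = self.compute()
        if self.persisted:
            self.cached = data
        return data

calls = []
base = LazyRDD(lambda: calls.append("computed") or [1, 2, 3])
tens = base.map(lambda x: x * 10).persist()
# nothing computed yet: calls == []
a = tens.collect()   # computes the whole chain once
b = tens.collect()   # served from the cache, base is not recomputed
```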
Persistence levels
• DISK_ONLY
• MEMORY_ONLY
• MEMORY_AND_DISK
• MEMORY_ONLY_SER
– More space-efficient (stores serialized objects)
– Uses more CPU
• MEMORY_ONLY_2
– Replicates the data on 2 executors
RDD operations
• Actions
– Creating RDDs
– Storing RDDs
– Extracting data from an RDD on the fly
• Transformations
– Restructure or transform the data inside an RDD
Spark Actions
Loading Data
Local data

List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data, slices);

External data

JavaRDD<String> input = sc.textFile("file.txt");
sc.textFile("directory/*.txt");
sc.textFile("hdfs://xxx:9000/path/file");
Broadcast
• Broadcast a copy of the data to every node in the Spark cluster:
Broadcast<int[]> broadcastVar = sc.broadcast(new int[] {1, 2, 3});
int[] values = broadcastVar.value();
Storing data
• counts.saveAsTextFile("hdfs://...");
• counts.saveAsObjectFile("hdfs://...");
• DataCube.saveAsHadoopFile("testfile.seq", LongWritable.class, LongWritable.class, SequenceFileOutputFormat.class);
Other actions
• reduce() – we already saw it in the example
• collect() – retrieve the content of the RDD
• count() – count the number of elements in the RDD
• first() – take the first element of the RDD
• take(n) – take the first n elements of the RDD
• countByKey() – count the values for each unique key
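These list-like actions have direct plain-Python analogues, which makes their semantics easy to pin down (the sample data below is made up):

```python
from collections import Counter

rdd = [("a", 1), ("b", 2), ("a", 3)]     # stand-in for a pair RDD

count = len(rdd)                         # count()
first = rdd[0]                           # first()
take2 = rdd[:2]                          # take(2)
by_key = Counter(k for k, _ in rdd)      # countByKey(): values per unique key
```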
Spark RDD Transformations
Map
JavaPairRDD<String, Integer> ones = words.mapToPair(
    new PairFunction<String, String, Integer>() {
        public Tuple2<String, Integer> call(String s) {
            return new Tuple2<>(s, 1);
        }
    }
);
groupBy
JavaPairRDD<Integer, List<Tuple2<Integer, Float>>> grouped =
    values.groupBy(new Partitioner(splits), splits);

public class Partitioner implements Function<Tuple2<Integer, Float>, Integer> {
    private final Random r = new Random();
    private final int partitions;
    public Partitioner(int partitions) { this.partitions = partitions; }
    public Integer call(Tuple2<Integer, Float> t) {
        return r.nextInt(partitions);
    }
}
reduceByKey
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
    new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer i1, Integer i2) {
            return i1 + i2;
        }
    }
);
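The function passed to reduceByKey should be associative and commutative, because Spark first combines values inside each partition and then merges the partial results across partitions. A plain-Python sketch of that two-level reduction, over a hypothetical partition layout:

```python
def reduce_partition(f, pairs):
    # merge values per key within one partition
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return out

# Two hypothetical partitions of ("word", 1) pairs
p1 = [("spark", 1), ("rdd", 1), ("spark", 1)]
p2 = [("rdd", 1), ("spark", 1)]

# First reduce inside each partition, then merge the partial results
partials = [reduce_partition(lambda a, b: a + b, p) for p in (p1, p2)]
merged = reduce_partition(lambda a, b: a + b,
                          [(k, v) for part in partials for k, v in part.items()])
# merged == {"spark": 3, "rdd": 2}
```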
Filter

JavaRDD<String> logData = sc.textFile(logFile);

long ERRORS = logData.filter(new Function<String, Boolean>() {
    public Boolean call(String s) {
        return s.contains("ERROR");
    }
}).count();

long INFOS = logData.filter(new Function<String, Boolean>() {
    public Boolean call(String s) {
        return s.contains("INFO");
    }
}).count();
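The same filter-then-count pattern in plain Python, on made-up log lines:

```python
log = ["INFO start", "ERROR disk full", "INFO done", "ERROR net down"]

# filter(...).count() is just "keep matching lines, then count them"
errors = len([line for line in log if "ERROR" in line])
infos = len([line for line in log if "INFO" in line])
```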
Other transformations
• sample(withReplacement, fraction, seed)
• distinct([numTasks])
• union(otherDataset)
• flatMap(func)
• groupByKey()
• join(otherDataset, [numTasks]) – When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
• cogroup(otherDataset, [numTasks]) – When called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples.
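The difference between join and cogroup is easiest to see in a plain-Python sketch of their semantics (illustration only, not the Spark implementation):

```python
from collections import defaultdict

def join(left, right):
    # (K, V) join (K, W) -> (K, (V, W)) for every matching pair of values
    right_map = defaultdict(list)
    for k, w in right:
        right_map[k].append(w)
    return [(k, (v, w)) for k, v in left for w in right_map.get(k, [])]

def cogroup(left, right):
    # (K, V) cogroup (K, W) -> (K, ([all V], [all W])), keys from both sides
    keys = {k for k, _ in left} | {k for k, _ in right}
    return {k: ([v for kk, v in left if kk == k],
                [w for kk, w in right if kk == k]) for k in keys}

a = [("x", 1), ("y", 2)]
b = [("x", 10), ("x", 20)]
joined = join(a, b)      # [("x", (1, 10)), ("x", (1, 20))] -- "y" is dropped
grouped = cogroup(a, b)  # {"x": ([1], [10, 20]), "y": ([2], [])}
```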
Fault Tolerance
Lineage
• Lineage is the history of an RDD
• RDDs keep track, for each of their partitions, of:
– What functions were applied to produce it
– Which input data partitions were involved
• Lost RDD partitions are rebuilt according to their lineage, using the latest still-available partitions
• No performance cost if nothing fails (as opposed to checkpointing)
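The recovery idea can be sketched with a toy RDD that records its parent and transformation (its lineage) and recomputes a lost partition on demand; again an illustration, not Spark's actual implementation:

```python
class ToyRDD:
    """Toy sketch of lineage-based recovery: a derived RDD remembers its
    parent and the function applied to it, so any lost partition can be
    recomputed on demand."""
    def __init__(self, partitions=None, parent=None, fn=None):
        self.partitions = partitions   # base data as a list of lists, or None
        self.parent = parent           # lineage: the parent RDD
        self.fn = fn                   # lineage: the transformation applied
        self.materialized = {}         # partition index -> computed data

    def map(self, f):
        return ToyRDD(parent=self, fn=f)

    def get_partition(self, i):
        if i in self.materialized:
            return self.materialized[i]
        if self.partitions is not None:    # base RDD: read the input data
            data = self.partitions[i]
        else:                              # derived RDD: rebuild via lineage
            data = [self.fn(x) for x in self.parent.get_partition(i)]
        self.materialized[i] = data
        return data

base = ToyRDD(partitions=[[1, 2], [3, 4]])
doubled = base.map(lambda x: 2 * x)
doubled.get_partition(0)            # materializes partition 0
doubled.materialized.pop(1, None)   # "lose" partition 1
rebuilt = doubled.get_partition(1)  # recomputed from base via lineage
```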
Frameworks powered by Spark
Frameworks powered by Spark
• Spark SQL – Seamlessly mix SQL queries with Spark programs
– Similar to Pig and Hive
• MLlib - machine learning library
• GraphX - Spark's API for graphs and graph-parallel computation.
Advantages of Spark
• Much faster than solutions built on top of Hadoop when the data can fit into memory
– Except Impala, maybe
• Hard to keep track of how (well) the data is distributed
• More flexible fault tolerance
• Spark has a lot of extensions and is constantly updated
Disadvantages of Spark
• What if data does not fit into the memory?
• Saving as text files can be very slow
• Java Spark is not as convenient to use as Pig for prototyping, but you can
– Use python Spark instead
– Use Spark Dataframes
– Use Spark SQL
Conclusion
• RDDs offer a simple and efficient programming model for a broad range of applications
• Spark achieves fault tolerance by providing coarse-grained operations and tracking lineage
• Provides a definite speedup when the data fits into the collective memory
That's All
• This week's practice session
– Processing data with Spark
• Next week's lecture is about higher level Spark
– Scripting and Prototyping in Spark
• Spark SQL
• DataFrames
– Spark Streaming