Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Data Science Company
Boosting Big Data with Apache Spark
Mathias Lavaert, April 2015
About Infofarm
Data Science
Big Data
Identifying, extracting and using data of all types
and origins; exploring, correlating and using it in new
and innovative ways in order to extract meaning
and business value from it.
Java
PHP
E-Commerce
Mobile
Web Development
About me
Mathias Lavaert, Big Data Developer at InfoFarm since May 2014
Proud citizen of West-Flanders
Outdoor enthusiast
Agenda
• What is Apache Spark?
• An in-depth overview
– Spark Core and Resilient Distributed Datasets
– Unified access to structured data with Spark SQL
– Machine Learning with Spark MLlib
– Scalable streaming applications with Spark Streaming
• Q&A
• Wrap-up & lunch
What is Apache Spark?
“Apache Spark is a fast and general engine for big data
processing, with built-in modules for streaming, SQL,
machine learning and graph processing”
History
• Created by Matei Zaharia at UC Berkeley in 2009
• Based on 2007 Microsoft Dryad paper
• Donated in 2013 to Apache Software Foundation
• 465 contributors in 2014 making it the most active
Apache Project
• Currently supported by Databricks, a company founded
by the creators of Apache Spark
Target users
● Data Scientists
○ Data exploration and data modelling using interactive shells
○ Machine Learning
○ Ad-hoc analysis to answer business questions or discover new insights
● Engineers
○ Fault-tolerant production data applications
○ ‘Productizing’ the work of the data scientist
○ Integration with business applications
Where to situate Apache Spark?
Differences with MapReduce
• Faster by minimizing I/O and keeping data in memory as much as possible
• Unified libraries
• Huge community effort and a very fast development pace
• Ships with higher-level tools included
Daytona GraySort Contest
Differences with Hive, Pig, others...
• One integrated framework that suits a
wide range of problems
• No need for a workflow application like
Oozie
• Only 1 language/framework to learn
Explosion of Specialized Systems
Architecture
Advantages of unified libraries
Advancements in higher-level libraries are pushed down into core and
vice-versa
● Spark Core
○ Highly-optimized, low overhead, network-saturating shuffle
● Spark Streaming
○ Garbage collection, memory management, cleanup
improvements
● Spark GraphX
○ IndexedRDD for random access within a partition vs scanning
entire partition
● Spark MLlib
○ Statistics (Correlations, sampling, heuristics)
Supported languages
Difference between Java and Scala
Cluster Resource Managers
● Spark Standalone
○ Suitable for a lot of production workloads
○ Only suitable for Spark workloads
● YARN
○ Allows hierarchies of resources
○ Kerberos integration
○ Multiple workloads from different execution frameworks
■ Hive, Pig, Spark, MapReduce, Cascading, etc…
● Mesos
○ Similar to YARN, but allows elastic allocation
○ Coarse-grained
■ A single, long-running Mesos task runs Spark mini-tasks
○ Fine-grained
■ New Mesos task for each Spark task
■ Higher overhead, not good for long-running Spark jobs
(Streaming)
Storage Layers for Spark
Spark can create distributed datasets from:
● Any file stored in the Hadoop distributed filesystem (HDFS)
● Any storage system supported by the Hadoop APIs
○ Local filesystem
○ S3
○ Cassandra
○ Hive
○ HBase
Note that Apache Spark doesn’t require Hadoop, but it has support for
storage systems implementing the Hadoop APIs.
Short introduction to functional programming
What is functional programming?
A programming paradigm where the
basic unit of abstraction is the function
Basic concepts
● Higher-order functions
○ Functions that take other functions as arguments,
○ or return functions as their result
● Pure functions
○ Purely functional expressions have no side effects
● Recursion
○ Iteration in functional languages is usually
accomplished via recursion.
● Immutable data structures
Small example with a functional language: Scala
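A minimal sketch of the four concepts from the previous slide in plain Scala (all names here are illustrative, not from the original slide):

```scala
// Higher-order function: takes a function and returns a new function
def twice(f: Int => Int): Int => Int = x => f(f(x))
val addTen = twice(_ + 5)   // addTen(1) == 11

// Pure function using recursion instead of a mutable loop counter
def factorial(n: Int): Int =
  if (n <= 1) 1 else n * factorial(n - 1)
// factorial(5) == 120

// Immutable data: map returns a new list, the original is untouched
val xs = List(1, 2, 3)
val doubled = xs.map(_ * 2) // List(2, 4, 6); xs is still List(1, 2, 3)
```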
Introduction to Spark concepts
Resilient Distributed Datasets (RDDs)
● Core Spark abstraction
● Immutable distributed collection of objects
● Split into multiple partitions
● May be computed on different nodes of the cluster
● Can contain any type of Scala, Java or Python object
including user-defined classes
“Distributed Scala collections”
Driver and context
● Driver
○ Shell
○ Standalone program
● SparkContext represents a connection to a computing cluster
RDD Operations
● Transformations
○ map
○ filter
○ flatMap
○ sample
○ groupByKey
○ reduceByKey
○ union
○ join
○ sort
● Actions
○ count
○ collect
○ reduce
○ lookup
○ save
● Transformations are lazy
● Actions force the computation of transformations
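RDDs deliberately mirror the Scala collections API, so the lazy-transformation / eager-action split can be sketched locally with a collection view; this is only an analogy, a real RDD would be built from a SparkContext with `sc.parallelize`:

```scala
// Transformations (map, filter) on a view are lazy, like on an RDD:
// nothing is computed until an action forces the pipeline.
val data = (1 to 10).toList
val pipeline = data.view.map(_ * 2).filter(_ > 10) // no work done yet

// Actions (sum, toList ~ reduce, collect) trigger the computation.
val total = pipeline.sum        // 12 + 14 + 16 + 18 + 20 == 80
val collected = pipeline.toList // List(12, 14, 16, 18, 20)
```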
Narrow vs wide dependencies
Demo using only core operations
Specialized operations for specific types of RDDs
Specialized operations for Key/Value pairs
● reduceByKey
● groupByKey
● combineByKey
● mapValues
● flatMapValues
● keys
● sortByKey
● subtractByKey
● join
● rightOuterJoin
● leftOuterJoin
● cogroup
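As a local analogy (plain Scala pairs rather than a real pair RDD), `reduceByKey` merges all values that share a key with a given function; a stand-in using standard collections:

```scala
// Local stand-in for reduceByKey(_ + _) on an RDD of (word, count) pairs.
val pairs = List(("a", 1), ("b", 2), ("a", 3), ("b", 4))
val reduced = pairs
  .groupBy(_._1)                                 // group values per key
  .map { case (k, vs) => k -> vs.map(_._2).sum } // merge values with +
// Map("a" -> 4, "b" -> 6)
```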
Specialized operations for numeric RDDs
● count
● mean
● sum
● max
● min
● variance
● sampleVariance
● stdev
● sampleStDev
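These actions compute the textbook statistics; locally that looks like the following (assuming, as the `variance`/`sampleVariance` split suggests, that `variance` uses the population form dividing by n and `sampleVariance` divides by n − 1):

```scala
val xs = List(1.0, 2.0, 3.0, 4.0)
val n = xs.size
val mean = xs.sum / n                                 // 2.5
val sqDev = xs.map(x => math.pow(x - mean, 2)).sum    // sum of squared deviations
val variance = sqDev / n                              // 1.25 (population form)
val sampleVariance = sqDev / (n - 1)                  // divides by n - 1
val stdev = math.sqrt(variance)
```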
And many more...
● HadoopRDD
● FilteredRDD
● MappedRDD
● PairRDD
● ShuffledRDD
● UnionRDD
● DoubleRDD
● JdbcRDD
● JsonRDD
● SchemaRDD
● VertexRDD
● EdgeRDD
● CassandraRDD
● GeoRDD
● EsSpark (Elasticsearch)
Spark SQL
Spark SQL Overview
● Newest component of Spark
● Tightly integrated to work with structured data
○ Tables with rows and columns
● Transform RDDs using SQL
● Data source integration: Hive, Parquet, JSON and more…
● Optimizes execution plan
Differences with Spark Core
● Spark + RDDs
○ Functional transformations on collections of objects
● SQL + SchemaRDDs
○ Declarative transformations on collections of tuples
Getting started with Spark SQL
● Create an instance of SQLContext or HiveContext
○ Entry point for all SQL functionality
○ Wraps/extends the existing SparkContext (decorator pattern)
● If you’re using the shell, a SQLContext has been created for you

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sparkContext = new SparkContext("local[4]", "SQL")
val sqlContext = new SQLContext(sparkContext)
Language Integrated UDFs
● Ability to write custom SQL functions in any of the languages supported by Spark
● Another example of how Spark simplifies the big data stack
Parquet compatibility
Native support for reading data stored in Parquet:
● Columnar storage avoids reading unneeded data
● SchemaRDDs can be written to Parquet while preserving the schema
● Convert other slower formats like JSON to Parquet for repeated querying.
Demo: Spark SQL
Spark MLlib
Machine Learning Algorithms
● Supervised
○ Prediction: Train a model with existing data + label, predict
label for new data
■ Classification (categorical)
■ Regression (continuous numeric)
○ Recommendation: recommend to similar users
■ User -> user, item -> item, user -> item similarity
● Unsupervised
○ Clustering: Find natural clusters in data based on similarities
Algorithms provided by Spark
● Classification and regression
○ Linear models (SVMs, logistic regression, linear regression)
○ Naive Bayes
○ Decision trees
○ Ensembles of trees (Random Forests and Gradient-Boosted trees)
○ Isotonic regression
● Recommendations
○ Alternating Least Squares (ALS)
○ FP-growth
● Clustering
○ K-Means
○ Gaussian mixture
○ Power Iteration clustering
○ Latent Dirichlet allocation
○ Streaming k-means
● Dimensionality reduction
○ Singular value decomposition (SVD)
○ Principal component analysis (PCA)
Tools provided by Spark
● Tools for basic statistics including
○ Summary statistics
○ Correlations
○ Sampling
○ Hypothesis testing
○ Random data generation
● Tools for feature extraction and transformation
○ Extracting features out of text
○ Uniform Vector format to store features
● Tools to build Machine Learning Pipelines
using Spark SQL
Why choose MLlib?
● One of the best documented machine learning
libraries available for the JVM
● Simple API, constructs are the same for different
algorithms
● Well integrated with other Spark-components
Demo: Spark MLlib
Spark Streaming
Spark Streaming Overview
● Built around the concept of DStreams, or discretized streams
● Long-running Spark application
● Micro-batch architecture
● Supports Flume, Kafka, Twitter, Amazon Kinesis, Socket, File…
DStreams
● A sequence of RDDs
● Stateless transformations
● Stateful transformations
● Checkpointing
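Since a DStream is just a sequence of RDDs, the stateless/stateful split can be sketched with a sequence of local batches; the fold below plays the role a stateful transformation like `updateStateByKey` plays in a real streaming job (checkpointing is what lets Spark persist and recover that carried state):

```scala
// Each inner list stands in for one micro-batch (one RDD of the DStream).
val batches = List(List("a", "b"), List("a"), List("b", "b"))

// Stateless transformation: each batch is processed independently.
val countsPerBatch = batches.map(_.size) // List(2, 1, 3)

// Stateful transformation: running word counts carried across batches.
val runningCounts = batches.foldLeft(Map.empty[String, Int]) { (state, batch) =>
  batch.foldLeft(state)((m, w) => m.updated(w, m.getOrElse(w, 0) + 1))
}
// Map("a" -> 2, "b" -> 3)
```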
Spark Streaming Use Cases
● ETL and enrichment of streaming data on ingestion
● Lambda Architecture
● Operational dashboards
Demo: Spark Streaming
Spark on Amazon EC2
Apache Spark runs easily on Amazon EC2
Apache Spark comes with a script to launch Spark clusters on Amazon EC2, so there is no need to invest in a cluster of servers.
Furthermore, it supports multiple Amazon components:
● Spark can read files from Amazon S3
● Spark Streaming can easily be integrated with Amazon Kinesis
Conclusion
Why choose Apache Spark?
● Modern, integrated, full-stack Big Data framework
● Suitable for both batch and (near) real-time applications
● Well supported by a very large community
● The Big Data landscape seems to be shifting towards Apache Spark
Questions?