Harnessing Big Data with Spark
Lawrence Spracklen Alpine Data
2
Alpine Data
3
Map Reduce
• Allows the distribution of large data computations across a cluster
• Computations typically composed of a sequence of MR operations
[Diagram: Big Data → Map() → Reduce() → Output]
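The Map()/Reduce() flow on this slide can be illustrated with a single-process word count. This is only a sketch of the programming model: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are made up for illustration, and a real framework would run the phases on different nodes and write intermediate results to disk.

```python
from collections import defaultdict

def map_phase(record):
    """Map(): emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce(): combine all values that share a key."""
    return key, sum(values)

def run_job(records):
    """Run one map -> shuffle -> reduce operation over the input records."""
    intermediate = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())

counts = run_job(["big data", "Big clusters", "data data"])
print(counts)
```

A longer computation would chain several such jobs, with the output of one serving as the input of the next.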
4
MR Performance
• Multiple disk interactions required in EACH MR operation
[Diagram: Map Reduce]
5
Performance Hierarchy
[Figure: read bandwidths of 0.10 GB/s, 0.10 GB/s, 0.60 GB/s and 80 GB/s; 100X read bandwidth]
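To make the bandwidth gap concrete, the arithmetic below computes how long one full pass over a dataset would take at each of the four bandwidths listed. The 1 TB dataset size is an assumption chosen for illustration, and the tiers are left unnamed because the transcript does not preserve their labels.

```python
# Bandwidths in GB/s, taken from the slide; tier names are not preserved in the transcript.
BANDWIDTHS_GB_PER_S = [0.10, 0.10, 0.60, 80.0]

# Assumption for illustration only: a 1 TB (1000 GB) dataset.
DATASET_GB = 1000.0

def seconds_to_read(dataset_gb, bandwidth_gb_per_s):
    """Time for one sequential pass over the dataset at the given bandwidth."""
    return dataset_gb / bandwidth_gb_per_s

read_times = [seconds_to_read(DATASET_GB, bw) for bw in BANDWIDTHS_GB_PER_S]

# Ratio between the fastest tier and the next fastest one.
fastest, runner_up = sorted(BANDWIDTHS_GB_PER_S, reverse=True)[:2]
speedup = fastest / runner_up

for bw, t in zip(BANDWIDTHS_GB_PER_S, read_times):
    print(f"{bw:6.2f} GB/s -> {t:10.1f} s per pass")
print(f"fastest tier vs next fastest: {speedup:.0f}x")
```

At these rates a pass that takes hours on the slowest tier takes seconds on the fastest, which is why repeated disk interaction dominates MR run time.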
6
Optimizing MR
• Many companies have significant legacy MR code
  – Either direct MR or indirect usage via Pig
• A variety of techniques to accelerate MR
  – Apache Tez
  – Tachyon or Apache Ignite
  – SystemML
7
Spark
• Several significant advancements over MR
  – Generalizes two-stage MR into arbitrary DAGs
  – Enables in-memory dataset caching
  – Improved usability
• Fewer disk reads and writes deliver significant speedups
  – Especially for iterative algorithms like ML
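Why caching helps iterative algorithms can be shown with a toy simulation. The sketch below is not Spark code: `DiskDataset` and `CachedDataset` are hypothetical stand-ins that only count how often the underlying data is scanned, and `iterative_mean` is a deliberately simple algorithm that makes one full pass per iteration.

```python
class DiskDataset:
    """Stand-in for a dataset on disk; counts how often it is scanned."""
    def __init__(self, rows):
        self._rows = list(rows)
        self.disk_scans = 0

    def scan(self):
        self.disk_scans += 1
        return list(self._rows)

class CachedDataset:
    """Wraps a DiskDataset and keeps the rows in memory after the first scan."""
    def __init__(self, source):
        self._source = source
        self._cache = None

    def scan(self):
        if self._cache is None:
            self._cache = self._source.scan()
        return self._cache

def iterative_mean(dataset, iterations):
    """Toy iterative algorithm: step an estimate toward the data mean, one pass per iteration."""
    estimate = 0.0
    for _ in range(iterations):
        rows = dataset.scan()
        gradient = sum(estimate - x for x in rows) / len(rows)
        estimate -= 0.5 * gradient
    return estimate

# Without caching, every iteration goes back to disk.
uncached = DiskDataset([1.0, 2.0, 3.0, 4.0])
iterative_mean(uncached, iterations=10)

# With caching, only the first iteration touches disk.
source = DiskDataset([1.0, 2.0, 3.0, 4.0])
result = iterative_mean(CachedDataset(source), iterations=10)

print(uncached.disk_scans, source.disk_scans, round(result, 3))
```

The result is the same in both cases; only the number of disk scans changes, and it grows with the iteration count when nothing is cached.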
8
Perf comparisons
*http://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce
9
Spark Tuning
• Increased reliance on memory introduces greater requirement for tuning
• Need to understand memory requirements for caching
• Significant performance benefits associated with “getting it right”
• Auto-tuning is coming…
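A rough way to reason about memory requirements for caching is to compare the in-memory size of a dataset with the storage memory the executors can offer. The sketch below assumes Spark's unified memory model (a fixed reserve, then `spark.memory.fraction` and `spark.memory.storageFraction`); the default values, the 300 MB reserve and the `expansion` factor for in-memory object overhead are assumptions to verify against your Spark version and data, and the function names are illustrative.

```python
RESERVED_MB = 300  # assumption: fixed reserve subtracted from the executor heap

def storage_memory_mb(executor_heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Approximate storage memory per executor under the assumed unified memory model."""
    usable = (executor_heap_mb - RESERVED_MB) * memory_fraction
    return usable * storage_fraction

def fits_in_cache(dataset_mb, executors, executor_heap_mb, expansion=1.0):
    """Return (fits, needed_mb, available_mb) for caching the whole dataset.

    `expansion` is a hypothetical multiplier for how much larger the data is
    in memory than on disk; measure it rather than trusting a guess.
    """
    needed = dataset_mb * expansion
    available = executors * storage_memory_mb(executor_heap_mb)
    return needed <= available, needed, available

# Illustrative numbers: 10 GB on disk, 10 executors with 8 GB heaps.
tight = fits_in_cache(10_000, executors=10, executor_heap_mb=8192, expansion=3.0)
roomy = fits_in_cache(10_000, executors=10, executor_heap_mb=8192, expansion=2.0)
print(tight)
print(roomy)
```

Storage and execution memory can borrow from each other in the unified model, so the figure is a planning estimate rather than a hard limit.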
10
Optimization opportunities
• Spark delivers improved ML performance using reduced cluster resources
• Enables numerous opportunities
  – Reduced time to insights
  – Reduced cluster size
  – Eliminate subsampling
  – AutoML
11
AutoML
• Data sets increasingly large and complex
• Increasingly difficult to intuitively “know” the optimal
  – Feature engineering
  – Choice of algorithm
  – Parameterization of algorithm(s)
• Significant manual trial-and-error
• Cult of the algorithm
12
Feature Engineering
• Essential for model performance, efficacy, robustness and simplicity
  – Feature extraction
  – Feature selection
  – Feature construction
  – Feature elimination
• Domain/dataset knowledge is important, but basic automation is feasible
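One form of basic automation is rule-based feature elimination. The sketch below drops columns that are constant and columns that are almost perfectly correlated with a column already kept. The thresholds, the column names and the `eliminate_features` helper are all illustrative assumptions, not the approach used in the talk.

```python
from statistics import mean, pvariance

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric columns with non-zero variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def eliminate_features(columns, min_variance=1e-9, max_abs_corr=0.95):
    """Keep columns that vary and are not near-duplicates of an earlier kept column."""
    kept = []
    for name, values in columns.items():
        if pvariance(values) <= min_variance:
            continue  # constant column carries no signal
        if any(abs(pearson(values, columns[k])) > max_abs_corr for k in kept):
            continue  # redundant with a column we already kept
        kept.append(name)
    return kept

columns = {
    "age": [23, 35, 47, 51, 62],
    "age_months": [276, 420, 564, 612, 744],  # age * 12, so fully redundant
    "country_code": [1, 1, 1, 1, 1],          # constant
    "income": [40, 85, 60, 30, 75],
}
kept = eliminate_features(columns)
print(kept)
```

Rules like these handle the mechanical cases; deciding which constructed features are meaningful still benefits from domain knowledge.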
13
Algorithm selection
• Select dependent column
• Indicate classification or regression
• Press “go”
Algorithms run in parallel across the cluster
Minimally provides a good starting point
Significantly reduces “busy work”
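The "press go" workflow can be sketched as fitting several candidate algorithms concurrently and ranking them on held-out data. In this sketch a thread pool stands in for the cluster, and the three candidates (`fit_mean`, `fit_median`, `fit_linear`) are toy regressors chosen only so the example runs with the standard library.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, median

def fit_mean(xs, ys):
    m = mean(ys)
    return lambda x: m

def fit_median(xs, ys):
    m = median(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Ordinary least squares with a single feature."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: my + slope * (x - mx)

CANDIDATES = {"mean": fit_mean, "median": fit_median, "linear": fit_linear}

def holdout_mse(fit, train, test):
    """Fit on the training split and return mean squared error on the test split."""
    model = fit(*train)
    xs, ys = test
    return mean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])

def select_algorithm(train, test):
    """Evaluate every candidate concurrently and return the best name plus all scores."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(holdout_mse, fit, train, test)
                   for name, fit in CANDIDATES.items()}
        scores = {name: future.result() for name, future in futures.items()}
    return min(scores, key=scores.get), scores

train = ([1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 7.8, 10.1, 12.0])
test = ([7, 8], [14.1, 15.9])
best, scores = select_algorithm(train, test)
print(best, scores)
```

The winner is a starting point rather than a final answer; the top candidates would normally go on to parameter tuning.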
14
Hyperparameter optimization
• Are the default parameters optimal?
• How do I adjust them intelligently?
  – Number of trees? Depth of trees? Splitting criteria?
• Tedious trial and error
• Overfitting danger
• Intelligent automatic search
15
Algorithm tuning
• Gradient boosted tree parameterization, e.g.
  – # of trees
  – Maximum tree depth
  – Loss function
  – Minimum node split size
  – Bagging rate
  – Shrinkage
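An automatic search over the gradient boosted tree parameters listed here could start with plain random search, shown below as a baseline; it is simpler than the intelligent search the slides have in mind. The candidate values in `SEARCH_SPACE` are arbitrary, and `validation_loss` is a hypothetical stand-in for training a model and scoring it on held-out data, which is where the overfitting danger is controlled.

```python
import random

# Illustrative candidate values for each parameter named on the slide.
SEARCH_SPACE = {
    "num_trees": [50, 100, 200, 400],
    "max_depth": [2, 3, 4, 6, 8],
    "loss": ["squared", "absolute"],
    "min_node_split_size": [2, 10, 50],
    "bagging_rate": [0.5, 0.7, 1.0],
    "shrinkage": [0.01, 0.05, 0.1, 0.3],
}

def sample_params(rng):
    """Draw one random configuration from the search space."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}

def validation_loss(params):
    """Hypothetical stand-in: a real version would train a model and score held-out data."""
    return (abs(params["max_depth"] - 4) * 0.05
            + abs(params["shrinkage"] - 0.1)
            + 10.0 / params["num_trees"])

def random_search(trials, seed=0):
    """Return the best configuration and its loss over `trials` random draws."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(trials):
        params = sample_params(rng)
        loss = validation_loss(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

best_params, best_loss = random_search(trials=50)
print(best_params, round(best_loss, 4))
```

Each trial is independent, so the loop parallelizes naturally across a cluster; smarter strategies use earlier results to decide what to try next.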
16
AutoML
[Diagram: a data set feeds Alg #1 through Alg #N. Steps labeled: 1) Investigate N ML algorithms; 2) Tune top performing algorithms. A feature engineering stage is labeled: 2) Feature elimination.]
17
Spark is for large datasets
*http://datascience.la/benchmarking-random-forest-implementations/
• If your data fits on a single node…
• Other high-performance options exist
*http://haifengl.github.io/smile/index.html
[Chart axis: Runtime]
18
Data set size
• Large data lakes can consist of many small files
• Memory per node increasing rapidly
*http://www.kdnuggets.com/2015/11/big-ram-big-data-size-datasets.html
19
NVDIMMs
• Driving significant increases in node memory
  – Up to 10X increase in density
• Coming in late 2016…
20
Hybrid operators
• Time consuming to maintain multiple ML libraries & manually determine optimal choice
• Develop hybrid implementations that automatically choose the optimal approach
  – Data set size
  – Cluster size
  – Cluster utilization
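A hybrid operator needs a dispatch rule that looks at the three inputs listed above. The sketch below is one possible rule, not the one described in the talk: the `Context` fields, the `memory_headroom` and `busy_threshold` values, and the backend names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Context:
    dataset_gb: float           # data set size
    node_memory_gb: float       # memory available on one node
    cluster_nodes: int          # cluster size
    cluster_utilization: float  # fraction of the cluster already busy, 0.0 to 1.0

def choose_backend(ctx, memory_headroom=0.5, busy_threshold=0.8):
    """Pick "single-node" or "distributed" from data size, cluster size and load."""
    if ctx.cluster_nodes <= 1:
        return "single-node"  # nothing to distribute across
    # Normally leave headroom for working memory on the single node.
    budget = ctx.node_memory_gb * memory_headroom
    # On a busy cluster, accept a tighter single-node fit rather than queueing.
    if ctx.cluster_utilization >= busy_threshold:
        budget = ctx.node_memory_gb
    return "single-node" if ctx.dataset_gb <= budget else "distributed"

small = choose_backend(Context(4, 64, 20, 0.2))
large = choose_backend(Context(500, 64, 20, 0.2))
medium_idle = choose_backend(Context(48, 64, 20, 0.2))
medium_busy = choose_backend(Context(48, 64, 20, 0.9))
print(small, large, medium_idle, medium_busy)
```

Keeping the rule in one function means callers use a single operator and the choice of library stays an implementation detail.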
21
Single-node performance (1/2)
*http://www.ayasdi.com/blog/LawrenceSpracklen
22
Single-node performance (2/2)
*http://www.ayasdi.com/blog/LawrenceSpracklen
23
Operationalization
• What happens after the models are created?
• How does the business benefit from the insights?
• Operationalization is frequently the weak link
  – Operationalizing PowerPoint?
  – Hand-rolled scoring flows
24
PFA
• Portable Format for Analytics (PFA)
• Successor to PMML
• Significant flexibility in encapsulating complex data preprocessing
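A PFA document is JSON that declares an input type, an output type and an action to apply to each input. The sketch below builds a minimal document of that shape and runs it through a toy evaluator. The evaluator is purely illustrative: it understands only the `+` operation and the `input` symbol, and it is in no way a conforming PFA engine.

```python
import json

# Minimal PFA-style document: read a double, add 100, emit a double.
pfa_document = {
    "input": "double",
    "output": "double",
    "action": [{"+": ["input", 100]}],
}

def toy_evaluate(expr, value):
    """Evaluate one expression; supports only numbers, the input symbol and "+"."""
    if expr == "input":
        return value
    if isinstance(expr, (int, float)):
        return expr
    if isinstance(expr, dict) and list(expr) == ["+"]:
        left, right = expr["+"]
        return toy_evaluate(left, value) + toy_evaluate(right, value)
    raise ValueError(f"unsupported expression in this toy evaluator: {expr!r}")

def toy_score(document_json, value):
    """Apply the document's action to one input; the last expression is the result."""
    document = json.loads(document_json)
    result = None
    for expr in document["action"]:
        result = toy_evaluate(expr, value)
    return result

serialized = json.dumps(pfa_document)
score = toy_score(serialized, 1.5)
print(serialized)
print(score)
```

Because the preprocessing and the model travel together in one document, the scoring side does not need hand-written glue code for each model.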
25
Conclusions
• Spark delivers significant performance improvements over MR
  – Can introduce more tuning requirements
• Provides an opportunity for AutoML
  – Automatically determine good solutions
• Understand when it’s appropriate
• Don’t forget about operationalization