Harnessing Big Data with Spark


Transcript of Harnessing Big Data with Spark

Page 1: Harnessing Big Data with Spark

Harnessing Big Data with Spark

Lawrence Spracklen Alpine Data

Page 2: Harnessing Big Data with Spark


Alpine Data

Page 3: Harnessing Big Data with Spark


MapReduce

•  Allows the distribution of large data computations across a cluster

•  Computations typically composed of a sequence of MR operations (a minimal sketch follows the diagram below)

[Diagram: Big Data → Map() → Reduce() → Output]
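
To make the pattern concrete, here is a minimal single-process sketch of the map and reduce phases in plain Python (not Hadoop code); the word-count example and input records are purely illustrative.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, value) pairs -- here (word, 1) for a word count.
    for line in records:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle + Reduce: group the values by key, then aggregate each group.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

big_data = ["spark builds on mapreduce", "mapreduce writes to disk"]
print(reduce_phase(map_phase(big_data)))
# {'spark': 1, 'builds': 1, 'on': 1, 'mapreduce': 2, 'writes': 1, 'to': 1, 'disk': 1}
```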

Page 4: Harnessing Big Data with Spark


MR Performance

•  Multiple disk interactions required in EACH MR operation

[Diagram: disk reads and writes occurring in both the Map and Reduce phases]

Page 5: Harnessing Big Data with Spark


Performance Hierarchy

[Figure: read bandwidth hierarchy — 0.10 GB/s, 0.10 GB/s, 0.60 GB/s, and 80 GB/s tiers; roughly 100X read bandwidth between the slowest and fastest tiers]

Page 6: Harnessing Big Data with Spark


Optimizing MR

•  Many companies have significant legacy MR code
  –  Either direct MR or indirect usage via Pig

•  A variety of techniques to accelerate MR
  –  Apache Tez
  –  Tachyon or Apache Ignite
  –  SystemML

Page 7: Harnessing Big Data with Spark


Spark

•  Several significant advancements over MR
  –  Generalizes two-stage MR into arbitrary DAGs
  –  Enables in-memory dataset caching
  –  Improved usability

•  Reduced disk reads/writes deliver significant speedups
  –  Especially for iterative algorithms like ML (see the caching sketch below)
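
As a hedged illustration of why in-memory caching matters for iterative workloads, here is a small PySpark sketch; the file path, column name, and loop body are assumptions, not part of the original talk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-demo").getOrCreate()

points = spark.read.parquet("hdfs:///data/points.parquet")  # illustrative path
points.cache()   # keep the dataset in executor memory after first materialization
points.count()   # force the initial read from disk

for i in range(10):
    # Each pass reuses the in-memory copy instead of re-reading from disk,
    # which is where iterative ML algorithms see the largest speedups.
    points.groupBy("label").count().collect()
```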

Page 8: Harnessing Big Data with Spark


Perf comparisons

*http://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce

Page 9: Harnessing Big Data with Spark


Spark Tuning

•  Increased reliance on memory introduces greater requirement for tuning

•  Need to understand memory requirements for caching (a configuration sketch follows below)

•  Significant performance benefits associated with “getting it right”

•  Auto-tuning is coming….
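
For context, these are the kinds of memory-related knobs the tuning discussion refers to; a minimal sketch, with placeholder values rather than recommendations.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tuning-demo")
         .config("spark.executor.memory", "8g")           # heap available to each executor
         .config("spark.memory.fraction", "0.6")          # share of heap for execution + storage
         .config("spark.memory.storageFraction", "0.5")   # portion of that protected for cached data
         .getOrCreate())
```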

Page 10: Harnessing Big Data with Spark


Optimization opportunities

•  Spark delivers improved ML performance using reduced cluster resources

•  Enables numerous opportunities
  –  Reduced time to insights
  –  Reduced cluster size
  –  Eliminate subsampling
  –  AutoML

Page 11: Harnessing Big Data with Spark


AutoML

•  Data sets increasingly large and complex
•  Increasingly difficult to intuitively “know” the optimal:
  –  Feature engineering
  –  Choice of algorithm
  –  Parameterization of the algorithm(s)

•  Significant manual trial-and-error
•  Cult of the algorithm

Page 12: Harnessing Big Data with Spark


Feature Engineering

•  Essential for model performance, efficacy, robustness and simplicity
  –  Feature extraction
  –  Feature selection
  –  Feature construction
  –  Feature elimination

•  Domain/dataset knowledge is important, but basic automation is feasible (see the sketch below)
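
As one example of that basic automation, here is a sketch using Spark ML's ChiSqSelector on a toy DataFrame; the tiny data set and the choice to keep two features are purely illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("feature-selection-demo").getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.1, 0.1]), 1.0),
     (Vectors.dense([2.0, 1.0, 1.0]), 0.0),
     (Vectors.dense([2.0, 1.3, 1.0]), 0.0)],
    ["features", "label"])

# Keep the two features with the strongest chi-squared association to the label.
selector = ChiSqSelector(numTopFeatures=2, featuresCol="features",
                         labelCol="label", outputCol="selected")
selector.fit(df).transform(df).show()
```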

Page 13: Harnessing Big Data with Spark


Algorithm selection

•  Select the dependent column
•  Indicate classification or regression
•  Press “go” (sketched below)
  –  Algorithms run in parallel across the cluster
  –  Minimally provides a good starting point
  –  Significantly reduces “busy work”
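
A rough sketch of what "press go" could look like with Spark ML: fit several candidate classifiers on the same data and rank them by a holdout metric. The candidate list, column names, and the train_df/test_df DataFrames are assumptions; note that each individual fit is itself distributed across the cluster.

```python
from pyspark.ml.classification import (LogisticRegression,
                                       RandomForestClassifier, GBTClassifier)
from pyspark.ml.evaluation import BinaryClassificationEvaluator

candidates = {
    "logistic_regression": LogisticRegression(labelCol="label", featuresCol="features"),
    "random_forest": RandomForestClassifier(labelCol="label", featuresCol="features"),
    "gbt": GBTClassifier(labelCol="label", featuresCol="features"),
}
evaluator = BinaryClassificationEvaluator(labelCol="label")  # area under ROC by default

scores = {}
for name, algorithm in candidates.items():
    model = algorithm.fit(train_df)                          # train_df assumed to exist
    scores[name] = evaluator.evaluate(model.transform(test_df))

best = max(scores, key=scores.get)
print("Best starting point: %s (AUC %.3f)" % (best, scores[best]))
```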

Page 14: Harnessing Big Data with Spark


Hyperparameter optimization

•  Are the default parameters optimal?
•  How do I adjust them intelligently?
  –  Number of trees? Depth of trees? Splitting criteria?
•  Tedious trial and error
•  Overfitting danger
•  Intelligent automatic search (see the sketch below)
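
One way to express that automatic search is Spark ML's CrossValidator over a parameter grid; a sketch, with placeholder grid values and an assumed train_df.

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [50, 100, 200])   # number of trees
        .addGrid(rf.maxDepth, [5, 10, 15])      # depth of trees
        .build())

cv = CrossValidator(estimator=rf,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)                 # cross-validation guards against overfitting
best_model = cv.fit(train_df).bestModel         # train_df assumed to exist
```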

Page 15: Harnessing Big Data with Spark


Algorithm tuning

•  Gradient-boosted tree parameterization, e.g. (mapped onto Spark ML below):
  –  # of trees
  –  Maximum tree depth
  –  Loss function
  –  Minimum node split size
  –  Bagging rate
  –  Shrinkage
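
For reference, these knobs map roughly onto Spark ML's GBTClassifier as follows; the values shown are illustrative, not tuned recommendations.

```python
from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier(labelCol="label", featuresCol="features",
                    maxIter=100,              # number of trees (boosting iterations)
                    maxDepth=5,               # maximum tree depth
                    lossType="logistic",      # loss function
                    minInstancesPerNode=10,   # minimum node split size
                    subsamplingRate=0.8,      # bagging rate
                    stepSize=0.1)             # shrinkage (learning rate)
```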

Page 16: Harnessing Big Data with Spark


AutoML

[Diagram: AutoML workflow — the data set is fed to N candidate ML algorithms (Alg #1 … Alg #N): 1) investigate the N ML algorithms, 2) tune the top-performing algorithms, 3) apply feature engineering / feature elimination and re-run the surviving algorithms]

Page 17: Harnessing Big Data with Spark


Spark is for large datasets

*http://datascience.la/benchmarking-random-forest-implementations/

•  If your data fits on a single node…
•  Other high-performance options exist

*http://haifengl.github.io/smile/index.html

[Chart: run time comparison of random forest implementations (see benchmark link above)]

Page 18: Harnessing Big Data with Spark


Data set size

•  Large data lakes can consist of many small files

•  Memory per node increasing rapidly

*http://www.kdnuggets.com/2015/11/big-ram-big-data-size-datasets.html

Page 19: Harnessing Big Data with Spark


NVDIMMs

•  Driving significant increases in node memory
  –  Up to 10X increase in density

•  Coming in late 2016…

Page 20: Harnessing Big Data with Spark


Hybrid operators

•  Time-consuming to maintain multiple ML libraries & manually determine the optimal choice

•  Develop hybrid implementations that automatically choose the optimal approach (see the sketch below)
  –  Data set size
  –  Cluster size
  –  Cluster utilization
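
A hypothetical sketch of such a hybrid operator: a dispatch function that falls back to a single-node implementation when the data comfortably fits on one machine. The row threshold and the two helpers (train_single_node, train_with_spark_ml) are assumptions, not a real library API.

```python
def train_hybrid(df, spark, max_local_rows=1_000_000):
    """Choose a single-node or distributed training path based on simple heuristics."""
    n_rows = df.count()
    parallelism = spark.sparkContext.defaultParallelism
    if n_rows <= max_local_rows or parallelism <= 1:
        # Small data (or effectively no cluster): collect locally and use an
        # in-memory library, which is often faster than distributed training.
        return train_single_node(df.toPandas())   # hypothetical single-node path
    # Otherwise stay distributed and train with Spark ML.
    return train_with_spark_ml(df)                # hypothetical distributed path
```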

Page 21: Harnessing Big Data with Spark


Single-node performance (1/2)

*http://www.ayasdi.com/blog/LawrenceSpracklen

Page 22: Harnessing Big Data with Spark


Single-node performance (2/2)

*http://www.ayasdi.com/blog/LawrenceSpracklen

Page 23: Harnessing Big Data with Spark


Operationalization

•  What happens after the models are created?
•  How does the business benefit from the insights?
•  Operationalization is frequently the weak link
  –  Operationalizing PowerPoint?
  –  Hand-rolled scoring flows

Page 24: Harnessing Big Data with Spark


PFA

•  Portable Format for Analytics (PFA)
•  Successor to PMML
•  Significant flexibility in encapsulating complex data preprocessing (minimal example below)
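
To show the shape of a PFA document, here is an "add 100 to the input" hello-world in the style of the PFA tutorial, written as a Python dict and serialized to JSON; real documents add model parameters ("cells") and richer preprocessing, and a scoring engine such as Titus or Hadrian consumes the JSON form.

```python
import json

pfa_doc = {
    "input": "double",                  # input schema (an Avro type)
    "output": "double",                 # output schema
    "action": [{"+": ["input", 100]}]   # expression applied to each incoming record
}
print(json.dumps(pfa_doc, indent=2))
```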

Page 25: Harnessing Big Data with Spark


Conclusions

•  Spark delivers significant performance improvements over MR
  –  Can introduce more tuning requirements
•  Provides an opportunity for AutoML
  –  Automatically determine good solutions
•  Understand when it's appropriate
•  Don't forget about operationalization