Massive scale analytics with Stratosphere using R

Massive scale analytics with Stratosphere using R Jose Luis Lopez Pino [email protected] Database Systems and Information Management Technische Universität Berlin Supervised by Volker Markl Advised by Marcus Leich, Kostas Tzoumas August 28, 2014

description

In this talk, we discuss our approach to bringing large-scale deep analytics to the masses. R is an extremely popular numerical computing environment, but scientific data processing frequently hits its memory limits. On the other hand, systems that execute data-intensive tasks, such as Hadoop or Stratosphere, are not popular among R users because writing programs using these paradigms is cumbersome. We present an approach to overcome these limitations using the Stratosphere/Apache Flink big data platform by means of an R package and ready-to-use distributed algorithms. This solution allows the user, with small modifications to the R code, to easily execute distributed scenarios using popular machine learning techniques. We cover the implementation details of the proposed solution, including the architecture of the system, the functionality implemented and working examples. In addition, we cover the differences between our approach and other solutions that integrate R with Hadoop or other large-scale analytics systems. Finally, the results of the performance tests show that this solution is competitive with existing R implementations for small amounts of data and is able to scale up to the gigabyte level.

Transcript of Massive scale analytics with Stratosphere using R

Page 1: Massive scale analytics with Stratosphere using R

Massive scale analytics with Stratosphere using R

Jose Luis Lopez Pino [email protected]

Database Systems and Information Management, Technische Universität Berlin

Supervised by Volker Markl

Advised by Marcus Leich, Kostas Tzoumas

August 28, 2014

v6


Introduction Motivation Our approach Related work Conclusions and Future Work

Introduction


Data analysis to the masses

- Deep analytics [1]: sophisticated statistical methods, such as linear models, clustering or classification, that are frequently used to extract knowledge from data.
  - Data warehousing and BI can't answer all the questions.
  - The ever-growing number of new data sources and tools makes it worse.
- There is demand for answers to these questions.
- At small scale: data pipelining tools (RapidMiner) and numerical computing environments (R, Matlab or SPSS).
- Big data brings new opportunities to the market but also presents unfamiliar challenges.

[1] Sudipto Das, Yannis Sismanis, Kevin S Beyer, Rainer Gemulla, Peter J Haas, and John McPherson. Ricardo: integrating R and Hadoop. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 987–998. ACM, 2010


Options

- R:
  - R is a numerical computing environment and DSL for statistics.
  - Not a query language, unlike SQL.
  - Successful at small scale (in combination with CRAN packages).
- MapReduce/Hadoop:
  - Highly parallel programs, but lacking in expressivity.
  - HDFS: a de facto standard for storing large amounts of data.
- Stratosphere:
  - Platform for massively parallel computing / big data analytics.
  - PACT: MapReduce + new operators + iterations.
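To make the MapReduce model behind the Hadoop and PACT options more concrete, here is a minimal single-process sketch in Python that counts term frequencies with a map step and a reduce step. It is not Stratosphere or Hadoop code; the function names (`map_terms`, `reduce_counts`, `frequent_terms`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_terms(line):
    """Map step: emit a (term, 1) pair for every whitespace-separated term."""
    return [(term.lower(), 1) for term in line.split()]


def reduce_counts(pairs):
    """Reduce step: group the pairs by term and sum the counts per group."""
    totals = defaultdict(int)
    for term, count in pairs:
        totals[term] += count
    return dict(totals)


def frequent_terms(lines, top=3):
    """Run map over every line, reduce the pairs, and keep the top terms."""
    pairs = chain.from_iterable(map_terms(line) for line in lines)
    counts = reduce_counts(pairs)
    # Sort by descending count, then alphabetically to break ties.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top]
```

In a real distributed engine the map calls would run on separate partitions of the input and the grouping would happen in a shuffle phase; the sketch only mirrors the data flow.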


Basic terms and definitions

- KDD comprises nine steps, including: understanding the domain and the goals, creating the target source, cleaning and processing the source, data reduction and projection, choosing a data mining method, choosing the data mining algorithm, mining the data, and interpreting the patterns.

Figure: Overview of the process [2]

[2] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. The KDD process for extracting useful knowledge from volumes of data. Commun. ACM, 39(11):27–34, November 1996


Motivation


Clustering


Classification


Frequent Terms


Writing massively parallel programs

- It is a cumbersome and onerous process.
- We need a single tool.
- We need tools that can process anything from a small amount of data up to very large volumes.
- The majority of data researchers are highly skilled in R and statistics but poorly skilled in Big Data systems and in the implementation of machine learning algorithms [3] [4].
- Although Stratosphere offers a more expressive interface, writing a parallel program is still not a trivial job.

[3] Harlan Harris, Sean Murphy, and Marck Vaisman. Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work. O'Reilly Media, Inc., 2013

[4] Sean Kandel, Andreas Paepcke, Joseph M Hellerstein, and Jeffrey Heer. Enterprise data analysis and visualization: An interview study. Visualization and Computer Graphics, IEEE Transactions on, 18(12):2917–2926, 2012


Relation with the KDD process

- Data extraction is covered by other solutions.
- Pre-processing and transformation seem difficult.
- Data mining: where we have a competitive advantage.
- Data visualization is a different problem.


Design goals

- Ease of use: ready-to-use algorithms.
- Design a library.
- Facilitate working with data.
- Easy to distribute.
- Focus on algorithms that scale.


Our approach


Architecture


Library: Goals

- Classification, clustering and regression.
- No Free Lunch Theorem: more than one algorithm.
- Presence in other ML libraries.
- Large-scale.
- Ensemble scenarios.
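The slide does not say which ensemble scenarios the library targets. As one common illustration of why offering more than one algorithm is useful, the following Python sketch combines several classifiers by majority vote. The names `majority_vote` and `ensemble_predict`, and the idea of modelling a classifier as a plain function, are assumptions made for this example only.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine one predicted label per classifier into a single label."""
    counts = Counter(predictions)
    best = max(counts.values())
    # Break ties deterministically by picking the smallest label.
    return min(label for label, n in counts.items() if n == best)


def ensemble_predict(classifiers, sample):
    """Ask every classifier for a label and return the majority label."""
    return majority_vote([classify(sample) for classify in classifiers])
```

For instance, three threshold rules that disagree on a borderline sample still yield a single answer, because two of the three votes decide the label.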


Library: Example


R package

- Easy to distribute.
- Organized in namespaces.
- Submitting jobs to the cluster.
- Working with files.
- Mining.
- Configuration.


Example: Code


Example: Non-parallel classification example


Example: Parallel classification example


Example: Parallel clustering example


Performance

- Competitive with, and even faster than, native R programs for every parallelizable program in the same (small) file-size range, thanks to pipelining.
- Competitive with R for data mining tasks with many iterations in the same file-size range.
- Able to process files of a volume that is inaccessible to R.
- Able to scale to the gigabyte level without significant performance loss.
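Iteration-heavy tasks such as the k-means runs used in the evaluation are a demanding case because every iteration needs a full pass over the data. The following one-dimensional k-means sketch in Python shows that loop structure. It is an illustration under simplifying assumptions (1-D points, caller-supplied starting centroids), not the implementation evaluated in the talk, and the function names are hypothetical.

```python
def assign(points, centroids):
    """Assignment step: map each point to the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
        for p in points
    ]


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        # Keep the old centroid if no point was assigned to it.
        new_centroids.append(sum(members) / len(members) if members else old)
    return new_centroids


def kmeans(points, centroids, iterations=100):
    """Repeat assignment and update until convergence or the iteration cap."""
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = update(points, labels, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

The assignment step is independent per point, so it can be split across partitions, while the update step is an aggregation per cluster; this is the shape that a map plus reduce inside an iteration operator would take.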


Performance: Frequent Terms example


Performance: Most favorable case for R

Figure: KMeans 100 iterations


Performance: Breakdown example

Figure: Clustering example nonparallel breakdown (Time in seconds)


Performance: Scalability example

Figure: Frequent Terms parallel scalability

Related work


Data mining libraries

- Don't scale: Weka and scikit-learn.
- Large-scale:
  - Mahout: limited set of problems.
  - MLlib: also facilitates the implementation of new algorithms.
  - Oryx.
- In-database: MADlib and PivotalR.


Data intensive computation with R

- External memory:
  - Don't scale out: biglm, bigmemory, ff, foreach.
  - RevoScaleR: xdf files and Hadoop.
- Divide and recombine: it's necessary to use the MapReduce (MR) model.
- Query languages:
  - Limited expressivity.
  - Good for the first step of the KDD process.
- Distributed collection manipulation:
  - Limited set of operators.
  - Presto and SparkR.
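The divide-and-recombine idea in the list above can be sketched in a few lines of Python: split the data into subsets, compute a small partial result per subset, and recombine the partial results into the final answer. The example computes a mean; the function names `divide`, `apply_chunk` and `recombine` are hypothetical and do not correspond to any particular R package.

```python
def divide(values, n_chunks):
    """Split the data into n_chunks contiguous subsets of near-equal size."""
    size, extra = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def apply_chunk(chunk):
    """Per-subset step: return the partial sum and the number of values."""
    return (sum(chunk), len(chunk))


def recombine(partials):
    """Recombine step: merge the partial results into the overall mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count
```

The per-subset step maps naturally onto a map phase and the recombination onto a reduce phase, which is why this style of analysis ends up tied to the MapReduce model.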


Conclusions and Future Work


Conclusion

- Contributions:
  - Library definition.
  - File manipulation and cluster interaction.
  - Scenarios that prove the concept.
- Code very similar to the original one.
- Promising performance evaluation.


Future work

- Improvements in the library.
- Hybrid approaches.
- Distributed evaluation.
- Improvements in the architecture.


Essential bibliography

- Sudipto Das, Yannis Sismanis, Kevin S Beyer, Rainer Gemulla, Peter J Haas, and John McPherson. Ricardo: integrating R and Hadoop. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 987–998. ACM, 2010

- Hadley Wickham. Advanced R Programming. CRC Press, 2014. To appear

- Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas, and Daniel Warneke. The Stratosphere platform for big data analytics. The VLDB Journal, pages 1–26, 2014

- Hai Qian. PivotalR: A package for machine learning on big data. The R Journal, 6(1):57–67, June 2014

- Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. The KDD process for extracting useful knowledge from volumes of data. Commun. ACM, 39(11):27–34, November 1996


Recap

1 Introduction
  Data analysis to the masses
  Options
  Basic terms and definitions

2 Motivation
  Motivating problems
  Writing massively parallel programs
  Relation with the KDD process
  Design goals

3 Our approach
  Architecture
  Library
  R package
  Example
  Performance

4 Related work
  Data mining libraries
  Data intensive computation with R

5 Conclusions and Future Work
  Conclusion
  Future work
  Essential bibliography