Massive scale analytics with Stratosphere using R
Jose Luis Lopez Pino, [email protected]
Database Systems and Information Management, Technische Universität Berlin
Supervised by Volker Markl
Advised by Marcus Leich, Kostas Tzoumas
August 28, 2014
Introduction Motivation Our approach Related work Conclusions and Future Work
Introduction
Jose Luis Lopez Pino [email protected] 2
Data analysis to the masses
- Deep analytics¹: sophisticated statistical methods, such as linear models, clustering, or classification, that are frequently used to extract knowledge from data.
  - Data warehousing and BI cannot answer all the questions.
  - The ever-growing number of new data sources and tools makes it worse.
- There is real demand for answering these questions.
- At small scale: data-pipelining tools (RapidMiner) and numerical computing environments (R, Matlab, or SPSS).
- Big data brings new opportunities to the market but also presents unfamiliar challenges.
1. Sudipto Das, Yannis Sismanis, Kevin S. Beyer, Rainer Gemulla, Peter J. Haas, and John McPherson. Ricardo: Integrating R and Hadoop. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 987–998. ACM, 2010.
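At small scale this kind of analysis is only a few lines of R. A minimal sketch, using only base R and the built-in iris data, of the clustering workflow such environments make easy:

```r
# Small-scale clustering in plain R: k-means on the built-in iris
# measurements, comfortable as long as the data fits in memory.
data(iris)
features <- iris[, 1:4]

set.seed(42)                       # reproducible cluster assignment
fit <- kmeans(features, centers = 3, nstart = 10)

# Cross-tabulate discovered clusters against the known species labels.
print(table(fit$cluster, iris$Species))
```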
Options
- R:
  - A numerical computing environment and a DSL for statistics.
  - Not a query language, unlike SQL.
  - Successful at small scale (in combination with CRAN packages).
- MapReduce/Hadoop:
  - Highly parallel programs, but limited expressivity.
  - HDFS: a de facto standard for storing large amounts of data.
- Stratosphere:
  - A platform for massively parallel computing / big data analytics.
  - PACT: MapReduce + new operators + iterations.
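The MapReduce half of PACT can be sketched in a few lines of plain R: a toy word count (the canonical MapReduce example) with the map and reduce phases written out sequentially, only to illustrate the dataflow that Hadoop or Stratosphere would execute in parallel.

```r
# Word count in the MapReduce style, simulated sequentially in base R.
docs <- c("big data analytics", "data analysis with r", "big data")

# Map phase: split each document into words, emitting one record per word.
words <- unlist(lapply(docs, function(d) strsplit(d, " ")[[1]]))

# Shuffle + reduce phase: group records by key (the word) and sum a 1
# for each occurrence.
counts <- tapply(rep(1L, length(words)), words, sum)
print(counts)
```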
Basic terms and definitions
- The KDD process comprises nine steps: understanding the domain and the goals, creating the target data set, cleaning and preprocessing the data, data reduction and projection, choosing the data mining method, choosing the data mining algorithm, mining the data, interpreting the patterns, and acting on the discovered knowledge.
Figure: Overview of the process2
2. Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27–34, November 1996.
Motivation
Clustering
Classification
Frequent Terms
Writing massively parallel programs
- It is a cumbersome and onerous process.
- A single tool for the whole process is needed.
- We need tools that can process anything from a small amount of data up to very large volumes.
- Most data researchers are strongly skilled in R and statistics but poorly skilled in big data systems and in implementing machine learning algorithms.³ ⁴
- Although Stratosphere offers a more expressive interface, writing a parallel program is still not a trivial job.
3. Harlan Harris, Sean Murphy, and Marck Vaisman. Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work. O'Reilly Media, Inc., 2013.
4. Sean Kandel, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer. Enterprise data analysis and visualization: An interview study. IEEE Transactions on Visualization and Computer Graphics, 18(12):2917–2926, 2012.
Relation with the KDD process
- Data extraction is covered by other solutions.
- Pre-processing and transformation seem difficult.
- Data mining: where we have a competitive advantage.
- Data visualization is a different problem.
Design goals
- Ease of use: ready-to-use algorithms.
- Design a library.
- Facilitate working with data.
- Easy to distribute.
- Focus on algorithms that scale.
Our approach
Architecture
Architecture
Library: Goals
- Classification, clustering and regression.
- No Free Lunch Theorem: more than one algorithm.
- Presence in other ML libraries.
- Large-scale.
- Ensemble scenarios.
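The No Free Lunch point is easy to demonstrate: two standard clustering algorithms on the same data usually produce different partitions, which is why the library should ship more than one. A minimal base-R sketch:

```r
# Two clusterings of the same (scaled) iris features: centroid-based
# k-means versus complete-linkage hierarchical clustering.
data(iris)
x <- scale(iris[, 1:4])

set.seed(1)
km <- kmeans(x, centers = 3, nstart = 10)$cluster
hc <- cutree(hclust(dist(x), method = "complete"), k = 3)

# The partitions generally disagree; the cross-table shows where.
print(table(kmeans = km, hclust = hc))
```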
Library: Example
R package
- Easy to distribute.
- Organized in namespaces.
- Submitting jobs to the cluster.
- Working with files.
- Mining.
- Configuration.
Example: Code
Example: Non-parallel classification example
Example: Parallel classification example
Example: Parallel clustering example
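The example slides showed code side by side, but the images are not preserved in this transcript. As a stand-in for the non-parallel classification example, here is a nearest-centroid classifier in plain base R; the thesis package would expose a near-identical call that submits the same task as a Stratosphere job, but its actual function names are not reproduced here.

```r
# Non-parallel classification in base R: nearest-centroid on iris.
data(iris)
set.seed(7)
idx   <- sample(nrow(iris), 100)        # 100 rows for training
train <- iris[idx, ]
test  <- iris[-idx, ]

# Per-class mean of each feature = that class's centroid.
cents <- aggregate(train[, 1:4], by = list(Species = train$Species), mean)

# Assign each test row to the class with the nearest centroid.
classify <- function(row) {
  d <- apply(cents[, -1], 1, function(centroid) sum((row - centroid)^2))
  as.character(cents$Species[which.min(d)])
}
pred     <- apply(test[, 1:4], 1, classify)
accuracy <- mean(pred == as.character(test$Species))
print(accuracy)
```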
Performance
- Competitive with, and even faster than, native R programs for every parallelizable program in the same (small) file-size range, thanks to pipelining.
- Competitive with R for data mining tasks with many iterations in the same file-size range.
- Able to process files of a volume that is inaccessible to R.
- Able to scale to the gigabyte level without significant loss.
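The thesis's exact benchmark harness is not shown in these slides; measurements of this kind are typically taken in R with system.time() around the mining call, repeated per input size. A minimal sketch (the data size here is illustrative, not one of the thesis's test files):

```r
# Timing a k-means run the way the R side of such benchmarks is
# usually measured: wall-clock seconds via system.time().
set.seed(3)
x <- matrix(rnorm(2e5), ncol = 4)      # 50,000 rows of synthetic data

elapsed <- system.time(
  kmeans(x, centers = 3, iter.max = 100, nstart = 1)
)["elapsed"]
print(unname(elapsed))                 # seconds for the whole call
```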
Performance: Frequent Terms example
Performance: Most favorable case for R
Figure: KMeans, 100 iterations
Performance: Breakdown example
Figure: Clustering example, non-parallel breakdown (time in seconds)
Performance: Scalability example
Figure: Frequent Terms parallel scalability
Related work
Data mining libraries
- Don't scale: Weka and scikit-learn.
- Large-scale:
  - Mahout: limited set of problems.
  - MLlib: also facilitates the implementation of new algorithms.
  - Oryx.
- In-database: MADlib and PivotalR.
Data intensive computation with R
- External memory:
  - Don't scale out: biglm, bigmemory, ff, foreach.
  - RevoScaleR: XDF files and Hadoop.
- Divide and recombine: the MapReduce model must be used.
- Query languages:
  - Limited expressivity.
  - Good for the first steps of the KDD process.
- Distributed collection manipulation:
  - Limited set of operators.
  - Presto and SparkR.
Conclusions and Future Work
Conclusion
- Contributions:
  - Library definition.
  - File manipulation and cluster interaction.
  - Scenarios that prove the concept.
- The resulting code is very similar to the original R code.
I Promising performance evaluation.
Future work
- Improvements in the library.
- Hybrid approaches.
- Distributed evaluation.
- Improvements in the architecture.
Essential bibliography
- Sudipto Das, Yannis Sismanis, Kevin S. Beyer, Rainer Gemulla, Peter J. Haas, and John McPherson. Ricardo: Integrating R and Hadoop. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 987–998. ACM, 2010.
- Hadley Wickham. Advanced R Programming. CRC Press, 2014. To appear.
- Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas, and Daniel Warneke. The Stratosphere platform for big data analytics. The VLDB Journal, pages 1–26, 2014.
- Hai Qian. PivotalR: A package for machine learning on big data. The R Journal, 6(1):57–67, June 2014.
- Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27–34, November 1996.
Recap
1. Introduction
   - Data analysis to the masses
   - Options
   - Basic terms and definitions
2. Motivation
   - Motivating problems
   - Writing massively parallel programs
   - Relation with the KDD process
   - Design goals
3. Our approach
   - Architecture
   - Library
   - R package
   - Example
   - Performance
4. Related work
   - Data mining libraries
   - Data intensive computation with R
5. Conclusions and Future Work
   - Conclusion
   - Future work
   - Essential bibliography