Enabling Exploratory Data Science with Spark and R · 2020-03-15
Enabling Exploratory Data Science with Spark and R Shivaram Venkataraman, Hossein Falaki (@mhfalaki)
About Apache Spark, AMPLab and Databricks
Apache Spark is a general distributed computing engine that unifies:
• Real-time streaming (Spark Streaming)
• Machine learning (SparkML/MLlib)
• SQL (Spark SQL)
• Graph processing (GraphX)
AMPLab (the Algorithms, Machines, and People Lab) at UC Berkeley is where Spark and SparkR were originally developed. Databricks Inc. is the company founded by the creators of Spark, focused on making big data simple by offering an end-to-end data processing platform in the cloud.
What is R?
Language and runtime. The cornerstone of R is the data frame concept.
Many data scientists love R
• Open source
• Highly dynamic
• Interactive environment
• Rich ecosystem of packages
• Powerful visualization infrastructure
• Data frames make data manipulation convenient
• Taught by many schools to stats and computing students
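The data-frame idiom these bullets point to can be shown with a minimal base-R sketch (using the built-in `mtcars` dataset, so no packages are required):

```r
# mtcars ships with R: one row per car model
df <- mtcars

# Filter rows and project columns, data-frame style
heavy <- df[df$wt > 3.5, c("mpg", "wt", "cyl")]

# Aggregate: mean mpg per cylinder count
by_cyl <- aggregate(mpg ~ cyl, data = df, FUN = mean)
by_cyl
```

The same select/filter/aggregate vocabulary carries over to SparkR later in this deck.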
Performance Limitations of R
R language
• R's dynamic design imposes restrictions on optimization

R runtime
• Single threaded
• Everything has to fit in memory
What would be ideal?
Seamless manipulation and analysis of very large data in R:
• R's flexible syntax
• R's rich package ecosystem
• R's interactive environment
• Scalability (scale up and out)
• Integration with distributed data sources / storage
Augmenting R with other frameworks
In practice, data scientists use R in conjunction with other frameworks (Hadoop MapReduce, Hive, Pig, relational databases, etc.).
[Diagram: the typical workflow]
1. Load, clean, transform, aggregate, and sample in Framework X (Language Y) over distributed storage
2. Save to local storage
3. Read and analyze in R
Iterate.
What is SparkR?
An R package distributed with Apache Spark:
• Provides an R frontend to Spark
• Exposes Spark DataFrames (inspired by R and Pandas)
• Convenient interoperability between R and Spark DataFrames
[Diagram: R (dynamic environment, interactivity, packages, visualization) + Spark (distributed/robust processing, data sources, off-memory data structures)]
How does SparkR solve our problems?
• No local storage involved
• Write everything in R
• Use Spark's distributed cache for interactive/iterative analysis at the speed of thought
Example SparkR program
```r
# Loading distributed data
df <- read.df("hdfs://bigdata/logs", source = "json")

# Distributed filtering and aggregation
errors <- subset(df, df$type == "error")
counts <- agg(groupBy(errors, df$code), num = count(df$code))

# Collecting and plotting small data
qplot(code, num, data = collect(counts),
      geom = "bar", stat = "identity") + coord_flip()
```
SparkR architecture
[Diagram: the Spark driver pairs an R process with a JVM via the R backend; the driver coordinates JVM workers, which read from the data sources.]
Overview of SparkR API
IO
• read.df / write.df
• createDataFrame / collect

Caching
• cache / persist / unpersist
• cacheTable / uncacheTable

Utility functions
• dim / head / take
• names / rand / sample / ...

MLlib
• glm / predict

DataFrame API
• select / subset / groupBy
• head / showDF / unionAll
• agg / avg / column / ...

SQL
• sql / table / saveAsTable
• registerTempTable / tables
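A sketch of how these pieces compose in a Spark 1.x-era SparkR session. This assumes Spark is installed with SparkR on the library path; the built-in `faithful` dataset stands in for real data:

```r
library(SparkR)
sc <- sparkR.init()              # connect R to a (local) Spark driver
sqlContext <- sparkRSQL.init(sc)

# IO: promote a local data.frame to a distributed DataFrame
df <- createDataFrame(sqlContext, faithful)

# DataFrame API: filter, then aggregate
long  <- subset(df, df$eruptions > 3)
stats <- agg(groupBy(long, long$waiting), num = count(long$waiting))

# Caching: keep the result hot for iterative exploration
cache(stats)

# Utility + IO: pull the (small) aggregate back into plain R
head(collect(stats))
```

The same session-setup lines are assumed by the other snippets in this deck.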
![Page 13: Enabling Exploratory Data Science with Spark and R · 2020-03-15 · About Apache Spark, AMPLab and Databricks Apache Spark is a general distributed computing engine that unifies:](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec561c23cf24a3c663ddfb7/html5/thumbnails/13.jpg)
Moving data between R and JVM
[Diagram: SparkR::createDataFrame() moves data from the R process to the JVM through the R backend; SparkR::collect() brings it back.]
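A minimal round trip over this path, assuming a SparkR 1.x session where `sqlContext` has already been initialized:

```r
# R -> JVM: serialize a local data.frame through the R backend
local_df <- data.frame(id = 1:100, value = rnorm(100))
dist_df  <- createDataFrame(sqlContext, local_df)

# JVM -> R: collect() deserializes the distributed data back into
# an ordinary data.frame; keep results small before collecting
back <- collect(dist_df)
stopifnot(is.data.frame(back), nrow(back) == 100)
```

Because everything funnels through the single driver-side R process, collect() is for small results only; large IO goes through the workers, as the next slide shows.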
![Page 14: Enabling Exploratory Data Science with Spark and R · 2020-03-15 · About Apache Spark, AMPLab and Databricks Apache Spark is a general distributed computing engine that unifies:](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec561c23cf24a3c663ddfb7/html5/thumbnails/14.jpg)
Moving data between R and JVM
[Diagram: with read.df() / write.df(), the JVM workers read and write distributed storage (HDFS/S3/..., or a FUSE mount) directly; the data never passes through the R process.]
Moving between languages
R:

```r
df <- read.df(...)
wiki <- filter(df, ...)
registerTempTable(wiki, "wiki")
```

Scala:

```scala
val wiki = table("wiki")
val parsed = wiki.map { case Row(_, _, text: String, _, _) =>
  text.split(' ')
}
val model = KMeans.train(parsed)
```
Mixing R and SQL
Pass a query to SQLContext and get the result back as a DataFrame
```r
# Register DataFrame as a table
registerTempTable(df, "dataTable")

# Complex SQL query; the result is returned as another DataFrame
aggCount <- sql(sqlContext, "select count(*) as num, type, date
                             from dataTable
                             group by type, date
                             order by date desc")

qplot(date, num, data = collect(aggCount), geom = "line")
```
SparkR roadmap and upcoming features
• Exposing MLlib functionality in SparkR
  • GLM is already exposed, with R formula support
• UDF support in R
  • Distribute a function and data
  • The ideal way to distribute existing R functionality and packages
• Complete the DataFrame API to behave/feel just like data.frame
Example use case: exploratory analysis
• Data pipeline implemented in Scala/Python
• New files are appended to existing data, partitioned by time
• The table schema is saved in the Hive metastore
• Data scientists use SparkR to analyze and visualize the data
1. refreshTable(sqlContext, "logsTable")
2. logs <- table(sqlContext, "logsTable")
3. Iteratively analyze/aggregate/visualize using Spark & R DataFrames
4. Publish/share results
Demo
How to get started with SparkR?
On your computer:
1. Download the latest version of Spark (1.5.2)
2. Build (Maven or sbt)
3. Run ./install-dev.sh inside the R directory
4. Start the R shell by running ./bin/sparkR

Alternatively:
• Deploy Spark (1.4+) on your cluster
• Sign up for a 14-day free trial at Databricks
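The local-build route can be sketched as a shell session. The archive URL follows the slide's Spark 1.5.2; treat the exact Maven invocation as an assumption and check the "Building Spark" docs for your release:

```shell
# 1. Download and unpack a Spark 1.5.2 source release
wget https://archive.apache.org/dist/spark/spark-1.5.2/spark-1.5.2.tgz
tar xzf spark-1.5.2.tgz && cd spark-1.5.2

# 2. Build Spark (Maven shown; sbt also works)
build/mvn -DskipTests package

# 3. Build/install the SparkR package from the R directory
./R/install-dev.sh

# 4. Start the SparkR shell
./bin/sparkR
```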
![Page 21: Enabling Exploratory Data Science with Spark and R · 2020-03-15 · About Apache Spark, AMPLab and Databricks Apache Spark is a general distributed computing engine that unifies:](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec561c23cf24a3c663ddfb7/html5/thumbnails/21.jpg)
Summary
1. SparkR is an R frontend to Apache Spark
2. Distributed data resides in the JVM
3. Workers do not run R processes (yet)
4. Mind the distinction between Spark DataFrames and R data frames
Further pointers
• http://spark.apache.org
• http://www.r-project.org
• http://www.ggplot2.org
• https://cran.r-project.org/web/packages/magrittr
• www.databricks.com

Office hour: 13:00–14:00 at the Databricks booth
Thank you