Intro to PySpark Workshop
Garren Staubli, Sr. Data Engineer, @gstaubli
Resources: garrens.com/pyspark124
#PySparkWorkshop
Do I know what I’m talking about?
Working with Spark since 2015
• Batch analytics in Spark + Hive, Pig and Hadoop MapReduce
• Real-time big data reporting using Spark/Impala/CDH
• Spark Structured Streaming + ML apps for real-time decision making
50+ answers on [logo] for Spark
Main Points
• Apache Spark
• Sample App Walkthrough
• Interactive Azure Jupyter Notebook
• Python-specific Spark advice
• Resources to continue learning
About Apache Spark
Structured Spark.ML GraphFrame
Lazily Evaluated
• Transforms vs Actions
Immutable
About Apache Spark
Detailed architecture (diagram):
Spark application (Driver), holding the SparkSession:
spark = SparkSession.builder\
    .appName(name="PySpark Intro")\
    .master("local[*]")\
    .getOrCreate()
Master (Cluster Manager)
• Slave (Worker): Executor running Tasks
• Slave (Worker): Executor running Tasks
• Slave (Worker): Executor running Tasks
About Apache Spark | Spark SQL
Spark SQL is not about SQL; it is about more than SQL
About Apache Spark | 2 Kinds of Actions
About Apache Spark | Modern vs Legacy
About Apache Spark | Modern Optimization
About Apache Spark | Planning
Walkthrough | Create Spark Session
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName(name="PySpark Intro")\
    .master("local[*]")\
    .getOrCreate()
Where it can run (master / cluster manager): local, standalone, YARN, Mesos and Kubernetes
Walkthrough | Read CSV into DataFrame
green_trips = spark.read\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .csv("green_tripdata_2017-06.csv")
inferSchema = true forces eager evaluation (Spark must read the data to infer column types); the default is false
Walkthrough | Behind the Scenes: UI
Walkthrough | DataFrame Schema
green_trips.printSchema()
• Eagerly evaluated (inferSchema = true)
• Lazily evaluated (inferSchema = false)
You guessed it… We’re hiring!
[Logos not preserved; the year and rank labels shown beside them follow]
• 2014, 2015 #1, 2016 #1, 2017 #4
• 2015 #1, 2016 #1
• 2014, 2015, 2016
• 2016
• 2015 #373, 2016 #166, 2017 #161