USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING...

29
DATASETS | DATAFRAMES | SQL USING SPARK EFFICIENTLY NAME: GARREN STAUBLI CONSULTANT @ BLUEPRINT CONSULTING SERVICES SPECIALTY: BIG DATA ENGINEERING 1 © 2017 Garren Staubli | Presentation resources: garrens.com/spark729

Transcript of USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING...

Page 1: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

DATASETS | DATAFRAMES | SQLUSING SPARK EFFICIENTLY

NAME: GARREN STAUBLI

CONSULTANT @ BLUEPRINT CONSULTING SERVICES

SPECIALTY: BIG DATA ENGINEERING

1 © 2017 Garren Staubli | Presentation resources: garrens.com/spark729

Page 2: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

DISCLAIMERS

2 © 2017 Garren Staubli | Presentation resources: garrens.com/spark729

I am not a Robot.

Non-OCD friendly formatting may follow.

The views expressed in this presentation probably belong to various different companies to whom I have sold my soul unwittingly by mindlessly accepting Terms of Service and Use.

Page 3: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

USING SPARK EFFICIENTLY

3

Spark Core:• RDD (Resilient Distributed Dataset)

Spark SQL:• Dataset (e.g. Dataset[CountryGDP])

• Typed templates• Matching schema• Relational and functional operations• Lazily Evaluated

• DataFrame (aka Dataset[Row])• Uses generic Row object

DISTRIBUTED DATA STRUCTURES

© 2017 Garren Staubli | Presentation resources: garrens.com/spark729

Page 4: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

© 2017 Garren Staubli | Presentation resources: garrens.com/spark7294

org.apache.spark.sql.Dataset[T]

USING SPARK EFFICIENTLY

Optimized

Parallelized

Immutable

Encoded

Strongly Typed

DATASETS

O(PIES)“A DATASET IS A STRONGLY

TYPED COLLECTION OF DOMAIN-SPECIFIC OBJECTS THAT CAN BE

TRANSFORMED IN PARALLEL USING FUNCTIONAL OR

RELATIONAL OPERATIONS.”

Page 5: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

© 2017 Garren Staubli | Presentation resources: garrens.com/spark7295

org.apache.spark.sql.Dataset[T]

USING SPARK EFFICIENTLY

DATASETSO(PIES)

OptimizedStorage TungstenExecution Catalyst

Page 6: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

© 2017 Garren Staubli | Presentation resources: garrens.com/spark7296

org.apache.spark.sql.Dataset[T]

USING SPARK EFFICIENTLY

Optimized [Storage] => TungstenDATASETS

O(PIES)

“JAVA OBJECTS HAVE A LARGE INHERENT MEMORY OVERHEAD.”

Page 7: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

© 2017 Garren Staubli | Presentation resources: garrens.com/spark7297

org.apache.spark.sql.Dataset[T]

USING SPARK EFFICIENTLY

Optimized [Storage] => TungstenDATASETS

O(PIES)

Tungsten Objectives:

1. Improve CPU and Memory efficiency

2. Push performance closer to bare metal

Page 8: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

© 2017 Garren Staubli | Presentation resources: garrens.com/spark7298

org.apache.spark.sql.Dataset[T]

USING SPARK EFFICIENTLY

Optimized [Storage] => TungstenDATASETS

O(PIES)

Memory Management and Binary Processing• Manage memory explicitly • Eliminate JVM object model overhead• Reduce garbage collection

Cache-aware computation• Algorithms and data structures to exploit memory hierarchy

Code generation• Using code generation to exploit modern compilers and CPUs

Page 9: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

© 2017 Garren Staubli | Presentation resources: garrens.com/spark7299

org.apache.spark.sql.Dataset[T]

USING SPARK EFFICIENTLY

Optimized [Execution] => CatalystDATASETS

O(PIES)

YOUR REQUEST

SPARK CATALYST

EXECUTION PLANNING

GENERATED CODE

Page 10: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

© 2017 Garren Staubli | Presentation resources: garrens.com/spark72910

org.apache.spark.sql.Dataset[T]

USING SPARK EFFICIENTLY

Optimized [Execution] => CatalystDATASETS

O(PIES)

YOUR REQUEST

Page 11: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

© 2017 Garren Staubli | Presentation resources: garrens.com/spark72911

org.apache.spark.sql.Dataset[T]

USING SPARK EFFICIENTLY

Optimized [Execution] => CatalystDATASETS

O(PIES)

SPARK CATALYST

Page 12: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

© 2017 Garren Staubli | Presentation resources: garrens.com/spark72912

org.apache.spark.sql.Dataset[T]

USING SPARK EFFICIENTLY

Optimized [Execution] => CatalystDATASETS

O(PIES)

EXECUTION PLANNING

GENERATED CODE

Page 13: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

© 2017 Garren Staubli | Presentation resources: garrens.com/spark72913

org.apache.spark.sql.Dataset[T]

USING SPARK EFFICIENTLY

Optimized [Execution] => CatalystDATASETS

O(PIES)

Page 14: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

© 2017 Garren Staubli | Presentation resources: garrens.com/spark72914

org.apache.spark.sql.Dataset[T]

USING SPARK EFFICIENTLY

ParallelizedDATASETS

BEFORE AFTER

O(PIES)

Page 15: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

© 2017 Garren Staubli | Presentation resources: garrens.com/spark72915

org.apache.spark.sql.Dataset[T]

USING SPARK EFFICIENTLY

ParallelizedDATASETS

O(PIES)

Partitions• Logical splits of data

• 64MB like HDFS Blocks

Tasks• Individual unit of execution

Page 16: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

© 2017 Garren Staubli | Presentation resources: garrens.com/spark72916

org.apache.spark.sql.Dataset[T]

USING SPARK EFFICIENTLY

ParallelizedDATASETS

O(PIES)

Page 17: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

© 2017 Garren Staubli | Presentation resources: garrens.com/spark72917

org.apache.spark.sql.Dataset[T]

USING SPARK EFFICIENTLY

ImmutableDATASETS

O(PIES)

Page 18: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

© 2017 Garren Staubli | Presentation resources: garrens.com/spark72918

org.apache.spark.sql.Dataset[T]

USING SPARK EFFICIENTLY

ImmutableDATASETS

O(PIES)

Page 19: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

© 2017 Garren Staubli | Presentation resources: garrens.com/spark72919

org.apache.spark.sql.Dataset[T]

USING SPARK EFFICIENTLY

EncodedDATASETS

O(PIES)

Data validation

Page 20: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

© 2017 Garren Staubli | Presentation resources: garrens.com/spark72920

org.apache.spark.sql.Dataset[T]

USING SPARK EFFICIENTLY

EncodedDATASETS

O(PIES)

Page 21: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

© 2017 Garren Staubli | Presentation resources: garrens.com/spark72921

org.apache.spark.sql.Dataset[T]

USING SPARK EFFICIENTLY

Strongly TypedDATASETS

O(PIES)

Page 22: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

© 2017 Garren Staubli | Presentation resources: garrens.com/spark72922

org.apache.spark.sql.Dataset[T]

USING SPARK EFFICIENTLY

Strongly TypedDATASETS

O(PIES)

Page 23: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

23

• Named Columns

• Weakly typed Row representation

• Similar structures

• Python/R DataFrame

• Relational DB table

USING SPARK EFFICIENTLY

© 2017 Garren Staubli | Presentation resources: garrens.com/spark729

org.apache.spark.sql.Dataset[Row]DATAFRAMES

Page 24: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

24

USING SPARK EFFICIENTLY

DISTRIBUTED DATA STRUCTURES RECAP

© 2017 Garren Staubli | Presentation resources: garrens.com/spark729

Page 25: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

25

DataFrame

• Exploratory Data Analysis

• Transforming Columns

USING SPARK EFFICIENTLY

© 2017 Garren Staubli | Presentation resources: garrens.com/spark729

org.apache.spark.sql.Dataset[Row]USE DATASET OR DATAFRAME?

Dataset

• Known object requirements

• Strong typing in IDE

• Needs compilation errors

Page 26: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

26

USING SPARK EFFICIENTLY

© 2017 Garren Staubli | Presentation resources: garrens.com/spark729

CACHINGCaching best practices

- When

- Fits in memory

- Computationally expensive

- Broadcast hint

- Eager vs Lazy (default) cache

Page 27: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

27

USING SPARK EFFICIENTLY

© 2017 Garren Staubli | Presentation resources: garrens.com/spark729

USING SQL

UDF Support • Spark • Hive

Create Permanent Hive table

Access Hive/Temporary Table

Page 28: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

28

USING SPARK EFFICIENTLY

© 2017 Garren Staubli | Presentation resources: garrens.com/spark729

https://i.ytimg.com/vi/p8MEiZGjETE/maxresdefault.jpghttps://en.wikipedia.org/wiki/Donald_Trumphttp://why-not-learn-something.blogspot.com/2016/07/apache-spark-rdd-vs-dataframe-vs-dataset.htmlhttp://sueduff.com/wp-content/uploads/2014/04/bigstock-Magic-Book-9990005.jpg

CREDITS

Images

Databricks: Datasets: https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.htmlCatalyst: https://www.slideshare.net/databricks/a-deep-dive-into-spark-sqls-catalyst-optimizer-with-yin-huaihttps://de.slideshare.net/SparkSummit/deep-dive-into-catalyst-apache-spark-20s-optimizer-63071120Tungsten: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.htmlMatrix: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.htmlGeneral Reference https://jaceklaskowski.gitbooks.io/mastering-apache-spark/https://spark.apache.org/docs/latest/sql-programming-guide.html

Page 29: USING SPARK EFFICIENTLY - garrens.com€¦ · 29/07/2017  · org.apache.spark.sql.Dataset[T] USING SPARK EFFICIENTLY Optimized [Storage] => Tungsten DATASETS O(PIES) Tungsten Objectives:

29

USING SPARK EFFICIENTLY

© 2017 Garren Staubli | Presentation resources: garrens.com/spark729

https://seesouthernstars.files.wordpress.com/2015/06/porkey.jpg