Anatomy of Data Frame API: A deep dive into the Spark Data Frame API
Transcript of Anatomy of Data Frame API: A deep dive into the Spark Data Frame API
Anatomy of Data Frame API
A deep dive into the Spark Data Frame API
https://github.com/phatak-dev/anatomy_of_spark_dataframe_api
● Madhukara Phatak
● Big data consultant and trainer at datamantra.io
● Consults on Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● Spark SQL library
● DataFrame abstraction
● Pig/Hive pipeline vs Spark SQL
● Logical plan
● Optimizer
● Different steps in query analysis
Spark SQL library
● Data source API: universal API for loading/saving structured data
● DataFrame API: higher-level representation for structured data
● SQL interpreter and optimizer: express data transformations in SQL
● SQL service: Hive thrift server
Architecture of Spark SQL (layers, bottom to top):
● CSV, JSON, JDBC (sources)
● Data Source API
● DataFrame API
● Spark SQL and HQL / DataFrame DSL
DataFrame API
● Single abstraction for representing structured data in Spark
● DataFrame = RDD + Schema (aka SchemaRDD)
● All data source APIs return a DataFrame
● Introduced in Spark 1.3
● Inspired by data frames in R and Python (pandas)
● .rdd converts a DataFrame to its RDD representation, an RDD[Row]
● Support for the DataFrame DSL in Spark
Need for a new abstraction
● Single abstraction for structured data
○ Ability to combine data from multiple sources
○ Uniform access from all the different language APIs
○ Ability to support multiple DSLs
● Familiar interface for data scientists
○ Same API as R/pandas
○ Easy to convert an R local data frame to a Spark data frame
○ The new SparkR in 1.4 is built around it
Data structure of the structured world
● A DataFrame is a data structure for representing structured data, whereas an RDD is a data structure for unstructured data
● Having a single data structure allows us to build multiple DSLs targeting different developers
● All DSLs use the same optimizer and code generator underneath
● Compare with Hadoop Pig and Hive
Pig and Hive pipeline

Hive: Hive queries (HiveQL) → Hive parser → Logical Plan → Optimizer → Optimized Logical Plan (M/R plan) → Executor → Physical Plan

Pig: Pig Latin script → Pig parser → Logical Plan → Optimizer → Optimized Logical Plan (M/R plan) → Executor → Physical Plan
Issues with the Pig and Hive flow
● Pig and Hive share many similar steps but are independent of each other
● Each project implements its own optimizer and executor, which prevents them from benefiting from each other's work
● There is no common data structure on which both the Pig and Hive dialects could be built
● The optimizer is not flexible enough to accommodate multiple DSLs
● Lots of duplicated effort and poor interoperability
Spark SQL pipeline

● Hive queries (HiveQL) → Hive parser → DataFrame
● Spark SQL queries (SparkQL) → SparkSQL parser → DataFrame
● DataFrame DSL → DataFrame
● DataFrame → Catalyst → Spark RDD code
Spark SQL flow
● Multiple DSLs share the same optimizer and executor
● All DSLs ultimately generate DataFrames
● Catalyst is a new optimizer, a rule-based framework built from the ground up for Spark
● Catalyst allows developers to plug in custom rules specific to their DSL
● You can plug in your own DSL too!
What is a DataFrame?
● A DataFrame is a container for a logical plan
● A logical plan is a tree which represents the data and the schema
● Every transformation is represented as a tree manipulation
● These trees are manipulated and optimized by Catalyst rules
● The logical plan is converted to a physical plan for execution
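The "DataFrame as a container for a plan tree" idea can be sketched with plain Scala case classes. This is a hypothetical mini-model for illustration, not Spark's actual Catalyst classes:

```scala
// Hypothetical mini-model of a logical plan tree (not Spark's real Catalyst classes).
sealed trait Plan
case class Relation(name: String) extends Plan                  // leaf: the data source
case class Filter(condition: String, child: Plan) extends Plan  // a transformation node
case class Project(columns: Seq[String], child: Plan) extends Plan

// df.filter("c1 != 0").select("c2") would conceptually build this tree:
val plan: Plan = Project(Seq("c2"), Filter("c1 != 0", Relation("sales")))

// Catalyst rules are tree rewrites: walk the tree bottom-up and apply a
// partial function wherever it matches.
def transformUp(p: Plan)(rule: PartialFunction[Plan, Plan]): Plan = {
  val rewrittenChildren = p match {
    case Filter(c, child)   => Filter(c, transformUp(child)(rule))
    case Project(cs, child) => Project(cs, transformUp(child)(rule))
    case leaf               => leaf
  }
  rule.applyOrElse(rewrittenChildren, identity[Plan])
}
```

transformUp mirrors the shape of Catalyst's transform methods; every optimization in the following slides is such a rewrite.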
Explain command
● The explain command on a DataFrame allows us to look at these plans
● There are three types of logical plans:
○ Parsed logical plan
○ Analyzed logical plan
○ Optimized logical plan
● Explain also shows the physical plan
● Example: DataFrameExample.scala
Filter example
● In the last example all the plans looked the same, as there were no DataFrame operations
● In this example we are going to apply two filters on the DataFrame
● Observe the generated optimized plan
● Example: FilterExampleTree.scala
Optimized plan
● The optimized plan is where Spark plugs in its set of optimization rules
● In our example, when multiple filters are added, Spark combines them with && for better performance
● Developers can plug their own rules into the optimizer too
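The && combination of stacked filters can be sketched as a standalone rule over a toy plan tree (illustrative case classes, not Spark's actual CombineFilters implementation):

```scala
// Toy plan nodes; names are illustrative, not Spark's actual classes.
sealed trait Plan
case class Relation(name: String) extends Plan
case class Filter(condition: String, child: Plan) extends Plan

// Two stacked Filters become one combined with &&, so the data
// is scanned once instead of twice.
def combineFilters(p: Plan): Plan = p match {
  case Filter(outer, Filter(inner, child)) =>
    combineFilters(Filter(s"($outer) && ($inner)", child))
  case Filter(c, child) => Filter(c, combineFilters(child))
  case other            => other
}

val twoFilters = Filter("c1 != 0", Filter("c2 != 0", Relation("sales")))
val merged     = combineFilters(twoFilters)
// merged is a single Filter("(c1 != 0) && (c2 != 0)") over the relation
```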
Accessing plan trees
● Every DataFrame carries a queryExecution object which allows us to access these plans individually
● We can access the plans as follows:
○ Parsed plan: queryExecution.logical
○ Analyzed plan: queryExecution.analyzed
○ Optimized plan: queryExecution.optimizedPlan
● numberedTreeString on a plan lets us see the tree hierarchy
● Example: FilterExampleTree.scala
Filter tree representation

Before optimization (two separate filters):
00 Filter NOT (CAST(c2#1, DoubleType) = CAST(0, DoubleType))
01 Filter NOT (CAST(c1#0, DoubleType) = CAST(0, DoubleType))
02 LogicalRDD [c1#0,c2#1,c3#2,c4#3]

After optimization (filters merged):
Filter (NOT (CAST(c1#0, DoubleType) = 0.0) && NOT (CAST(c2#1, DoubleType) = 0.0))
LogicalRDD [c1#0,c2#1,c3#2,c4#3]
Manipulating trees
● Every optimization in Spark SQL is implemented as a logical tree transformation
● A series of these transformations makes the optimizer modular
● All tree manipulations are done using Scala case classes
● As developers, we can write these manipulations too
● Let's create an OR filter rather than an AND
● Example: OrFilter.scala
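Because the nodes are case classes, a custom rule like the talk's OrFilter.scala is just a pattern match. A sketch with hypothetical expression classes (not the actual code in OrFilter.scala):

```scala
// Hypothetical expression tree; Spark's real classes live in catalyst.expressions.
sealed trait Expr
case class Col(name: String) extends Expr
case class Lit(value: Double) extends Expr
case class Eq(left: Expr, right: Expr) extends Expr
case class And(left: Expr, right: Expr) extends Expr
case class Or(left: Expr, right: Expr) extends Expr

// The custom rule: rewrite every AND into an OR, recursing into children.
def andToOr(e: Expr): Expr = e match {
  case And(l, r) => Or(andToOr(l), andToOr(r))
  case Eq(l, r)  => Eq(andToOr(l), andToOr(r))
  case other     => other
}

val conjunction = And(Eq(Col("c1"), Lit(0.0)), Eq(Col("c2"), Lit(0.0)))
val rewritten   = andToOr(conjunction)
// rewritten == Or(Eq(Col("c1"), Lit(0.0)), Eq(Col("c2"), Lit(0.0)))
```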
Understanding the steps in a plan
● A logical plan goes through a series of rules to resolve and optimize the plan
● Each step is a tree manipulation like the ones we have seen before
● We can apply the rules one by one to see how a given plan evolves over time
● This understanding helps us tweak a given query for better performance
● Example: StepsInQueryPlanning.scala
Query

select a.customerId from
  (select customerId, amountPaid as amount from sales where 1 = '1') a
where amount = 500.0
Parsed plan
● This is the plan generated after parsing the DSL
● Normally these plans are generated by the language-specific parsers, like the HiveQL parser, the DataFrame DSL parser etc.
● They typically recognize the different transformations and represent them as tree nodes
● It's a straightforward translation without much tweaking
● This plan is fed to the analyzer to generate the analyzed plan
Parsed Logical Plan

'Project a.customerId
  'Filter (amount = 500)
    'SubQuery a
      'Projection 'customerId, 'amountPaid
        'Filter (1 = 1)
          UnresolvedRelation Sales
Analyzed plan
● We use sqlContext.analyzer to access the rules that generate the analyzed plan
● These rules have to be run in sequence to resolve the different entities in the logical plan
● The entities to be resolved are:
○ Relations (aka tables)
○ References, e.g. subqueries, aliases etc.
○ Data type casting
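The "run rules in sequence" behaviour can be sketched generically: a batch is a list of plan-to-plan functions applied in order, optionally repeated until the plan stops changing. This is a simplified take on the idea behind Catalyst's rule executor; all names here are illustrative:

```scala
// Illustrative sketch of running analysis rules in sequence.
sealed trait Plan
case class Node(label: String, children: List[Plan] = Nil) extends Plan

type Rule = Plan => Plan

// Apply each rule once, in order.
def runBatch(rules: List[Rule])(plan: Plan): Plan =
  rules.foldLeft(plan)((p, rule) => rule(p))

// Repeat the batch until the plan reaches a fixed point (or an iteration cap).
def toFixedPoint(rules: List[Rule], maxIters: Int = 100)(plan: Plan): Plan = {
  val next = runBatch(rules)(plan)
  if (next == plan || maxIters <= 0) next
  else toFixedPoint(rules, maxIters - 1)(next)
}

// Example: one rule "resolves" a label, a second normalizes its case.
val resolve: Rule = { case Node(l, cs) => Node(l.replace("unresolved:", ""), cs) }
val lower:   Rule = { case Node(l, cs) => Node(l.toLowerCase, cs) }
val resolved = toFixedPoint(List(resolve, lower))(Node("unresolved:SALES"))
// resolved == Node("sales")
```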
ResolveRelations rule
● This rule resolves all the relations (tables) specified in the plan
● Whenever it finds a new unresolved relation, it consults the catalog (the registerTempTable list)
● Once it finds the relation, it replaces the unresolved node with the actual relation
Resolved Relation Logical Plan

Before:
'Project a.customerId
  'Filter (amount = 500)
    'SubQuery a
      'Projection 'customerId, 'amountPaid
        'Filter (1 = 1)
          UnresolvedRelation Sales

After ResolveRelations:
'Project a.customerId
  'Filter (amount = 500)
    'SubQuery a
      'Projection 'customerId, 'amountPaid
        'Filter (1 = 1)
          SubQuery sales
            JsonRelation Sales[amountPaid..]
ResolveReferences
● This rule resolves all the references in the plan
● All aliases and column names get a unique number, which allows the analyzer to locate them irrespective of their position
● This unique numbering is what later allows subqueries to be removed for better optimization
Resolved References Plan

Before:
'Project a.customerId
  'Filter (amount = 500)
    'SubQuery a
      'Projection 'customerId, 'amountPaid
        'Filter (1 = 1)
          SubQuery sales
            JsonRelation Sales[amountPaid..]

After ResolveReferences:
Project customerId#1L
  Filter (amount#4 = 500)
    SubQuery a
      Projection customerId#1L, amountPaid#0
        'Filter (1 = 1)
          SubQuery sales
            JsonRelation Sales[amountPaid#0..]
PromoteStrings
● This rule allows the analyzer to promote strings to the right data types
● In our query's Filter(1 = '1') we are comparing a number with a string
● This rule inserts a cast from string to double so the comparison has the right semantics
Promote String Plan

Before:
Project customerId#1L
  Filter (amount#4 = 500)
    SubQuery a
      Projection customerId#1L, amountPaid#0
        'Filter (1 = 1)
          SubQuery sales
            JsonRelation Sales[amountPaid#0..]

After PromoteStrings:
Project customerId#1L
  Filter (amount#4 = 500)
    SubQuery a
      Projection customerId#1L, amountPaid#0
        'Filter (1 = CAST(1, DoubleType))
          SubQuery sales
            JsonRelation Sales[amountPaid#0..]
Optimize
Eliminate Subqueries
● This rule allows the optimizer to eliminate superfluous subqueries
● This is possible because we have a unique identifier for each of the references
● Removing subqueries allows more advanced optimizations in subsequent steps
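The rule can be sketched as a tree rewrite that splices out alias nodes: once every column reference carries a unique id, the Subquery wrapper carries no information. Toy case classes, not Spark's actual EliminateSubQueries code:

```scala
// Toy plan nodes; illustrative only.
sealed trait Plan
case class Relation(name: String) extends Plan
case class Subquery(alias: String, child: Plan) extends Plan
case class Filter(condition: String, child: Plan) extends Plan
case class Project(columns: Seq[String], child: Plan) extends Plan

// Replace every Subquery node by its child, all the way down.
def eliminateSubqueries(p: Plan): Plan = p match {
  case Subquery(_, child) => eliminateSubqueries(child)
  case Filter(c, child)   => Filter(c, eliminateSubqueries(child))
  case Project(cs, child) => Project(cs, eliminateSubqueries(child))
  case leaf               => leaf
}

// Mirrors the slides: the aliases `a` and `sales` disappear.
val before = Project(Seq("customerId#1L"),
  Filter("amount#4 = 500",
    Subquery("a",
      Project(Seq("customerId#1L", "amountPaid#0"),
        Subquery("sales", Relation("JsonRelation Sales"))))))
val after = eliminateSubqueries(before)
```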
Eliminate Subqueries Plan

Before:
Project customerId#1L
  Filter (amount#4 = 500)
    SubQuery a
      Projection customerId#1L, amountPaid#0
        'Filter (1 = CAST(1, DoubleType))
          SubQuery sales
            JsonRelation Sales[amountPaid#0..]

After EliminateSubqueries:
Project customerId#1L
  Filter (amount#4 = 500)
    Projection customerId#1L, amountPaid#0
      'Filter (1 = CAST(1, DoubleType))
        JsonRelation Sales[amountPaid#0..]
Constant folding
● Simplifies expressions that evaluate to constant values
● In our plan, Filter(1 = 1) always evaluates to true
● So constant folding replaces the expression with true
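Constant folding is an expression-level rewrite: any subexpression built only from literals is evaluated at planning time, before a single row is read. A minimal sketch with hypothetical expression classes (Spark's real rule is ConstantFolding in Catalyst):

```scala
// Hypothetical expression nodes, for illustration only.
sealed trait Expr
case class Lit(value: Double) extends Expr
case class Col(name: String) extends Expr
case class Eq(left: Expr, right: Expr) extends Expr
case class BoolLit(value: Boolean) extends Expr

// Evaluate literal-only comparisons at planning time.
def foldConstants(e: Expr): Expr = e match {
  case Eq(Lit(a), Lit(b)) => BoolLit(a == b)
  case Eq(l, r)           => Eq(foldConstants(l), foldConstants(r))
  case other              => other
}

// The 1 = 1 predicate folds to the constant true.
val folded = foldConstants(Eq(Lit(1.0), Lit(1.0)))
// folded == BoolLit(true)
```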
Constant Folding Plan

Before:
Project customerId#1L
  Filter (amount#4 = 500)
    Projection customerId#1L, amountPaid#0
      'Filter (1 = CAST(1, DoubleType))
        JsonRelation Sales[amountPaid#0..]

After ConstantFolding:
Project customerId#1L
  Filter (amount#4 = 500)
    Projection customerId#1L, amountPaid#0
      Filter true
        JsonRelation Sales[amountPaid#0..]
Simplify Filters
● This rule simplifies filters by
○ Removing always-true filters
○ Removing the entire plan subtree if a filter is always false
● In our query, the Filter true node will be removed
● By simplifying filters, we avoid multiple iterations over the data
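Both behaviours can be sketched with toy nodes: an always-true filter is spliced out, and an always-false filter lets us discard the whole subtree and stand in an empty relation. Illustrative names, not Spark's SimplifyFilters code:

```scala
// Toy condition and plan nodes, for illustration only.
sealed trait Expr
case object TrueLit extends Expr
case object FalseLit extends Expr
case class Pred(text: String) extends Expr

sealed trait Plan
case class Relation(name: String) extends Plan
case object EmptyRelation extends Plan   // stands in for "no rows at all"
case class Filter(condition: Expr, child: Plan) extends Plan

def simplifyFilters(p: Plan): Plan = p match {
  case Filter(TrueLit, child) => simplifyFilters(child) // always true: drop the filter
  case Filter(FalseLit, _)    => EmptyRelation          // always false: drop the subtree
  case Filter(c, child)       => Filter(c, simplifyFilters(child))
  case other                  => other
}

// Mirrors the slides: Filter(true) vanishes, the real filter stays.
val plan = Filter(Pred("amount = 500"), Filter(TrueLit, Relation("sales")))
val simplified = simplifyFilters(plan)
```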
Simplify Filter Plan

Before:
Project customerId#1L
  Filter (amount#4 = 500)
    Projection customerId#1L, amountPaid#0
      Filter true
        JsonRelation Sales[amountPaid#0..]

After SimplifyFilters:
Project customerId#1L
  Filter (amount#4 = 500)
    Projection customerId#1L, amountPaid#0
      JsonRelation Sales[amountPaid#0..]
PushPredicateThroughFilter
● It's always good to have filters near the data source for better optimization
● This rule pushes the filters down, closer to the JsonRelation
● When we rearrange the tree nodes, we must rewrite the predicate to match the aliases
● In our example, the filter is rewritten to use the column amountPaid rather than the alias amount
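The alias-aware push-down can be sketched like this: the projection records output-name to input-name pairs, so when the filter moves below it, the predicate's column is renamed through the alias. A toy model under simplifying assumptions (single-column equality predicates), not Spark's actual rule:

```scala
// Toy plan nodes, for illustration only.
sealed trait Plan
case class Relation(name: String) extends Plan
// Predicate kept deliberately simple: column = value.
case class Filter(column: String, value: Double, child: Plan) extends Plan
// Each projected column is (outputName, inputName); aliases differ, plain columns repeat.
case class Project(columns: Seq[(String, String)], child: Plan) extends Plan

def pushPredicate(p: Plan): Plan = p match {
  case Filter(col, v, Project(cols, child)) =>
    // Rewrite the filter column through the alias before pushing it down.
    val input = cols.collectFirst { case (out, in) if out == col => in }.getOrElse(col)
    Project(cols, pushPredicate(Filter(input, v, child)))
  case Filter(c, v, child) => Filter(c, v, pushPredicate(child))
  case Project(cs, child)  => Project(cs, pushPredicate(child))
  case leaf                => leaf
}

// Filter(amount = 500) above `amountPaid as amount` ends up as
// Filter(amountPaid = 500) sitting directly on the relation.
val before = Filter("amount", 500.0,
  Project(Seq("customerId" -> "customerId", "amount" -> "amountPaid"),
    Relation("sales")))
val pushed = pushPredicate(before)
```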
PushPredicateThroughFilter Plan

Before:
Project customerId#1L
  Filter (amount#4 = 500)
    Projection customerId#1L, amountPaid#0
      JsonRelation Sales[amountPaid#0..]

After push-down:
Project customerId#1L
  Projection customerId#1L, amountPaid#0
    Filter (amountPaid#0 = 500)
      JsonRelation Sales[amountPaid#0..]
Project collapsing
● Removes unnecessary projections from the plan
● In our plan we don't need the second projection (customerId, amountPaid), as we only require one projection: customerId
● So we can get rid of the second projection
● This gives us the most optimized plan
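When the outer projection only narrows the inner one, the inner projection is redundant. A sketch over toy nodes, valid only for plain column references rather than computed expressions (not Spark's actual project-collapsing code):

```scala
// Toy plan nodes, for illustration only.
sealed trait Plan
case class Relation(name: String) extends Plan
case class Filter(condition: String, child: Plan) extends Plan
case class Project(columns: Seq[String], child: Plan) extends Plan

// Collapse Project-over-Project when the outer columns are a subset of the inner ones.
// Safe here because the columns are plain references, not computed expressions.
def collapseProjects(p: Plan): Plan = p match {
  case Project(outer, Project(inner, child)) if outer.forall(inner.contains) =>
    collapseProjects(Project(outer, child))
  case Project(cols, child) => Project(cols, collapseProjects(child))
  case Filter(c, child)     => Filter(c, collapseProjects(child))
  case other                => other
}

// Mirrors the slides: the (customerId, amountPaid) projection disappears.
val before = Project(Seq("customerId#1L"),
  Project(Seq("customerId#1L", "amountPaid#0"),
    Filter("amountPaid#0 = 500", Relation("sales"))))
val collapsed = collapseProjects(before)
```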
Project Collapsing Plan

Before:
Project customerId#1L
  Projection customerId#1L, amountPaid#0
    Filter (amountPaid#0 = 500)
      JsonRelation Sales[amountPaid#0..]

After project collapsing:
Project customerId#1L
  Filter (amountPaid#0 = 500)
    JsonRelation Sales[amountPaid#0..]
Generating the physical plan
● Catalyst takes a logical plan and turns it into a physical plan, aka a Spark plan
● On queryExecution we have executedPlan, which gives us the physical plan
● On the physical plan we can call executeCollect or executeTake to start evaluating the plan
References
● https://www.youtube.com/watch?v=GQSNJAzxOr8
● https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
● http://spark.apache.org/sql/