Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Data Sources API
Transcript of Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Data Sources API
Advanced Apache Spark Meetup
Spark SQL + DataFrames + Catalyst + Data Sources API
Chris Fregly, Principal Data Solutions Engineer
IBM Spark Technology Center, Sept 21, 2015
Power of data. Simplicity of design. Speed of innovation.
Meetup Housekeeping
Announcements
- Patrick McFadin, Evangelist, DataStax
- Steve Beier, Boss Man, IBM Spark Tech Center
Who am I?
- Streaming Platform Engineer
- Streaming Data Engineer
- Data Solutions Engineer
- Principal Data Solutions Engineer, IBM Spark Technology Center
- Netflix Open Source Committer, Apache Contributor
- Not a Photographer or Model
Last Meetup (Spark Wins 100 TB Daytona GraySort)
On-disk only, in-memory caching disabled!
sortbenchmark.org/ApacheSpark2014.pdf
Meetup Metrics
- Total Spark Experts: ~1000 (+20%)
- Mean RSVPs per Meetup: ~300
- Mean Attendance: ~50-60% of RSVPs
- Donations: $0 (-100%). This is good! “Your money is no good here.” (Lloyd from The Shining <--- eek!)
Meetup Updates
- Talking with other Spark Meetup Groups: potential mergers and/or hostile takeovers!
- New Sponsors!!
- Looking for more South Bay/Peninsula hosts. Required: food, beer/soda/water, air conditioning. Optional: A/V recording and live stream.
- We’re trying out new PowerPoint animations. Please be patient!
Constructive Criticism from Previous Attendees
- “Chris, you’re like a fat version of an already-fat Erlich from Silicon Valley - except not funny.”
- “Chris, your voice is so annoying that it actually woke me from the sleep induced by your boring content.”
Freg-a-palooza Upcoming World Tour
① New York Strata (Sept 29th – Oct 1st)
② London Spark Meetup (Oct 12th)
③ Scotland Data Science Meetup (Oct 13th)
④ Dublin Spark Meetup (Oct 15th)
⑤ Barcelona Spark Meetup (Oct 20th)
⑥ Madrid Spark Meetup (Oct 22nd)
⑦ Amsterdam Spark Summit (Oct 27th – Oct 29th)
⑧ Delft Dutch Data Science Meetup (Oct 29th)
⑨ Brussels Spark Meetup (Oct 30th)
⑩ Zurich Big Data Developers Meetup (Nov 2nd)
High probability I’ll end up in jail.
Topics of this Talk
① DataFrames
② Catalyst Optimizer and Query Plans
③ Data Sources API
④ Creating and Contributing a Custom Data Source
⑤ Partitions, Pruning, Pushdowns
⑥ Native + Third-Party Data Source Impls
⑦ Spark SQL Performance Tuning
DataFrames
- Inspired by R and Pandas DataFrames
- Cross-language support: SQL, Python, Scala, Java, R
- Levels the performance of Python, Scala, Java, and R: generates JVM bytecode instead of serializing/pickling objects to Python
- A DataFrame is a container for a logical plan: transformations are lazy and represented as a tree
- Catalyst Optimizer creates the physical plan
- DataFrame.rdd returns the underlying RDD if needed
- Custom UDFs using registerFunction()
- New, experimental UDAF support
Use DataFrames instead of RDDs!!
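A minimal sketch of these ideas against the Spark 1.5-era Scala API. The ratings path and the UserID/Rating column names are assumptions based on the demo dataset described later in this deck.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("dataframes-sketch"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._  // enables the $"column" syntax

// A DataFrame is a container for a logical plan: select/filter below only
// build the plan tree; nothing executes until an action (show, count, ...) runs.
val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2")
val topRatings = ratingsDF.select($"UserID", $"Rating").filter($"Rating" > 8)

// Register a custom UDF by name for use from SQL
// (registerFunction() is the Python spelling; udf.register is the Scala one).
sqlContext.udf.register("isTop", (rating: Int) => rating > 8)

topRatings.show()         // action: Catalyst now builds and runs the physical plan
val rdd = topRatings.rdd  // drop down to the underlying RDD only when needed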
Catalyst Optimizer
Converts the logical plan to a physical plan; manipulates and optimizes the DataFrame transformation tree:
- Subquery elimination: use aliases to collapse subqueries
- Constant folding: replace expressions with constants
- Simplify filters: remove unnecessary filters
- Predicate/filter pushdowns: avoid unnecessary data loads
- Projection collapsing: avoid unnecessary projections
Hooks for custom rules:
- Rules = Scala case classes
- Implement o.a.s.sql.catalyst.rules.Rule
- Apply to any plan stage
val newPlan = MyFilterRule(analyzedPlan)
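A sketch of such a hook: a hypothetical rule (named to match the slide’s example call) that removes Filter nodes whose condition has already been folded to the literal true. Written as a Scala object for brevity; a case class works the same way.

import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: drop Filter nodes whose condition folded to literal true.
object MyFilterRule extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(condition, child) if condition == Literal(true) => child
  }
}

// Apply to any plan stage, e.g. the analyzed plan:
// val newPlan = MyFilterRule(df.queryExecution.analyzed)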
Plan Debugging
gendersCsvDF.select($"id", $"gender")
  .filter("gender != 'F'")
  .filter("gender != 'M'")
  .explain(true)
Requires explain(true).
- DataFrame.queryExecution.logical
- DataFrame.queryExecution.analyzed
- DataFrame.queryExecution.optimizedPlan
- DataFrame.queryExecution.executedPlan
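For example, building on the gendersCsvDF query above, each stage can also be printed directly (a sketch, assuming the same sqlContext and implicits are in scope):

val qe = gendersCsvDF.select($"id", $"gender")
                     .filter("gender != 'F'")
                     .filter("gender != 'M'")
                     .queryExecution

println(qe.logical)        // plan as built from the DataFrame calls
println(qe.analyzed)       // after resolution against the catalog
println(qe.optimizedPlan)  // after Catalyst's optimization rules
println(qe.executedPlan)   // physical plan that actually runs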
Plan Visualization & Join/Aggregation Metrics (New in Spark 1.5!)
[Screenshot of the Spark UI plan visualization, annotated with: effectiveness of the filter, where cost-based optimization is applied, and peak memory for joins and aggregations. The optimized CPU-cache-aware binary format minimizes GC and improves join performance (Project Tungsten).]
Data Sources API
Execution (o.a.s.sql.execution.commands.scala)
- RunnableCommand (trait/interface)
- ExplainCommand (impl: case class)
- CacheTableCommand (impl: case class)
Relations (o.a.s.sql.sources.interfaces.scala)
- BaseRelation (abstract class)
- TableScan (impl: returns all rows)
- PrunedFilteredScan (impl: column pruning and predicate pushdown)
- InsertableRelation (impl: insert or overwrite data using SaveMode)
Filters (o.a.s.sql.sources.filters.scala)
- Filter (abstract class for all filter pushdowns for this data source)
- EqualTo
- GreaterThan
- StringStartsWith
Creating a Custom Data Source
Study existing native and third-party data source impls:
Native: JDBC (o.a.s.sql.execution.datasources.jdbc)
  class JDBCRelation extends BaseRelation
    with PrunedFilteredScan with InsertableRelation
Third-Party: Cassandra (o.a.s.sql.cassandra)
  class CassandraSourceRelation extends BaseRelation
    with PrunedFilteredScan with InsertableRelation
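A minimal end-to-end sketch of the simplest possible custom source, using hypothetical package and class names and the TableScan contract only (no pruning or pushdown yet):

package com.example.genders  // hypothetical package

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Spark instantiates the class named DefaultSource in the package
// passed to .format("com.example.genders").
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new GendersRelation(sqlContext)
}

// TableScan is the simplest contract: return every row of every column.
class GendersRelation(override val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  override def schema: StructType = StructType(Seq(
    StructField("id", StringType),
    StructField("gender", StringType)))

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row("1", "M"), Row("2", "F")))
}

Usage: sqlContext.read.format("com.example.genders").load()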
Contributing a Custom Data Source
spark-packages.org
- Managed by Databricks
- Contains links to externally-managed GitHub projects
- Ratings and comments
- Spark version requirements of each package
Examples:
- https://github.com/databricks/spark-csv
- https://github.com/databricks/spark-avro
- https://github.com/databricks/spark-redshift
Partitions, Pruning, Pushdowns
Demo Dataset (from previous Spark After Dark talks)
RATINGS: UserID, ProfileID, Rating (1-10)
GENDERS: UserID, Gender (M, F, U)
<-- Totally Anonymous
Partitions
Partition based on data usage patterns:
/root/gender=M/…
     /gender=F/…   <-- Use case: access users by gender
     /gender=U/…
Partition Discovery
- On read, infer partitions from the organization of the data (i.e. gender=F)
Dynamic Partitions
- Upon insert, dynamically create partitions
- Specify the field to use for each partition (i.e. gender)
- SQL: INSERT TABLE genders PARTITION (gender) SELECT …
- DF: gendersDF.write.format("parquet").partitionBy("gender").save(…)
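For instance, reading the partitioned output back triggers partition discovery (a sketch; the path is assumed to match the Parquet slide later in this deck):

val df = sqlContext.read.format("parquet")
  .load("file:/root/pipeline/datasets/dating/genders.parquet")

// "gender" shows up as a regular column, reconstructed from the
// gender=M / gender=F / gender=U directory names.
df.printSchema()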
Pruning
Partition Pruning
- Filters out entire partitions of rows on partitioned data
- SELECT id, gender FROM genders WHERE gender = 'U'
Column Pruning
- Filters out entire columns for all rows if not required
- Extremely useful for columnar storage formats (Parquet, ORC)
- SELECT id, gender FROM genders
Pushdowns
Predicate (aka Filter) Pushdowns
- A predicate returns {true, false} for a given function/condition
- Filters rows as deep into the data source as possible
- The data source must implement PrunedFilteredScan, as sketched below
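A sketch of that contract, upgrading the hypothetical GendersRelation from the earlier TableScan sketch to PrunedFilteredScan. It filters an in-memory Seq for illustration; a real source would translate the pushed-down filters into its own query language.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class GendersRelation(override val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(Seq(
    StructField("id", StringType),
    StructField("gender", StringType)))

  private val data = Seq(Row("1", "M"), Row("2", "F"), Row("3", "U"))

  // Spark passes in only the columns the query needs (pruning) and the
  // predicates it would like evaluated at the source (pushdown).
  override def buildScan(
      requiredColumns: Array[String],
      filters: Array[Filter]): RDD[Row] = {
    val genderIdx = schema.fieldNames.indexOf("gender")
    val kept = data.filter { row =>
      filters.forall {
        case EqualTo("gender", value) => row.getString(genderIdx) == value
        case _                        => true  // Spark re-checks unhandled filters
      }
    }
    // Column pruning: emit only the columns the query actually requires.
    val indices = requiredColumns.map(schema.fieldNames.indexOf)
    sqlContext.sparkContext.parallelize(
      kept.map(row => Row.fromSeq(indices.map(row.get))))
  }
}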
Native Spark SQL Data Sources
Spark SQL Native Data Sources - Source Code
JSON Data Source
DataFrame:
val ratingsDF = sqlContext.read.format("json")
  .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
-- or, via the convenience method --
val ratingsDF = sqlContext.read
  .json("file:/root/pipeline/datasets/dating/ratings.json.bz2")
SQL:
CREATE TABLE genders USING json
OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2")
JDBC Data Source
Add the driver to the Spark JVM system classpath:
$ export SPARK_CLASSPATH=<jdbc-driver.jar>
DataFrame:
val jdbcConfig = Map(
  "driver"  -> "org.postgresql.Driver",
  "url"     -> "jdbc:postgresql:hostname:port/database",
  "dbtable" -> "schema.tablename")
val df = sqlContext.read.format("jdbc").options(jdbcConfig).load()
SQL:
CREATE TABLE genders USING jdbc
OPTIONS (url, dbtable, driver, …)
Parquet Data Source
Configuration:
- spark.sql.parquet.filterPushdown=true
- spark.sql.parquet.mergeSchema=true
- spark.sql.parquet.cacheMetadata=true
- spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
DataFrames:
val gendersDF = sqlContext.read.format("parquet")
  .load("file:/root/pipeline/datasets/dating/genders.parquet")
gendersDF.write.format("parquet").partitionBy("gender")
  .save("file:/root/pipeline/datasets/dating/genders.parquet")
SQL:
CREATE TABLE genders USING parquet
OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")
ORC Data Source
Configuration:
- spark.sql.orc.filterPushdown=true
DataFrames:
val gendersDF = sqlContext.read.format("orc")
  .load("file:/root/pipeline/datasets/dating/genders")
gendersDF.write.format("orc").partitionBy("gender")
  .save("file:/root/pipeline/datasets/dating/genders")
SQL:
CREATE TABLE genders USING orc
OPTIONS (path "file:/root/pipeline/datasets/dating/genders")
Third-Party Data Sources
spark-packages.org
CSV Data Source (Databricks)
GitHub: https://github.com/databricks/spark-csv
Maven: com.databricks:spark-csv_2.10:1.2.0
Code:
val gendersCsvDF = sqlContext.read
  .format("com.databricks.spark.csv")
  .load("file:/root/pipeline/datasets/dating/gender.csv.bz2")
  .toDF("id", "gender")  // toDF() defines the column names
Avro Data Source (Databricks)
GitHub: https://github.com/databricks/spark-avro
Maven: com.databricks:spark-avro_2.10:2.0.1
Code:
val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("file:/root/pipeline/datasets/dating/gender.avro")
Redshift Data Source (Databricks)
GitHub: https://github.com/databricks/spark-redshift
Maven: com.databricks:spark-redshift:0.5.0
Code:
val df: DataFrame = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<hostname>:<port>/<database>…")
  .option("query", "select x, count(*) from my_table group by x")
  .option("tempdir", "s3n://tmpdir")
  .load()
Copies to S3 for fast, parallel reads vs a single Redshift master bottleneck.
ElasticSearch Data Source (Elastic.co)
GitHub: https://github.com/elastic/elasticsearch-hadoop
Maven: org.elasticsearch:elasticsearch-spark_2.10:2.1.0
Code:
val esConfig = Map("pushdown" -> "true",
  "es.nodes" -> "<hostname>", "es.port" -> "<port>")
df.write.format("org.elasticsearch.spark.sql")
  .mode(SaveMode.Overwrite)
  .options(esConfig)
  .save("<index>/<document>")
Cassandra Data Source (DataStax)
GitHub: https://github.com/datastax/spark-cassandra-connector
Maven: com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
Code:
ratingsDF.write.format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append)
  .options(Map("keyspace" -> "dating", "table" -> "ratings"))
  .save()
REST Data Source (Databricks)
Coming Soon!
https://github.com/databricks/spark-rest?
Michael Armbrust, Spark SQL Lead @ Databricks
DynamoDB Data Source (IBM Spark Tech Center)
Coming Soon!
https://github.com/cfregly/spark-dynamodb
Spark SQL Performance Tuning (o.a.s.sql.SQLConf)
- spark.sql.inMemoryColumnarStorage.compressed=true: automatically selects a column codec based on the data
- spark.sql.inMemoryColumnarStorage.batchSize: increase as much as possible without OOMing; improves compression and GC
- spark.sql.inMemoryPartitionPruning=true: enables partition pruning for in-memory partitions
- spark.sql.tungsten.enabled=true: code gen for CPU and memory optimizations (Tungsten, aka Unsafe Mode)
- spark.sql.shuffle.partitions: increase from the default of 200 for large joins and aggregations
- spark.sql.autoBroadcastJoinThreshold: increase to tune this cost-based physical-plan optimization
- spark.sql.hive.metastorePartitionPruning: predicate pushdown into the metastore to prune partitions early
- spark.sql.planner.sortMergeJoin: prefer sort-merge join (vs. hash join) for large joins
- spark.sql.sources.partitionDiscovery.enabled & spark.sql.sources.parallelPartitionDiscovery.threshold: enable automatic partition discovery when loading data
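These are plain SQLConf settings, so they can be changed per context. A short sketch; the values are illustrative starting points, not recommendations from the talk:

// Programmatically:
sqlContext.setConf("spark.sql.shuffle.partitions", "400")
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

// Or equivalently via SQL:
sqlContext.sql("SET spark.sql.tungsten.enabled=true")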
Related Links
- https://github.com/datastax/spark-cassandra-connector
- http://blog.madhukaraphatak.com/anatomy-of-spark-dataframe-api/
- https://github.com/phatek-dev/anatomy_of_spark_dataframe_api/
- https://databricks.com/blog/…
Upcoming Advanced Apache Spark Meetups
- Nov 12th, 2015: Project Tungsten Data Structs & Algos for CPU & Memory Optimization
- Jan 14th, 2016: Text-based Advanced Analytics and Machine Learning
- Feb 16th, 2016: ElasticSearch-Spark Connector w/ Costin Leau (Elastic.co) & Me
- Mar 24th, 2016: Spark Internals Deep Dive
- Apr 21st, 2016: Spark SQL Catalyst Optimizer Deep Dive
Special Thanks to DataStax!!
IBM Spark Tech Center is Hiring!
Only Fun, Collaborative People - No Erlichs!
Sign up for our newsletter at
Thank You!
Power of data. Simplicity of design. Speed of innovation.
IBM Spark