Apache Spark: Sreeram Nudurupati, May 2015
What is Spark?
A distributed computing platform designed to be
Fast
Fast to develop distributed applications
Fast to run distributed applications
General Purpose
A single framework to handle a variety of workloads
Batch, interactive, iterative, streaming, SQL
Spark Architecture
How to run Spark?
Local
Not really distributed computing
Cluster
Standalone Scheduler + Shared File System
YARN
Mesos
Amazon EC2 + S3
Google Compute Engine + Mesosphere
Databricks Cloud
Spark Cluster
[Figure: Spark cluster; cluster manager options: Mesos, YARN, Standalone]
Spark Basics
RDD - Resilient Distributed Datasets
Spark’s primary abstraction
A distributed collection of items called elements
Can be created from a variety of sources
Immutable
RDD Visualized
[Figure: RDD 1, RDD 2, and RDD 3, each split into numbered partitions spread across Node 1 to Node 4]
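The idea in the figure, a collection split into partitions that live on different nodes, can be imitated in plain Python. This is only a toy sketch and not Spark's API: the helpers `split_into_partitions` and `assign_to_nodes`, the round-robin placement, and the sample data are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_partitions(elements: Sequence[T], num_partitions: int) -> List[List[T]]:
    """Split elements into contiguous, near-equal chunks (hypothetical helper)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    size, extra = divmod(len(elements), num_partitions)
    partitions: List[List[T]] = []
    start = 0
    for i in range(num_partitions):
        # The first `extra` partitions each take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(list(elements[start:end]))
        start = end
    return partitions


def assign_to_nodes(partitions: List[List[T]], nodes: List[str]) -> Dict[str, List[List[T]]]:
    """Place partitions on nodes round-robin (an assumed policy, not Spark's scheduler)."""
    placement: Dict[str, List[List[T]]] = {node: [] for node in nodes}
    for index, partition in enumerate(partitions):
        placement[nodes[index % len(nodes)]].append(partition)
    return placement


partitions = split_into_partitions(list(range(10)), 3)
placement = assign_to_nodes(partitions, ["Node 1", "Node 2"])
print(partitions)
print(placement)
```

Each node holds only some of the partitions, so work on one dataset can proceed on several nodes at once.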
RDD Operations
Transformations
Operate on an RDD and return a new RDD
Are lazily evaluated
Actions
Return a value after running a computation on an RDD
Lazy Evaluation
Evaluation happens only when an action is called
Deferring decisions for better runtime optimization
[Figure: Transformation 1 (map) and Transformation 2 (filter) lead to an Action (collect), which returns data back to the Driver]
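Lazy evaluation can be illustrated without Spark. The sketch below is a toy stand-in, not the RDD implementation: the class `LazyDataset` is hypothetical, and it only borrows the names map, filter, and collect from the slide. Transformations merely record a step and return a new object; nothing runs until the action is called.

```python
from typing import Any, Callable, Iterable, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source: Iterable[Any], steps: Tuple[Step, ...] = ()) -> None:
        self._source = list(source)
        self._steps = tuple(steps)

    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate: Callable[[Any], bool]) -> "LazyDataset":
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self) -> List[Any]:
        # Action: replay every recorded step and return the results.
        data: Iterable[Any] = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls: List[int] = []


def square(x: int) -> int:
    calls.append(x)  # record each invocation to show when work happens
    return x * x


numbers = LazyDataset([1, 2, 3, 4])
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # still zero: no action has run
result = even_squares.collect()   # the action triggers the computation
print(calls_before_action, result)
```

Because each transformation returns a new object, the original `numbers` dataset is left unchanged, which mirrors the immutability of RDDs.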
DataFrames
Extension of RDD API and a Spark SQL abstraction
Distributed collection of data with named columns
Equivalent to RDBMS tables or data frames in R/Pandas
Can be built from a variety of structured data sources
Hive tables, JSON, Databases, RDDs etc.
Why DataFrame?
Many data formats are structured
Schema-on-read
Data has inherent structure, and a schema is needed to make sense of it
RDD programming with structured data is not intuitive
SchemaRDD = RDD(ROW) + Schema
Write SQLs
Use Domain Specific Language (DSL)
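The point that rows plus a schema are easier to work with than bare rows can be shown in plain Python. This is a rough analogy only, not Spark code: the sample records, the column names `name` and `age`, and the use of `namedtuple` as a stand-in for a Row with a schema are assumptions for illustration.

```python
from collections import namedtuple

# Rows without a schema: fields can only be reached by position.
rows = [("Alice", 34), ("Bob", 19), ("Carol", 45)]
adults_by_position = [r[0] for r in rows if r[1] >= 21]

# Rows plus a schema: the same data, now with named columns.
schema = ["name", "age"]
Row = namedtuple("Row", schema)
table = [Row(*r) for r in rows]
adults_by_name = [r.name for r in table if r.age >= 21]

print(adults_by_position)
print(adults_by_name)
```

Both queries return the same names, but the second states its intent through column names rather than index positions.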
RDD vs DataFrame
DataFrame
Inbuilt support for a variety of data formats
A more feature rich DSL
Memory management with Java objects is challenging
GC-free managed memory planned for the future
Execution optimized by Catalyst
JVM bytecode generated for any/all APIs
DataFrame Ops
printSchema: prints the schema
show(N): shows N rows
join: joins two DFs
apply: returns the selected column
select: returns a new DF with the selected columns
selectExpr: selects using SQL expressions
filter: same as where
groupBy: groups using the specified columns
SaveAs(JSON/Parquet/Table)
saveAsTable: saves to a Hive table
createJDBCTable: saves to a JDBC database
SQLContext Ops
parquetFile: loads a Parquet file into a DF
jsonFile: loads a JSON file into a DF
load: creates a DF from a source file
createExternalTable: creates a Hive external table
jdbc: loads a table from a JDBC database into a DF
sql: executes a SQL query
table: returns the specified table as a DF
cacheTable: caches the table in memory
Demo
What Next?
Spark Community: spark.apache.org/community.html
Worldwide Events: goo.gl/2YqJZK
Video, presentation archives: spark-summit.org
Dev resources: databricks.com/spark/developer-resources
Workshops: databricks.com/services/spark-training
Books: Learning Spark, Advanced Analytics with Apache Spark
Github: https://github.com/snudurupati/spark_training