Apache Spark - Meetupfiles.meetup.com/18532292/Apache Spark.pdf · What is Spark? A distributed...

Apache Spark: Sreeram Nudurupati, May 2015

What is Spark?

A distributed computing platform designed to be

Fast

Fast to develop distributed applications

Fast to run distributed applications

General Purpose

A single framework to handle a variety of workloads

Batch, interactive, iterative, streaming, SQL

Spark Architecture

How to run Spark?

Local

Not really distributed computing

Cluster

Standalone Scheduler + Shared File System

YARN

Mesos

Amazon EC2 + S3

Google Compute Engine + Mesosphere

Databricks Cloud

Spark Cluster

• Mesos

• YARN

• Standalone

Spark Basics

RDD - Resilient Distributed Datasets

Spark’s primary abstraction

A distributed collection of items called elements

Can be created from a variety of sources

Immutable

RDD Visualized

[Figure: three RDDs (RDD 1, RDD 2, RDD 3), each split into numbered partitions that are spread across Node 1 through Node 4.]
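The idea of one dataset split into partitions that live on different nodes can be imitated in plain Python. The sketch below is a toy model and not Spark's API: `split_into_partitions` and `assign_to_nodes` are hypothetical helpers, the node names are made up, and round-robin placement is an assumption for illustration only (in a real cluster Spark decides where partitions go).

```python
def split_into_partitions(elements, num_partitions):
    """Split a list into num_partitions contiguous, nearly equal slices."""
    n = len(elements)
    bounds = [(i * n) // num_partitions for i in range(num_partitions + 1)]
    return [elements[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]


def assign_to_nodes(partitions, nodes):
    """Place partitions on nodes round-robin (an illustrative assumption)."""
    placement = {node: [] for node in nodes}
    for index, partition in enumerate(partitions):
        placement[nodes[index % len(nodes)]].append((index, partition))
    return placement


# Ten elements, four partitions, three hypothetical nodes.
partitions = split_into_partitions(list(range(10)), 4)
placement = assign_to_nodes(partitions, ["node1", "node2", "node3"])

for node, parts in placement.items():
    print(node, parts)
```

With more partitions than nodes, some node ends up holding more than one partition of the same dataset, which is the situation the slide's diagram depicts.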

RDD Operations

Transformations

Operate on an RDD and return a new RDD

Are lazily evaluated

Actions

Return a value after running a computation on an RDD

Lazy Evaluation

Evaluation happens only when an action is called

Deferring decisions for better runtime optimization

[Figure: Transformation 1 (map), then Transformation 2 (filter), then an Action (collect) that sends data back to the Driver.]
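The map, filter, collect flow can be mimicked with a small pure-Python class to show what lazy evaluation means in practice. This is a toy, single-machine sketch and not Spark's implementation: `ToyRDD` is a hypothetical name, and the `calls` list exists only to make visible when the work actually happens.

```python
class ToyRDD:
    """A toy stand-in for an RDD: immutable, with lazily run transformations."""

    def __init__(self, source, steps=()):
        self._source = source        # the original data, never modified
        self._steps = tuple(steps)   # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: returns a NEW ToyRDD with one more recorded step.
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also only recorded, nothing is computed here.
        return ToyRDD(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: only now is the recorded pipeline executed.
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []  # records every element the mapping function is applied to


def double(x):
    calls.append(x)
    return x * 2


numbers = ToyRDD([1, 2, 3, 4])
pipeline = numbers.map(double).filter(lambda x: x > 4)

calls_before_action = list(calls)  # nothing has run yet
result = pipeline.collect()        # the action triggers evaluation

print(calls_before_action, result, calls)
```

Building `pipeline` leaves `numbers` untouched and runs no user code; `double` is invoked only once `collect` is called. Real Spark uses the same deferral to plan the whole job before executing it.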

DataFrames

Extension of RDD API and a Spark SQL abstraction

Distributed collection of data with named columns

Equivalent to RDBMS tables or data frames in R/Pandas

Can be built from a variety of structured data sources

Hive tables, JSON, Databases, RDDs etc.

Why DataFrame?

A lot of data formats are structured

Schema-on-read

Data has inherent structure, and a schema is needed to make sense of it

RDD programming with structured data is not intuitive

SchemaRDD = RDD(ROW) + Schema

Write SQL queries

Use Domain Specific Language (DSL)
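Why "rows plus a schema" is easier to work with than bare rows can be shown with a small pure-Python sketch. It is a toy model and not Spark's DataFrame API: the `select` and `where` helpers, the column names, and the sample data are all hypothetical.

```python
# Rows alone: the meaning of each position lives only in the programmer's head.
rows = [("Ann", 34), ("Bob", 19), ("Cy", 52)]

# Rows + schema: the column names travel with the data.
schema = ["name", "age"]


def where(rows, schema, column, predicate):
    """Keep the rows whose named column satisfies the predicate."""
    i = schema.index(column)
    return [row for row in rows if predicate(row[i])], schema


def select(rows, schema, columns):
    """Project the rows onto the named columns, returning rows and new schema."""
    idx = [schema.index(c) for c in columns]
    return [tuple(row[i] for i in idx) for row in rows], list(columns)


# Without a schema: positional access, easy to get wrong.
adult_names_positional = [row[0] for row in rows if row[1] >= 21]

# With a schema: a small DSL that refers to columns by name.
adults, adults_schema = where(rows, schema, "age", lambda age: age >= 21)
adult_names, names_schema = select(adults, adults_schema, ["name"])

print(adult_names_positional, adult_names, names_schema)
```

Both versions pick the same people, but the second reads much like the SQL or DSL query the slide refers to, and it keeps working if the column order changes.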

RDD vs DataFrame

DataFrame

Inbuilt support for a variety of data formats

A more feature rich DSL

Memory management with Java objects is challenging

GC-free managed memory planned for the future

Execution optimized by Catalyst

JVM bytecode generated for any/all APIs

RDD vs DataFrame

DataFrame Ops

printSchema: prints the schema

show(N): shows N rows

join: joins two DFs

apply: returns the selected column

select: returns a new DF with the selected columns

selectExpr: selects using SQL expressions

filter: same as where

groupBy: groups using the specified columns

SaveAs(JSON/Parquet/Table)

saveAsTable: saves to a Hive table

createJDBCTable: saves to a JDBC database

SQLContext Ops

parquetFile: loads a Parquet file into a DF

jsonFile: loads a JSON file into a DF

load: creates a DF from a source file

createExternalTable: creates a Hive external table

jdbc: returns a DF for a database table accessed over JDBC

sql: executes a SQL query

table: returns the specified table as a DF

cacheTable: caches a table in memory

What Next?

Spark Community: spark.apache.org/community.html

Worldwide Events: goo.gl/2YqJZK

Video, presentation archives: spark-summit.org

Dev resources: databricks.com/spark/developer-resources

Workshops: databricks.com/services/spark-training

Books: Learning Spark, Advanced Analytics with Apache Spark

Github: https://github.com/snudurupati/spark_training

