DATA SCIENCE USING SPARK: AN INTRODUCTION
TOPICS COVERED
Introduction to Spark
Getting Started with Spark
Programming in Spark
Data Science with Spark
What next?
DATA SCIENCE PROCESS
Stages shown in the process diagram:
▪ Real World
▪ Raw data is collected
▪ Data is processed
▪ Clean Data
▪ Exploratory Data Analysis
▪ Machine Learning Algorithms / Statistical Models
▪ Build Data Product
▪ Communicate / Visualizations / Report Findings
▪ Make Decisions
Source: Doing Data Science by Rachel Schutt & Cathy O’Neil
DATA SCIENCE & DATA MINING
Distinctions are blurred
Fields shown in the overlap diagram:
▪ Statistics
▪ Machine Learning
▪ Database Management
▪ Management Science
▪ Natural Language Processing
▪ Text Mining
▪ Web Mining
▪ (Structured) Data Mining
▪ (Unstructured) Data Mining
▪ Data Mining
▪ Visual Data Mining
▪ Big Data Engineering
▪ Business Analytics
▪ Data Science
▪ Domain Knowledge
▪ Knowledge Discovery
WHAT DO WE NEED TO SUPPORT DATA SCIENCE WORK?
Data Input / Output
▪ Ability to read data in multiple formats
▪ Ability to read data from multiple sources
▪ Ability to deal with Big Data (Volume, Velocity, Veracity and Variety)
Data Transformations
▪ Easy to describe and perform transformations on rows and columns of data
▪ Requires abstraction of data and a dataflow paradigm
Model Development
▪ Library of Data Science Algorithms
▪ Ability to import / export models from other sources
▪ Data Science pipelines / workflow Development
Analytics Applications Development
▪ Seamless integration with programming languages / IDEs
WHAT IS SPARK?
▪ A distributed computing platform designed to be
  ▪ Fast
  ▪ General Purpose
▪ A general engine that allows combination of multiple types of computations
  ▪ Batch
  ▪ Interactive
  ▪ Iterative
  ▪ SQL Queries
  ▪ Text Processing
  ▪ Machine learning
Fast/Speed
▪ Computations in memory
▪ Faster than MapReduce (MR) even for disk computations
Generality
▪ Designed for a wide range of workloads
▪ Single engine to combine batch, interactive, iterative, and streaming algorithms
▪ Has rich high-level libraries and simple native APIs in Java, Scala and Python
▪ Reduces the management burden of maintaining separate tools
CLUSTER MANAGERS
▪ Can run on a variety of cluster managers
  ▪ Hadoop YARN (Yet Another Resource Negotiator): a cluster management technology and one of the key features of Hadoop 2.
  ▪ Apache Mesos: abstracts CPU, memory, storage, and other compute resources away from machines, enabling fault-tolerant and elastic distributed systems.
  ▪ Spark Standalone Scheduler: provides an easy way to get started on an empty set of machines.
▪ Spark can leverage existing Hadoop infrastructure
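A minimal sketch of how the choice of cluster manager shows up in code: the application stays the same and only the master URL changes. The host names below are hypothetical placeholders, the ports are the customary defaults, and the plain "yarn" master assumes Spark 2.x or later (1.x releases used "yarn-client" / "yarn-cluster").

```scala
import org.apache.spark.SparkConf

// Hypothetical host names; 7077 and 5050 are the usual default ports.
val masters = Map(
  "local"      -> "local[*]",                 // all cores of this machine
  "standalone" -> "spark://master-host:7077", // Spark Standalone Scheduler
  "mesos"      -> "mesos://mesos-host:5050",  // Apache Mesos
  "yarn"       -> "yarn"                      // Hadoop YARN (Spark 2.x+ spelling)
)

// Build a configuration for the chosen cluster manager.
def confFor(manager: String): SparkConf =
  new SparkConf()
    .setAppName("cluster-manager-sketch")
    .setMaster(masters(manager))
```

In practice the master is often passed on the command line to spark-shell or spark-submit rather than hard-coded, so the same program can move between managers unchanged.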
SPARK HISTORY
▪ Started in 2009 as a research project in the UC Berkeley RAD Lab, which later became the AMPLab.
▪ Spark researchers found that Hadoop MapReduce was inefficient for iterative and interactive computing.
▪ Spark was designed from the beginning to be fast for interactive and iterative workloads, with support for in-memory storage and fault tolerance.
▪ Apart from UC Berkeley, Databricks, Yahoo! and Intel are major contributors.
▪ Spark was open sourced in March 2010 and transferred to the Apache Software Foundation in June 2013.
CONFIDENTIAL AND PROPRIETARY
SPARK VS HADOOP
Hadoop MapReduce
▪ Mostly suited for batch jobs
▪ Difficult to program directly in MapReduce
▪ Batch doesn’t compose well for large apps
▪ Specialized systems needed as a workaround
Spark
▪ Handles batch, interactive, and real-time within a single framework
▪ Native integration with Java, Python, Scala
▪ Programming at a higher level of abstraction
▪ More general than MapReduce
GETTING STARTED WITH SPARK (NOT COVERED TODAY)
There are multiple ways of using Spark
▪ Certified Spark Distributions
  ▪ DataStax Enterprise (Cassandra + Spark)
  ▪ Hortonworks HDP
  ▪ MapR
▪ Local/Standalone
▪ Databricks Cloud
▪ Amazon AWS EC2
LOCAL MODE
▪ Install Java JDK 6/7 on Mac OS X or Windows
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
▪ Install Python 2.7 using Anaconda (only on Windows)
https://store.continuum.io/cshop/anaconda/
▪ Download Apache Spark from Databricks and unzip the downloaded file to a convenient location
http://training.databricks.com/workshop/usb.zip
▪ Change to the newly created spark-training directory
▪ Run the interactive Scala shell (REPL)
./spark/bin/spark-shell
// sc is the SparkContext that spark-shell creates on startup
val data = 1 to 1000
// Distribute the local range across the cluster as an RDD
val distData = sc.parallelize(data)
// Lazy transformation: nothing is computed yet
val filteredData = distData.filter(s => s < 25)
// Action: runs the job and returns the values 1 to 24 as an Array
filteredData.collect()
DATABRICKS CLOUD
▪ A hosted data platform powered by Apache Spark
▪ Features
  ▪ Exploration and Visualization
  ▪ Managed Spark Clusters
  ▪ Production Pipelines
  ▪ Support for 3rd party apps (Tableau, Pentaho, QlikView)
▪ Databricks Cloud Trial
http://databricks.com/registration
▪ Demo
DATABRICKS CLOUD
▪ Notebooks
▪ Python
▪ Scala
▪ SQL
▪ Visualizations
▪ Markup
▪ Comments
▪ Collaboration
AMAZON EC2
▪ The Spark distribution ships with a spark-ec2 script to manage clusters on EC2
▪ Launching a Spark cluster on EC2
./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
▪ Logging in to the cluster to run applications
./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>
▪ Terminating a cluster
./spark-ec2 destroy <cluster-name>
▪ Accessing data in S3
s3n://<bucket>/path
Scala – Scalable Language
▪ Scala is a multi-paradigm programming language with a focus on the functional programming paradigm.
▪ Functional programming builds programs from functions that work on immutable values.
▪ Every value, including functions, is an object, and operators are ordinary method calls.
▪ Scala compiles to bytecode that runs on the JVM and can use any Java library.
▪ Spark is written primarily in Scala.
▪ Spark SQL, GraphX, Spark Streaming etc. are libraries written in Scala.
▪ Scala Crash Course by Holden Karau @databricks
lintool.github.io/SparkTutorial/slides/day1_Scala_crash_course.pdf
Spark Model
▪ Write programs in terms of transformations on distributed datasets
▪ Resilient Distributed Datasets (RDDs)
  ▪ Read-only collections of objects that can be stored in memory or on disk across a cluster
  ▪ Partitions are automatically rebuilt on failure
  ▪ Parallel functional transformations (map, filter, ...)
  ▪ Familiar Scala collections API for distributed data and computation
  ▪ Lazy transformations
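To illustrate the "familiar Scala collections API" point, here is a small sketch that applies the same map and filter calls to a local collection and to an RDD. It assumes a local master and Spark 1.4 or later (for SparkContext.getOrCreate, which reuses the shell's context if one already exists); the word list is made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; inside spark-shell this returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("spark-model-sketch"))

val words = Seq("spark", "is", "fast", "and", "general")

// Ordinary Scala collection: evaluated eagerly, on one machine.
val localLengths = words.map(_.length).filter(_ > 2)

// Same calls on an RDD: each one only records a step in the lineage.
val distLengths = sc.parallelize(words).map(_.length).filter(_ > 2)

// Ask Spark to keep the computed partitions in memory for later actions.
distLengths.cache()

// Only an action such as collect() triggers the distributed computation.
val result = distLengths.collect().toSeq
```

Both versions yield the same lengths; the difference is where and when the work happens.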
Spark Core
RDD – Resilient Distributed Dataset
▪ The primary abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel.
▪ Two types
  ▪ Parallelized Scala collections
  ▪ Hadoop datasets
▪ Transformations and actions can be performed on RDDs.
Transformations
▪ Operate on an RDD and return a new RDD.
▪ Are lazily evaluated.
Actions
▪ Return a value after running a computation on an RDD.
▪ The DAG (directed acyclic graph) of transformations is evaluated only when an action takes place.
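The split between transformations and actions can be sketched as follows. This is an illustrative example, not code from the talk; it assumes a local master and Spark 1.4 or later for SparkContext.getOrCreate.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; inside spark-shell this returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("rdd-sketch"))

// A parallelized Scala collection. A Hadoop dataset would instead come from
// sc.textFile(path), with a path that you supply.
val numbers = sc.parallelize(1 to 10)

// Transformations: each returns a new RDD and runs nothing yet.
val evens = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n * n)

// The recorded lineage (the DAG) can be inspected without running a job.
println(squares.toDebugString)

// Actions: each one evaluates the DAG and returns a value to the driver.
val howMany = squares.count()      // 5
val total = squares.reduce(_ + _)  // 4 + 16 + 36 + 64 + 100 = 220
val all = squares.collect()        // Array(4, 16, 36, 64, 100)
```

Each action here re-runs the lineage from the start; caching the RDD first would avoid the repeated work.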
Spark Shell
▪ Interactive queries and prototyping
▪ Local, YARN, Mesos
▪ Static type checking and auto complete
Spark Streaming
• Real-time computation similar to Storm
• Input replicated in memory for fault tolerance
• Streaming input divided into sliding windows of RDDs
• Input sources: Kafka, Flume, Kinesis, HDFS
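As a rough sketch of the sliding-window idea, the example below counts words over the last 30 seconds of input and recomputes the counts every 10 seconds. The socket source on localhost:9999, the 1-second batch interval, and the window sizes are all assumptions chosen for illustration; a Kafka, Flume, or Kinesis source would need its own connector library. Older 1.x releases may also need `import org.apache.spark.streaming.StreamingContext._` for the pair operations.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object WindowedWordCount {
  // Word counts over a 30-second window that slides forward every 10 seconds.
  def windowedWordCounts(lines: DStream[String]): DStream[(String, Int)] =
    lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

  def main(args: Array[String]): Unit = {
    // local[2]: one thread receives input, at least one processes it.
    val conf = new SparkConf().setMaster("local[2]").setAppName("windowed-word-count")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batches
    val lines = ssc.socketTextStream("localhost", 9999) // assumed text source
    windowedWordCounts(lines).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each batch interval produces one RDD, and the window operation combines the RDDs that fall inside the current window.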
WHAT DO WE NEED TO SUPPORT DATA SCIENCE WORK?
Data Input / Output
▪ Ability to read data in multiple formats
▪ Ability to read data from multiple sources
▪ Ability to deal with Big Data (Volume, Velocity, and Variety)
Data Transformations
▪ Easy to describe and perform transformations on rows and columns of data
▪ Requires abstraction of data and a dataflow paradigm
Model Development
▪ Library of Data Science Algorithms
▪ Ability to import / export models from other sources
▪ Data Science pipelines / workflow Development
Analytics Applications Development
▪ Seamless integration with programming languages / IDEs
WHY SPARK FOR DATA SCIENCE?
SPARK is HOT
▪ Fast: Distributed in-memory platform
▪ Scalable: Small to Big Data; well integrated into the Big Data ecosystem
▪ Expressive: Simple, higher-level abstractions for describing computations
▪ Flexible: Extensible, multiple language bindings (Scala, Java, Python, R)
Traditional Data Science Tools
Matlab
R
SAS
SPSS
RapidMiner
And many others...
Designed to work on single machines
Often proprietary and expensive
What is available in Spark?
▪ Basic RDD (Transformations & Actions)
▪ Extensions to RDD (SchemaRDD, RRDD, RDPG, DStreams)
▪ Library of Algorithms (MLlib, R packages, Mahout?, Graph Algorithms)
▪ Analytics Workflows (ML Pipeline)
DATA TYPES FOR DATA SCIENCE (MLLIB)
Single Machine Data Types
▪ Local Vector
▪ Labeled Point
▪ Local Matrix
Distributed Data Types (supported by RDDs)
▪ Distributed Matrix
  ▪ RowMatrix
  ▪ IndexedRowMatrix
  ▪ CoordinateMatrix
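The data types listed above can be constructed along these lines. This is a sketch against the RDD-based spark.mllib package, with made-up values; it assumes a local master and Spark 1.4 or later for SparkContext.getOrCreate.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.{
  CoordinateMatrix, IndexedRow, IndexedRowMatrix, MatrixEntry, RowMatrix}
import org.apache.spark.mllib.regression.LabeledPoint

// Assumption: local mode; inside spark-shell this returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("mllib-types-sketch"))

// Single-machine types
val dense = Vectors.dense(1.0, 0.0, 3.0)                      // local vector
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))  // same values, sparse
val example = LabeledPoint(1.0, dense)                        // label + features
val local = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0)) // column-major

// Distributed types, each backed by an RDD
val rowMat = new RowMatrix(sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0), Vectors.dense(5.0, 6.0))))

// Rows carry an explicit Long index, so gaps (row 1 here) are allowed.
val indexed = new IndexedRowMatrix(sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 2.0)),
  IndexedRow(2L, Vectors.dense(5.0, 6.0)))))

// Individual (row, column, value) entries; suited to very sparse matrices.
val coord = new CoordinateMatrix(sc.parallelize(Seq(
  MatrixEntry(0L, 0L, 1.0), MatrixEntry(2L, 1L, 6.0))))
```

RowMatrix has no meaningful row indices, IndexedRowMatrix adds them, and CoordinateMatrix stores only the non-zero entries, so the choice depends on how sparse the data is and whether row identity matters.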
R to Spark Dataflow
Components shown in the dataflow diagram:
▪ Local: R, Spark Context (ref. in R), Java Spark Context
▪ Worker (two shown): Spark Executor, R
▪ Sent to each worker: tasks, broadcast vars, R packages