Easy JSON Data Manipulation in Spark Yin Huai – Spark Summit 2014.
Spark Summit EU talk by William Benton
-
Upload
spark-summit -
Category
Data & Analytics
-
view
294 -
download
0
Transcript of Spark Summit EU talk by William Benton
William Benton Red Hat, Inc. @willb • [email protected]
CONTAINERIZED SPARK ON KUBERNETES
BACKGROUND
BACKGROUND
BACKGROUND
BACKGROUND
BACKGROUND
BACKGROUND
BACKGROUND
BACKGROUND
WHAT OUR SPARK CLUSTER LOOKED LIKE IN 2014
WHAT OUR SPARK CLUSTER LOOKED LIKE IN 2014
Spark executor
Spark executor
Spark executor
Spark executor
Spark executor
Spark executor
WHAT OUR SPARK CLUSTER LOOKED LIKE IN 2014
Networked POSIX FS
Spark executor
Spark executor
Spark executor
Spark executor
Spark executor
Spark executor
Mesos
WHAT OUR SPARK CLUSTER LOOKED LIKE IN 2014
Networked POSIX FS
Spark executor
Spark executor
Spark executor
Spark executor
Spark executor
Spark executor
Mesos
WHAT OUR SPARK CLUSTER LOOKED LIKE IN 2014
Networked POSIX FS
Spark executor
Spark executor
Spark executor
Spark executor
Spark executor
Spark executor
1
2
3
4
Mesos
WHAT OUR SPARK CLUSTER LOOKED LIKE IN 2014
Networked POSIX FS
Spark executor
Spark executor
Spark executor
Spark executor
Spark executor
Spark executor
1
2
3
4
1
1
2
3
3
4
Mesos
WHAT OUR SPARK CLUSTER LOOKED LIKE IN 2014
Networked POSIX FS
Spark executor
Spark executor
Spark executor
Spark executor
Spark executor
Spark executor
1
2
3
4
1
1
2
3
3
4
Analytics is no longer a separate workload.
Analytics is an essential component of modern data-driven applications.
OUR GOALS
OUR GOALS
OUR GOALS
OUR GOALS
OUR GOALS
git
OUR GOALS
git
FORECAST
Motivating containerized microservices
Architectures for analytics and applications
Spark clusters in containers: practicalities and pitfalls
Play along at home
Future work
MOTIVATING MICROSERVICES
A microservice architecture employs lightweight, modular, and typically stateless components with well-defined interfaces and contracts.
BENEFITS OF MICROSERVICE ARCHITECTURES
BENEFITS OF MICROSERVICE ARCHITECTURES
BENEFITS OF MICROSERVICE ARCHITECTURES
BENEFITS OF MICROSERVICE ARCHITECTURES
BENEFITS OF MICROSERVICE ARCHITECTURES
BENEFITS OF MICROSERVICE ARCHITECTURES
BENEFITS OF MICROSERVICE ARCHITECTURES
BENEFITS OF MICROSERVICE ARCHITECTURES
2 + 2
BENEFITS OF MICROSERVICE ARCHITECTURES
2 + 2 5
BENEFITS OF MICROSERVICE ARCHITECTURES
2 + 2 5
BENEFITS OF MICROSERVICE ARCHITECTURES
BENEFITS OF MICROSERVICE ARCHITECTURES
BENEFITS OF MICROSERVICE ARCHITECTURES
?
BENEFITS OF MICROSERVICE ARCHITECTURES
?
BENEFITS OF MICROSERVICE ARCHITECTURES
?
MICROSERVICES AND SPARK
executor
1 2 3
executor
4 5 6
executor
7 8 9
executor
10 11 12
master
MICROSERVICES AND SPARK
executor
1 2 3
executor
4 5 6
executor
7 8 9
executor
10 11 12
master
λ x: x * 2
MICROSERVICES AND SPARK
executor
1 2 3
executor
4 5 6
executor
7 8 9
executor
10 11 12
master
λ x: x * 22 4 6 8 10 12 14 16 18 20 22 24
λ x: x * 2 λ x: x * 2 λ x: x * 2 λ x: x * 2
MICROSERVICES AND SPARK
executor
1 2 3
executor
4 5 6
executor
7 8 9
executor
10 11 12
master
λ x: x * 22 4 6 8 10 12 14 16 18 20 22 24
λ x: x * 2 λ x: x * 2 λ x: x * 2 λ x: x * 2
MICROSERVICES AND SPARK
executor
1 2 3
executor
4 5 6
executor
7 8 9
executor
10 11 12
master
λ x: x * 22 4 6 8 10 12 14 16 18 20 22 24
λ x: x * 2 λ x: x * 2 λ x: x * 2 λ x: x * 2
ARCHITECTURES FOR ANALYTICS AND APPLICATIONS
APPLICATION RESPONSIBILITIES
transform
transform
transform
events
databases
file, object storage
APPLICATION RESPONSIBILITIES
transform
transform
transform
aggregate
events
databases
file, object storage
APPLICATION RESPONSIBILITIES
trainmodels
transform
transform
transform
aggregate
events
databases
file, object storage
APPLICATION RESPONSIBILITIES
trainmodels
transform
transform
transform
aggregate
events
databases
file, object storage
APPLICATION RESPONSIBILITIES
archive
trainmodels
transform
transform
transform
aggregate
events
databases
file, object storage
APPLICATION RESPONSIBILITIES
archive
trainmodels
transform
transform
transform
aggregate
events
databases
file, object storage
APPLICATION RESPONSIBILITIES
archive
trainmodels
transform
transform
transform
aggregate
events
databases
file, object storage
web and mobile
reporting
developer UI
APPLICATION RESPONSIBILITIES
archive
trainmodels
transform
transform
transform
aggregate
events
databases
file, object storage
management
web and mobile
reporting
developer UI
LEGACY ARCHITECTURES
CONVENTIONAL DATA WAREHOUSE
events
CONVENTIONAL DATA WAREHOUSE
transformevents
CONVENTIONAL DATA WAREHOUSE
transformevents
UI
CONVENTIONAL DATA WAREHOUSE
transformevents
UI business logic
CONVENTIONAL DATA WAREHOUSE
transformevents
UI business logic
RDBMS
transactionprocessing
CONVENTIONAL DATA WAREHOUSE
transformevents
UI business logic
RDBMS
transactionprocessing
CONVENTIONAL DATA WAREHOUSE
transformevents
UI business logic
RDBMS
transactionprocessing
CONVENTIONAL DATA WAREHOUSE
transformevents
UI business logic
RDBMS RDBMS
transactionprocessing
CONVENTIONAL DATA WAREHOUSE
transformevents
UI business logic
RDBMS RDBMS
analysis
transactionprocessing
CONVENTIONAL DATA WAREHOUSE
transformevents
UI business logic
RDBMS RDBMS
analysis
reporting
transactionprocessing
CONVENTIONAL DATA WAREHOUSE
transformevents
UI business logic
RDBMS RDBMS
analysis
reporting
transactionprocessing
CONVENTIONAL DATA WAREHOUSE
transformevents
UI business logic
RDBMS RDBMS
analysis
interactivequeryreporting
transactionprocessing
CONVENTIONAL DATA WAREHOUSE
transformevents
UI business logic
RDBMS analyticprocessing
RDBMS
analysis
interactivequeryreporting
HADOOP-STYLE “DATA LAKE”
HDFS HDFS HDFS
HADOOP-STYLE “DATA LAKE”
HDFS HDFS HDFS HDFS HDFS
HADOOP-STYLE “DATA LAKE”
HDFS
events
HDFS HDFS HDFS HDFS
HADOOP-STYLE “DATA LAKE”
HDFS
events
HDFS HDFS HDFS HDFS
HADOOP-STYLE “DATA LAKE”
HDFS
compute
events
HDFS
compute
HDFS
compute compute compute
HDFS HDFS
MODERN ARCHITECTURES
THE LAMBDA ARCHITECTURE
events
(imprecise)analysistransform
THE LAMBDA ARCHITECTURE
events
(imprecise)analysistransform
DFS
speed layer
THE LAMBDA ARCHITECTURE
events
(imprecise)analysistransform
DFS
speed layer
THE LAMBDA ARCHITECTURE
events
(precise) analysistransform
(imprecise)analysistransform
DFS
speed layer
THE LAMBDA ARCHITECTURE
events
batch layer
(precise) analysistransform
(imprecise)analysistransform
DFS
speed layer
THE LAMBDA ARCHITECTURE
events
batch layer
federate
(precise) analysistransform
(imprecise)analysistransform
DFS
speed layer
THE LAMBDA ARCHITECTURE
events
batch layer
UIfederate
(precise) analysistransform
(imprecise)analysistransform
DFS
serving layerspeed layer
THE LAMBDA ARCHITECTURE
events
batch layer
UIfederate
(precise) analysistransform
(imprecise)analysistransform
DFS
THE KAPPA ARCHITECTURE
events
queue for “raw data” topic
THE KAPPA ARCHITECTURE
events
queue for “raw data” topic
THE KAPPA ARCHITECTURE
events
transform
queue for “preprocessed data” topic
queue for “raw data” topic
THE KAPPA ARCHITECTURE
events
transform analysis
queue for “preprocessed data” topic
queue for “analysis results” topic
queue for “raw data” topic
THE KAPPA ARCHITECTURE
events
transform analysis
queue for “preprocessed data” topic
queue for “analysis results” topic
reporting end-user UI
DATA FEDERATION IN THE COMPUTE LAYER
aggregate
trainmodels
archive
events
databases
file, object storage
management
web and mobile
reporting
developer UItransform
transform
transform
DATA FEDERATION IN THE COMPUTE LAYER
aggregate
trainmodels
archive
events
databases
file, object storage
management
web and mobile
reporting
developer UItransform
transform
transform
DATA FEDERATION IN THE COMPUTE LAYER
aggregate
trainmodels
archive
events
databases
file, object storage
management
web and mobile
reporting
developer UItransform
transform
transform
Cluster scheduler
SIDEBAR: THE MONOLITHIC SPARK ANTIPATTERN
Shared FSSpark executor
Spark executor
Spark executor
Spark executor
Spark executor
Spark executor
Resource manager
app 1 app 2
app 4app 3
Cluster scheduler
SIDEBAR: THE MONOLITHIC SPARK ANTIPATTERN
Shared FSSpark executor
Spark executor
Spark executor
Spark executor
Spark executor
Spark executor
Resource manager
app 1 app 2
app 4app 3
Resource manager
ONE CLUSTER PER APPLICATION
Object storesapp 1 app 2
app 5app 4
app 3
app 6
Databases
Resource manager
ONE CLUSTER PER APPLICATION
Object storesapp 1 app 2
app 5app 4
app 3
app 6
app 2
app 4
Databases
PRACTICALITIES AND POTENTIAL PITFALLS
SCHEDULING
Kubernetes
app 1 app 2
app 5app 4
app 3
app 6
app 1 app 2
app 5app 4
app 3
app 6
Object stores
Databases
SCHEDULING
Kubernetes
app 1 app 2
app 5app 4
app 3
app 6
app 1
SCHEDULING
Kubernetes
app 1 app 2
app 5app 4
app 3
app 6
app 1
SCHEDULING
Kubernetes
app 1 app 2
app 5app 4
app 3
app 6
app 1
SCHEDULING
Kubernetes
app 1 app 2
app 5app 4
app 3
app 6
app 1
SCHEDULING
Kubernetes
app 1 app 2
app 5app 4
app 3
app 6
app 1
SECURITY
SECURITY
SECURITY
$SPARK_HOME/bin/spark-class \ org.apache.spark.deploy.worker.Worker \ master:7077
pid
root
net
SECURITY
$SPARK_HOME/bin/spark-class \ org.apache.spark.deploy.worker.Worker \ master:7077
pid
root
net
/tmp/foo
SECURITY
SECURITY
k8s namespace
SECURITY
k8s namespace*
STORAGE
Kubernetes
app 1 app 2
app 5app 4
app 3
app 6
app 1 app 2
app 5app 4
app 3
app 6
STORAGE
Kubernetes
app 1 app 2
app 5app 4
app 3
app 6
app 1 app 2
app 5app 4
app 3
app 6
POSIX filesystem
STORAGE
Kubernetes
app 1 app 2
app 5app 4
app 3
app 6
app 1 app 2
app 5app 4
app 3
app 6
POSIX filesystem
✓ familiar interface
✓ interoperability with other programs
STORAGE
Kubernetes
app 1 app 2
app 5app 4
app 3
app 6
app 1 app 2
app 5app 4
app 3
app 6
POSIX filesystem
✓ familiar interface
✓ interoperability with other programs
✗ unnecessary semantic guarantees
✗ difficult to manage
STORAGE
Kubernetes
app 1 app 2
app 5app 4
app 3
app 6
app 1 app 2
app 5app 4
app 3
app 6
HDFS HDFS
HDFS HDFS
HDFS
HDFS
STORAGE
Kubernetes
app 1 app 2
app 5app 4
app 3
app 6
app 1 app 2
app 5app 4
app 3
app 6
HDFS HDFS
HDFS HDFS
HDFS
HDFS
✓ support for legacy Hadoop installations
✗ inelastic ✗ stateful ✗ can’t collocate
compute and data
STORAGE
Kubernetes
app 1 app 2
app 5app 4
app 3
app 6
app 1 app 2
app 5app 4
app 3
app 6
HDFS HDFS
HDFS HDFS
HDFS
HDFS
✓ support for legacy Hadoop installations
STORAGE
Kubernetes
app 1 app 2
app 5app 4
app 3
app 6
app 1 app 2
app 5app 4
app 3
app 6
object store
STORAGE
Kubernetes
app 1 app 2
app 5app 4
app 3
app 6
app 1 app 2
app 5app 4
app 3
app 6
object store
✓ interoperability
✓ fine-grained AC
✓ many implementations
STORAGE
Kubernetes
app 1 app 2
app 5app 4
app 3
app 6
app 1 app 2
app 5app 4
app 3
app 6
object store
✓ interoperability
✓ fine-grained AC
✓ many implementations
✗ consistency model ✗ performance (?)
STORAGE
Kubernetes
app 1 app 2
app 5app 4
app 3
app 6
app 1 app 2
app 5app 4
app 3
app 6
object store
✓ interoperability
✓ fine-grained AC
✓ many implementations
✗ consistency model ✗ performance
“…in a cloud native architecture, the benefit of HDFS is actually very small and that is why many cloud-first organizations no longer run HDFS, or only run it as a caching layer for S3.”
—Reynold Xin on Quora (http://qr.ae/TAF4cN)
NETWORKING
NETWORKING
NETWORKING
NETWORKINGhttp://app1:8080
NETWORKINGhttp://app1:8080
✗ can’t access worker web UI (but wait for Spark 2.1!)
NETWORKINGhttp://app1:8080
✗ can’t access worker web UI (but wait for Spark 2.1!)
NETWORKINGhttp://app1:8080
✗ can’t access worker web UI (but wait for Spark 2.1!)
http://app1:80
NEXT STEPS: FUTURE WORK & PLAYING ALONG AT HOME
NEXT STEPS
Further performance evaluation
Better developer experience
Improved scheduling of Spark tasks on Kubernetes
TRY IT OUT YOURSELF
Kubernetes standalone Spark example: https://github.com/kubernetes/kubernetes/tree/master/examples/spark
Enabling Spark on OpenShift: https://github.com/radanalyticsio
Native Spark on Kubernetes proposal: https://github.com/kubernetes/kubernetes/issues/34377
@willb • [email protected]://chapeau.freevariable.com
THANKS!