Apache Spark in the Cloud Zbyněk Roubalík Senior Quality Engineer, Red Hat February 15 2018

Apache Spark in the Cloud

Zbyněk Roubalík, Senior Quality Engineer, Red Hat

February 15 2018

Technologies

● Apache Spark

● Docker

● Kubernetes

● OpenShift

Apache Spark in the Cloud

aka

How to create and deploy Apache Spark applications to cloud native environments like OpenShift

What is cloud native?

● Containerized

● Dynamically orchestrated

● Microservice oriented

● www.cncf.io/about/faq

Containers

● A container image is a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, settings.

● https://www.docker.com/what-container

VM vs Containers

Containers

● Cloud vs standard deployment model

● Pets vs Cattle

● Developers + Operations (Admins) → DevOps

● Docker

Kubernetes

● Container cluster manager

Kubernetes

● Keeps cluster state in etcd, a distributed, clustered key-value store

● The smallest deployable unit is the Pod
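
To make the Pod idea concrete, here is a minimal sketch that builds a Pod manifest as a Python dictionary and prints it as JSON, a format the Kubernetes API accepts alongside YAML. The pod name `spark-demo` and the image `example/spark:latest` are hypothetical placeholders, not taken from the talk.

```python
import json


def make_pod_manifest(name, image):
    """Return a minimal Kubernetes Pod manifest as a plain dictionary."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # A Pod wraps one or more containers that are scheduled together.
            "containers": [
                {"name": name, "image": image},
            ],
        },
    }


# Hypothetical name and image, used only for illustration.
manifest = make_pod_manifest("spark-demo", "example/spark:latest")
print(json.dumps(manifest, indent=2))
```

Saved to a file such as `pod.json`, a manifest like this could be submitted with `kubectl create -f pod.json`.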

OpenShift

● Open Source Container Application Platform

● Focused on applications (not just containers as a concept) and on developer experience

OpenShift

● Sits on top of Kubernetes

● Source code, build and deployment management

● S2I - Source to Image

● Application lifecycle management (CI/CD)

● Service catalog (Language runtimes, Middleware, Databases)

● Security

OpenShift architecture

Apache Spark

● Fast and general engine for large-scale data processing

● Distributed computation system

● Provides high-level APIs in Java, Scala, Python and R

● Supports a rich set of tools for Big Data, AI, ML

  ● Spark SQL for SQL and structured data processing

  ● MLlib for machine learning

  ● GraphX for graph processing

  ● Spark Streaming

  ● ...

General Spark architecture

How to interact with Spark

● Run an application

● Start a REPL

  ● Scala

  ● Python

  ● R

The fundamental Spark abstraction

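Resilient distributed datasets (RDDs)

● are partitioned, lazy and immutable homogeneous collections

● partitioned: the data is split into chunks spread across the cluster

● lazy: transformations are only computed when an action asks for a result

● immutable: a transformation returns a new RDD instead of modifying the original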
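
The three RDD properties above can be illustrated with a toy, single-process model. This is only a sketch: `ToyRDD` is a hypothetical class written for this example and is not the Spark API, although its `map`, `filter` and `collect` methods mirror the names of real RDD operations.

```python
class ToyRDD:
    """A toy, single-process stand-in for an RDD (not the Spark API)."""

    def __init__(self, partitions, ops=()):
        # Partitioned: the data is held as several independent chunks.
        self._partitions = tuple(tuple(p) for p in partitions)
        # Lazy: pending transformations are only recorded, not executed.
        self._ops = tuple(ops)

    def map(self, fn):
        # Immutable: return a new ToyRDD instead of changing this one.
        return ToyRDD(self._partitions, self._ops + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._partitions, self._ops + (("filter", pred),))

    def _compute(self, partition):
        # Apply the recorded transformations to a single partition.
        items = iter(partition)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def collect(self):
        # An action: only now is the recorded work actually executed.
        return [x for p in self._partitions for x in self._compute(p)]


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])  # two partitions
doubled_evens = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * 2)

print(doubled_evens.collect())  # [4, 8, 12]
print(numbers.collect())        # [1, 2, 3, 4, 5, 6], the original is unchanged
```

In real Spark each partition would be processed by an executor on a different node; here the partitions are simply processed one after another.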

Resilient distributed dataset in action

What is a Spark application?

simple.py application

● Even numbers count
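
A minimal sketch of what an even-numbers count could look like in PySpark. This is not the speaker's code: the input range 1 to 1000, the application name `simple` and the helper names `is_even` and `count_even` are assumptions made for illustration.

```python
def is_even(n):
    """Return True when n is an even integer."""
    return n % 2 == 0


def count_even(rdd):
    """Count the even elements of an RDD of integers."""
    # filter() is a lazy transformation; count() is the action that runs it.
    return rdd.filter(is_even).count()


def main():
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="simple")
    try:
        numbers = sc.parallelize(range(1, 1001))  # assumed input data
        print("Even numbers:", count_even(numbers))
    finally:
        sc.stop()
```

To run it as a script, call `main()` at the end of the file and launch it with `spark-submit simple.py`.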

A little more complex application

Designing a Spark microservice

On-demand batch processing

Continuous batch processing

Stream processing

OpenShift architecture - recall

Spark on OpenShift

Oshinko - Integrating Spark and OpenShift

Demo time

Takeaways

● Containers

● Kubernetes

● OpenShift

● Apache Spark

● Oshinko tooling

Thank you!

www.github.com/radanalyticsio
