Distributed Computing on your Cluster with Anaconda - Webinar 2015

Transcript of Distributed Computing on your Cluster with Anaconda - Webinar 2015

Page 1: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Distributed Computing on your Cluster with Anaconda

Page 2: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Presenter Bio

Kristopher Overholt received his Ph.D. in Civil Engineering from The University of Texas at Austin.

Prior to joining Continuum, he worked at the National Institute of Standards and Technology (NIST), Southwest Research Institute (SwRI), and The University of Texas at Austin.

Kristopher has 10+ years of experience in areas including applied research, scientific and parallel computing, system administration, open-source software development, and computational modeling.

Kristopher Overholt Software Engineer

Continuum Analytics

Page 3: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Overview

• Overview of Anaconda

• Cluster Functionality of Anaconda

• Demo: Distributed Natural Language Processing

• Demo: Distributed Image Processing with GPUs

• Demo: Distributed SQL Queries on 1 TB of Data

• Anaconda Use Cases for your Enterprise

Page 4: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Overview of Anaconda

Page 5: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Anaconda is the modern open source analytics platform powered by Python, the fastest growing open data science language

• Easy to Build, Maintain & Deploy Analytics
• Talks with Everything, Runs Anywhere
• High Performance, Scalable Analytics

Page 6: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Anaconda: Accelerating Adoption of Python for Enterprises

• COLLABORATIVE NOTEBOOKS with publication, authentication, & search (Jupyter/IPython)
• PYTHON & PACKAGE MANAGEMENT for Hadoop & Apache stack (Spark)
• PERFORMANCE with compiled Python for lightning fast execution (Numba)
• VISUAL APPS for interactivity, streaming, & Big Data (Bokeh)
• SECURE & ROBUST REPOSITORY of data science libraries, scripts, & notebooks (Conda)
• ENTERPRISE DATA INTEGRATION with optimized connectors & out-of-core processing (NumPy & Pandas)

Page 7: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Anaconda for Data Science: Empowering Everyone on the Team

Data Scientist
• Advanced analytics with Python & R
• Simplified library management
• Easily share data science notebooks & packages

Developer
• Support for common APIs & data formats
• Common language with data scientists
• Python extensibility with C, C++, etc.

Business Analyst
• Collaborative interactive analytics with notebooks
• Rich browser-based visualizations
• Powerful MS Excel integration

Data Engineer
• Powerful & efficient libraries for data transformations
• Robust processing for noisy, dirty data
• Support for common APIs & data formats

Ops
• Validated source of up-to-date packages, including indemnification
• Agile Enterprise Package Management
• Supported across platforms

Computational Scientist
• Rich set of advanced analytics
• Trusted & production-ready libraries for numerics
• Simplified scale up & scale out on clusters & GPUs

Page 8: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Modern Analytics Stack

Page 9: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Write Once, Deploy Anywhere

MANAGED PYTHON
• Explore & Visualize
• Python & R Advanced Analytics
• High Performance & Scalability
• Data Engineering & Analysis
• Collaboration & Integration

• Servers: Linux, Windows, OS X
• GPUs & High-End Workstations: Linux & Windows; NVIDIA, AMD, x86/ARM
• Clusters: YARN, Mesos, MPI, Power8, LSF, Sun Grid Engine
• NoSQL: MongoDB, Cassandra/DataStax
• Hadoop: Cloudera, Hortonworks, Apache Hadoop & Spark
• Files: Microsoft Excel, Trifacta, Import.io
• DW & SQL: any SQL DB, any SQL DW, Impala

Page 10: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Cluster Functionality of Anaconda

Page 11: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Anaconda: Scaled up Python for your Enterprise

• Analysts, domain experts, quants, statisticians, data scientists, etc. want to leverage Python and existing libraries

• Newer analytics engines leverage existing runtimes, including Python and R (PySpark, SparkR)

PYTHON & R OPEN SOURCE ANALYTICS (conda)

NumPy, SciPy, Pandas, Scikit-learn, Jupyter/IPython, Numba, Matplotlib, Spyder, Numexpr, Cython, Theano, Scikit-image, NLTK, NetworkX, IRKernel, dplyr, shiny, ggplot2, tidyr, caret, nnet, and 330+ packages

Page 12: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Anaconda-Powered Cluster for your Enterprise

For data scientists:

• Scaled-up Analytics: Develop and deploy the same code/environment on your local machine and the cluster

• Cluster Management: Easily provision and manage your cluster stack and data analysis tools/environments

For system administrators and DevOps:

• Environment Management: Provide the tools your data scientists need at enterprise scale

• Remote Packaging: Easily deploy Python (or R, or Julia, or…) applications to your Spark/Hadoop cluster

Page 13: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Modern Analytics Ecosystem

• Distributed Systems
• Databases
• Stats/Machine Learning
• Scientific Computing

Page 14: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Cluster Architecture Diagram

[Diagram: Client Machine, Head Node, and three Compute Nodes]

Page 15: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Remote Conda and Cluster Management

New Spark/Hadoop clusters

• Create and provision a Spark/Hadoop cluster with a few simple steps

• Work on the cloud or with your existing in-house servers

Existing Spark/Hadoop clusters

• Deploy and manage conda packages/environments on cluster nodes

• Solves the remote packaging problem

• Empower data scientists without sacrificing control of your cluster

Page 16: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Creating and Provisioning a Cluster

Page 17: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Defining a Cloud-Based Cluster

Provider

aws_east:
  cloud_provider: ec2
  keyname: anaconda-cluster
  location: us-east-1
  private_key: ~/.ssh/anaconda-cluster.pem
  secret_id: **********
  secret_key: **********

Profile

name: spark-cluster
node_id: ami-d05e75b8
node_type: m3.xlarge
num_nodes: 4
provider: aws_east
user: ubuntu

Page 18: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Defining a Bare-Metal Cluster

Provider

bare_metal:
  cloud_provider: none
  private_key: ~/.ssh/my-private-key

Profile

name: spark-cluster
provider: bare_metal
num_nodes: 4
machines:
  head:
    - 192.168.1.1
  compute:
    - 192.168.1.2
    - 192.168.1.3
    - 192.168.1.4
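
A quick consistency check can catch a mismatch between num_nodes and the machines listed in a bare-metal profile before a cluster is created. The sketch below mirrors the profile fields on this slide as a Python dictionary. The validate_profile helper is hypothetical (it is not part of Anaconda Cluster), and it assumes that num_nodes counts the head node plus the compute nodes, as the example values suggest.

```python
def validate_profile(profile):
    """Return a list of problems found in a bare-metal profile dict.

    Hypothetical helper for illustration; not part of the acluster tool.
    Assumes num_nodes counts the head node plus all compute nodes.
    """
    problems = []
    machines = profile.get("machines", {})
    head = machines.get("head", [])
    compute = machines.get("compute", [])

    listed = len(head) + len(compute)
    if profile.get("num_nodes") != listed:
        problems.append("num_nodes is %r but %d machines are listed"
                        % (profile.get("num_nodes"), listed))
    if len(set(head + compute)) != listed:
        problems.append("duplicate machine addresses")
    return problems


# Same values as the bare-metal profile on this slide.
profile = {
    "name": "spark-cluster",
    "provider": "bare_metal",
    "num_nodes": 4,
    "machines": {
        "head": ["192.168.1.1"],
        "compute": ["192.168.1.2", "192.168.1.3", "192.168.1.4"],
    },
}

print(validate_profile(profile))
```

An empty list means the node count and the address lists agree.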

Page 19: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Creating and Using a Cluster

• Define provider: ~/.acluster/providers.yaml

• Define profile: ~/.acluster/profiles.d/profile.yaml

• Create cluster: acluster create cluster_name -p profile

• Remote conda: acluster conda install numpy scipy

• Install plugins: acluster install spark-yarn notebook
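
The command-line steps on this slide can also be scripted. The following is a minimal sketch that only builds the three acluster invocations shown here as argument lists and prints them; it does not run anything. The cluster_workflow function name is an assumption made for illustration.

```python
import shlex


def cluster_workflow(cluster_name, profile, conda_packages, plugins):
    """Build the acluster invocations from this slide as argument lists.

    Hypothetical helper: it only assembles the commands, it does not run them.
    """
    return [
        ["acluster", "create", cluster_name, "-p", profile],
        ["acluster", "conda", "install"] + list(conda_packages),
        ["acluster", "install"] + list(plugins),
    ]


commands = cluster_workflow(
    "cluster_name", "profile",
    conda_packages=["numpy", "scipy"],
    plugins=["spark-yarn", "notebook"],
)

# Print each command the way it would be typed in a shell.
for argv in commands:
    print(" ".join(shlex.quote(arg) for arg in argv))
```

On a machine where acluster is installed and configured, each argument list could be passed to subprocess.run(argv, check=True) to execute the steps in order.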

Page 20: Distributed Computing on your Cluster with Anaconda - Webinar 2015
Page 21: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Remote Conda and Cluster Management Commands

Page 22: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Remote Conda Commands

• Install packages: acluster conda install numpy scipy pandas numba

• Create environment: acluster conda create -n py34 python=3.4 numpy scipy pandas

• List packages: acluster conda list

• Conda information: acluster conda info

• Push environment: acluster conda push environment.yml

• Set default environment: acluster conda setenv py34
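
The push command on this slide takes an environment.yml file. As a sketch, and assuming the file follows the standard conda environment format (a name plus a list of dependencies), its text could be generated like this. The render_environment helper is hypothetical, and the package list simply reuses the py34 example from this slide.

```python
def render_environment(name, dependencies):
    """Render a minimal conda environment file as text.

    Hypothetical helper; assumes the standard conda environment.yml layout.
    """
    lines = ["name: %s" % name, "dependencies:"]
    lines.extend("  - %s" % dep for dep in dependencies)
    return "\n".join(lines) + "\n"


text = render_environment("py34", ["python=3.4", "numpy", "scipy", "pandas"])
print(text)
```

Writing the text to environment.yml and running acluster conda push environment.yml would then correspond to the "Push environment" step.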

Page 23: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Cluster Management Commands

• Create cluster: acluster create spark-cluster -p spark-profile

• List active clusters: acluster list

• Install plugins: acluster install spark-yarn notebook

• SSH to nodes: acluster ssh

• Put/get files: acluster put data.hdf5 /home/ubuntu/data.hdf5

• Run command: acluster cmd 'apt-get install ...'

• Submit script: acluster submit spark_script.py

Page 24: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Demo: Distributed Natural Language Processing

Page 25: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Demo Overview

This demo shows a simple PySpark job that uses the NLTK library, a popular Python package for processing human language data.

This demo will show the installation of Python packages on the cluster, the use of Spark and the YARN resource manager, and remote execution of the Spark job on the cluster.

Page 26: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Application: Jupyter/IPython Notebook

Analytics: Spark, NLTK

Data: Local files on each node

Server: Bare-metal or Cloud-based cluster

Page 27: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Demo Step-by-Step

• Create cluster

– 4 nodes, m3.large, 2 vCPUs, 7.5 GB RAM

• Install Spark, YARN, and Notebook plugins

• Remotely install conda packages

• Parallel download of data onto cluster nodes

• Use Spark and NLTK to tokenize words and tag parts of speech

– Remotely submitting the script to the Spark cluster

– Interactively in a notebook on the Spark cluster

Page 28: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Initialize SparkContext

Perform distributed NLTK operations

Specify location of data
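
The three annotations on this slide (initialize a SparkContext, specify the location of the data, perform distributed NLTK operations) can be sketched as a short PySpark script. This is an illustrative reconstruction, not the code from the demo: the function names, the application name, and the tag-counting step are assumptions. It also assumes that NLTK and its tokenizer and tagger data are installed on every node.

```python
from operator import add


def tag_line(line):
    """Tokenize one line of text and tag each token with its part of speech."""
    import nltk  # imported inside the function so it resolves on each worker
    return nltk.pos_tag(nltk.word_tokenize(line))


def tag_key(tagged_word):
    """Map a (word, tag) pair to (tag, 1) so that tags can be counted."""
    word, tag = tagged_word
    return (tag, 1)


def run(data_path):
    """Count part-of-speech tags across the text files at data_path."""
    from pyspark import SparkContext  # requires Spark to be available

    sc = SparkContext(appName="nltk-pos-tagging")  # initialize SparkContext
    lines = sc.textFile(data_path)                 # specify location of data
    counts = (lines.flatMap(tag_line)              # distributed NLTK operations
                   .map(tag_key)
                   .reduceByKey(add))
    result = counts.collect()
    sc.stop()
    return result
```

A script that calls run() with a path such as file:///path/to/text (a placeholder for the local files on each node) could be sent to the cluster with acluster submit, or the same functions could be used interactively from a notebook.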

Page 29: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Demo: Distributed Image Processing with GPUs

Page 30: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Demo Overview

To demonstrate the capability of running a distributed job in PySpark using GPUs, this demo uses Numba and the CUDA platform to perform image processing.

This demo executes two-dimensional FFT convolution on images in grayscale and compares the execution time of CPU-based and GPU-based calculations.
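
For readers unfamiliar with the operation, here is a CPU-only sketch of two-dimensional FFT convolution using NumPy. It is not the Numba/CUDA code used in the demo; it only illustrates the underlying technique: zero-pad both arrays to the full output size, multiply their spectra, and transform back. The function name and the test image are assumptions.

```python
import numpy as np


def fft_convolve2d(image, kernel):
    """Full 2-D linear convolution computed via FFT (CPU, NumPy only)."""
    # Pad to the full linear-convolution size to avoid circular wrap-around.
    out_shape = (image.shape[0] + kernel.shape[0] - 1,
                 image.shape[1] + kernel.shape[1] - 1)
    # Multiplication in the frequency domain equals convolution in space.
    spectrum = (np.fft.rfft2(image, s=out_shape) *
                np.fft.rfft2(kernel, s=out_shape))
    return np.fft.irfft2(spectrum, s=out_shape)


# Synthetic grayscale "image" and a 5x5 box-blur kernel.
image = np.random.RandomState(0).rand(64, 64)
kernel = np.ones((5, 5)) / 25.0
blurred = fft_convolve2d(image, kernel)
print(blurred.shape)  # (68, 68)
```

In a Spark job, a function like this would be applied to each image in a distributed collection, which is where a GPU implementation can reduce the per-image time.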

Page 31: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Application: Jupyter/IPython Notebook

Visualization: matplotlib

Analytics: Spark, Numba, SciPy, PIL

Data: HDFS

Server: GPU-enabled Bare-metal or Cloud-based Cluster

Page 32: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Demo Step-by-Step

• Create cluster

– 4 nodes, g2.2xlarge, 8 vCPUs, 15 GB RAM, 1 GPU

• Install Spark, YARN, HDFS, and Notebook plugins

• Bootstrap CUDA drivers on all nodes

• Remotely install conda packages

• High-performance parallel download of data into HDFS

• Use Spark, Numba, and GPU to perform FFT convolution on images

– Interactively in a notebook on the Spark cluster

Page 33: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Demo: Interactively Querying 1 TB of Data in Distributed SQL Engines

Page 34: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Demo Overview

In this demo, we’ll interactively query, explore, and visualize a data set of approximately 1.8 billion comments (~1 TB).

Blaze Bokeh

Page 35: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Application: Jupyter/IPython Notebook

Visualization: Bokeh

Analytics: Blaze/pandas, Hive, Impala

Data: HDFS

Server: Bare-metal or Cloud-based cluster

Page 36: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Demo Step-by-Step

• Create cluster

– 8 nodes, m3.2xlarge, 8 vCPUs, 30 GB RAM, 1 TB storage

• Install HDFS, Hive, Impala, and Notebook plugins

• High-performance parallel download of data into HDFS

• Move, convert, and load data into distributed SQL databases

• Run interactive queries from notebook using Blaze

• Interactively plot and explore results using Bokeh

Page 37: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Moving and Loading 1 TB of Data

Source data: Amazon S3 (JSON)

Stage              Time Moving Data   Time Querying Data
HDFS (JSON)        2 hours            -
Hive (JSON)        < 1 minute         30 minutes
Hive (Parquet)     1 hour             5 minutes
Impala (Parquet)   < 1 minute         5 seconds
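
To put the query times on this slide in perspective, the short calculation below computes each engine's speedup relative to Hive on JSON. It assumes the times pair with the stages in the order listed on the slide (Hive (JSON): 30 minutes, Hive (Parquet): 5 minutes, Impala (Parquet): 5 seconds); that pairing is a reading of the slide layout.

```python
# Query times in seconds, assuming the pairing described above.
query_times = {
    "Hive (JSON)": 30 * 60,
    "Hive (Parquet)": 5 * 60,
    "Impala (Parquet)": 5,
}

baseline = query_times["Hive (JSON)"]
speedups = {engine: baseline / seconds
            for engine, seconds in query_times.items()}

for engine, factor in speedups.items():
    print("%s: %.0fx the speed of Hive (JSON)" % (engine, factor))
```

Under that reading, converting to Parquet gives roughly a 6x improvement in Hive, and querying the same Parquet data through Impala is about 360x faster than the original Hive (JSON) query.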

Page 38: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Anaconda Use Cases for your Enterprise

Page 39: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Client Machine

Head Node

Compute Nodes

Data Science Use Case

Page 40: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Data Science Use Case

1. Analyst installs Anaconda on their local machine

Client Machine

Page 41: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Client Machine

Head Node

Compute Nodes

Data Science Use Case

2. Analyst creates a sandbox Spark/Hadoop cluster and installs plugins

Page 42: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Client Machine

Head Node

Compute Nodes

Data Science Use Case

3. Analyst deploys packages, environments, and data to cluster nodes

Page 43: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Client Machine

Head Node

Compute Nodes

Data Science Use Case

4. Analyst submits jobs to Spark/Hadoop cluster

Page 44: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Admin

Head Node

Compute Nodes

Enterprise Use Case

Analyst Machine

Anaconda Server

Page 45: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Analyst Machine

Enterprise Use Case

1. Analyst ships packages, environments, and data to on-premises repository

Page 46: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Admin

Head Node

Compute Nodes

Enterprise Use Case

2. Admin deploys packages, environments, and data to cluster nodes

Page 47: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Analyst Machine

Head Node

Compute Nodes

Enterprise Use Case

3. Analyst submits jobs to Spark/Hadoop cluster

Page 48: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Anaconda Cluster Plugins

Conda Hive Elasticsearch Jupyter/IPython Notebook

Spark Impala Logstash IPython Parallel

YARN Storm Kibana Dask

HDFS Zookeeper Ganglia

Page 49: Distributed Computing on your Cluster with Anaconda - Webinar 2015

System Architecture Diagram

Nodes shown: Head Node, Secondary Head Node, Edge Nodes (2), Compute Nodes

Legend:

ACH   Anaconda Cluster Head
ACC   Anaconda Cluster Compute
AS    Anaconda Server
S     Zookeeper Server
CM    Hadoop Manager
ISS   Impala StateStore
ID    Impala Daemon
ICS   Impala Catalog Server
HS    History Server (Spark)
SG    Spark Gateway
RM    Resource Manager (YARN)
JHS   JobHistory Server
NM    NodeManager
YG    YARN Gateway
NN    NameNode (HDFS)
SNN   Secondary NameNode
DN    DataNode
HFS   HttpFS
HMS   Hive Metastore
HS2   HiveServer2
WHCS  WebHCat Server
G     Gateway
H     Hue

Page 50: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Network Architecture Diagram

Components: Client Machine, Head Node, Compute Nodes (3), Anaconda Server

Ports: 22 (SSH)

Ports: 4505 (Salt), 4506 (Salt)

Ports: 8080 (HTTP), 8443 (HTTPS)
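
When a cluster does not come up as expected, it can help to confirm that the ports on this slide are reachable. The sketch below uses only the Python standard library. The port-to-service table comes from the slide, while the function names are assumptions, and which machines need which ports open depends on the diagram and your network layout.

```python
import socket

# Ports listed on this slide and the service each one carries.
REQUIRED_PORTS = {
    22: "SSH",
    4505: "Salt",
    4506: "Salt",
    8080: "HTTP",
    8443: "HTTPS",
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or unresolved host
        return False


def check_host(host, ports):
    """Return a {port: reachable} mapping for the given host."""
    return {port: port_is_open(host, port) for port in ports}
```

For example, check_host("192.168.1.1", [22]) would report whether SSH on a head node at that address (taken from the bare-metal example earlier in the deck) is reachable from the machine running the check.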

Page 51: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Anaconda Subscriptions and Resources

Page 52: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Test-Drive Anaconda on a Cluster

1. Register for an Anaconda Cloud account at Anaconda.org

2. Download Anaconda Cluster using Conda

3. Create a sandbox/demo cluster

$ conda install anaconda-client

$ anaconda login

$ conda install anaconda-cluster -c anaconda-cluster

Page 53: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Anaconda Subscriptions

ANACONDA (FREE FOREVER)
• Open Source Modern Analytics Platform Powered by Python
• Community Support

ANACONDA PRO (starting at $10,000 per year, + $1,000 per year for additional users)
• Anaconda with Support & Indemnification
• Priority 1 support

ANACONDA WORKGROUP (starting at $30,000 per year, + $3,000 per year for additional users)
• Anaconda with High Performance and Team Collaboration
• Priority 1 support

ANACONDA ENTERPRISE (starting at $60,000 per year, + $6,000 per year for additional users)
• Anaconda with Scalable High Performance and Team Collaboration
• Priority 1 support with Dedicated Customer Support Rep

Page 54: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Contact Information and Additional Details

• Contact [email protected] for more information about Anaconda subscriptions, consulting, or training

• View documentation and examples at

docs.continuum.io/anaconda-cluster

• View demo notebooks on Anaconda Cloud

notebooks.anaconda.org/anaconda-cluster

Page 55: Distributed Computing on your Cluster with Anaconda - Webinar 2015

Thank you

Email: [email protected]

Twitter: @ContinuumIO

Kristopher Overholt

Twitter: @koverholt