
Scaling Deep Learning to 100s of GPUs on Hops Hadoop

Fabio Buso
Software Engineer
Logical Clocks AB


HopsFS: Next generation HDFS

37x number of files*

16x throughput**

Scale Challenge Winner (2017)

* https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
** https://eurosys2017.github.io/assets/data/posters/poster09-Niazi.pdf


Hops platform

Projects, Datasets, Users

HopsFS, HopsYARN, MySQL NDB Cluster

Spark, TensorFlow, Hive, Kafka, Flink

Jupyter, Zeppelin

Jobs, Grafana, ELK

REST API

Version 0.3.0 just released!


Python first

Conda Repo

Project Conda env

Search

Install/Remove

Python-3.6, pandas-1.4, numpy-0.9

Environment usable by Spark/TensorFlow

Hops python library: make development easy
● Hyperparameter searching
● Manage TensorBoard lifecycle


Find big datasets - Dela*

● Discover, share, and experiment with interesting datasets
● p2p network of Hops clusters
● ImageNet, YouTube8M, Reddit comments...
● Exploits unused bandwidth

*http://ieeexplore.ieee.org/document/7980225/ (ICDCS 2017)

Scale-out level 1: Parallel hyperparameter searching


Parallel Hyperparameter searching

def model(lr, dropout):
    …

args_dict = {'learning_rate': [0.001, 0.0005, 0.0001],
             'dropout': [0.45, 0.7]}

args_dict_grid = util.grid_params(args_dict)

tflauncher.launch(spark, model, args_dict_grid)

Starts 6 parallel experiments
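Three learning rates times two dropout values gives the six grid points. As an illustration of what a helper like `util.grid_params` presumably computes (this sketch is hypothetical, not the Hops implementation), the expansion is a cartesian product:

```python
# Hypothetical sketch of a grid expansion like util.grid_params:
# turn a dict of candidate-value lists into one dict per grid point.
from itertools import product

def grid_params(args_dict):
    keys = sorted(args_dict)
    return [dict(zip(keys, combo))
            for combo in product(*(args_dict[k] for k in keys))]

args_dict = {'learning_rate': [0.001, 0.0005, 0.0001],
             'dropout': [0.45, 0.7]}
grid = grid_params(args_dict)
print(len(grid))  # 3 x 2 = 6 -> six parallel experiments
```

Each of the six dicts can then be handed to one Spark task, which is why the launch starts six experiments in parallel.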

Scale-out level 2: Distributed training


TensorFlowOnSpark (TFoS) by Yahoo!

● Distributed TensorFlow over Spark
● Runs on top of a Hadoop cluster
● PS/Workers executed inside Spark executors
● Uses Spark for resource allocations

– Our version: exclusive GPU allocations
– Parameter server(s) do not get GPU(s)

● Manages TensorBoard


Run TFoS

def training_fun(argv, ctx):
    …
    TFNode.start_cluster_server()
    …

TFCluster.run(spark, training_fun, num_exec, num_ps, …)

Full conversion guide: https://github.com/yahoo/TensorFlowOnSpark/wiki/Conversion-Guide
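Conceptually, the launcher must decide which Spark executors become parameter servers and which become workers before each one starts its piece of the TensorFlow cluster. A toy sketch of such a mapping (the function, executor names, and port are invented for illustration; the real logic lives inside TFCluster):

```python
# Toy sketch: the first num_ps Spark executors take the 'ps' role,
# the rest become 'worker' (names and port 2222 are made up).
def assign_roles(num_executors, num_ps):
    cluster = {'ps': [], 'worker': []}
    for idx in range(num_executors):
        role = 'ps' if idx < num_ps else 'worker'
        cluster[role].append(f'executor-{idx}:2222')
    return cluster

print(assign_roles(4, 1))
# {'ps': ['executor-0:2222'],
#  'worker': ['executor-1:2222', 'executor-2:2222', 'executor-3:2222']}
```

In the Hops version of TFoS, executors mapped to the parameter-server role would simply not be allocated a GPU.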

Scale-out level: Master of the dark arts
Horovod


The parameter server architecture doesn’t scale

From: https://github.com/uber/horovod


Horovod by Uber

● Based on previous work done by Baidu

● Organize workers in a ring
● Gradient updates distributed using All-Reduce

● Synchronous protocol

All-Reduce (initial state)

GPU1: a0 | b0 | c0
GPU2: a1 | b1 | c1
GPU3: a2 | b2 | c2

All-Reduce (after step 1)

GPU1: a0      | b0      | c0 + c2
GPU2: a0 + a1 | b1      | c1
GPU3: a2      | b1 + b2 | c2

All-Reduce (after step 2: reduce-scatter done, each GPU holds one full sum)

GPU1: a0           | b0 + b1 + b2 | c0 + c2
GPU2: a0 + a1      | b1           | c0 + c1 + c2
GPU3: a0 + a1 + a2 | b1 + b2      | c2


All-Reduce (after step 3: all-gather in progress)

GPU1: a0 + a1 + a2 | b0 + b1 + b2 | c0 + c2
GPU2: a0 + a1      | b0 + b1 + b2 | c0 + c1 + c2
GPU3: a0 + a1 + a2 | b1 + b2      | c0 + c1 + c2

All-Reduce (final state: every GPU holds every sum)

GPU1: a0 + a1 + a2 | b0 + b1 + b2 | c0 + c1 + c2
GPU2: a0 + a1 + a2 | b0 + b1 + b2 | c0 + c1 + c2
GPU3: a0 + a1 + a2 | b0 + b1 + b2 | c0 + c1 + c2
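The walkthrough above (a reduce-scatter phase followed by an all-gather phase around the ring) can be checked with a small plain-Python simulation; the worker and chunk indexing here is my own reconstruction for illustration, not Horovod code:

```python
# Simulate ring all-reduce: n workers, each holding n chunks.
# Phase 1 (reduce-scatter): after n-1 steps, each worker owns the
# full sum of one chunk. Phase 2 (all-gather): the completed chunks
# travel around the ring until every worker has every sum.
def ring_allreduce(chunks):
    n = len(chunks)
    buf = [list(c) for c in chunks]
    for step in range(n - 1):                      # reduce-scatter
        msgs = [(i, (i - step) % n, buf[i][(i - step) % n])
                for i in range(n)]                 # snapshot simultaneous sends
        for src, k, val in msgs:
            buf[(src + 1) % n][k] += val           # accumulate at the receiver
    for step in range(n - 1):                      # all-gather
        msgs = [(i, (i + 1 - step) % n, buf[i][(i + 1 - step) % n])
                for i in range(n)]
        for src, k, val in msgs:
            buf[(src + 1) % n][k] = val            # overwrite with the full sum
    return buf

# Three GPUs with chunks (a_i, b_i, c_i) as in the slides:
print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
# every worker ends with [12, 15, 18]
```

Each worker sends and receives exactly one chunk per step, so the ring pattern spreads traffic evenly over the interconnect instead of funnelling all gradients through parameter servers.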


Hops AllReduce

import horovod.tensorflow as hvd

def conv_model(feature, target, mode):
    …

def main(_):
    hvd.init()
    opt = hvd.DistributedOptimizer(opt)
    if hvd.local_rank() == 0:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
        …
    else:
        hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
    …

from hops import allreduce
allreduce.launch(spark, 'hdfs:///Projects/…/all_reduce.ipynb')

Demo time!

Play with it → hops.io/?q=content/hopsworks-vagrant

Doc → hops.io
Star us! → github.com/hopshadoop
Follow us! → @hopshadoop