Apache Horn (Incubating): a Large-scale Deep Learning Platform
Edward J. Yoon (@eddieyoon)
Oct 15, 2015 @ R3 Diva-Hall, Samsung Electronics
I am ..
● Member of Apache Software Foundation
● PMC member, committer, or Mentor of:
  ○ Apache Incubator,
  ○ Apache Hama, Apache Horn, Apache MRQL,
  ○ and Apache Rya, Apache BigTop.
● Cloud Tech Lab, Software R&D Center.
  ○ HPC Cloud (Network Analysis, ML & DNN)
What’s Apache Horn?
Horn [hɔ:n]: 얼(혼) 魂, Korean for “soul/spirit” = Mind
● Horn is a clone project of Google’s DistBelief, supporting both data and model parallelism.
  ○ Apache Incubator Project (since Sep 2015)
  ○ 9 initial members from Samsung Electronics, Microsoft, Cldi Inc, LINE plus, TUM, KAIST, …, etc.
Google’s DistBelief
● GPUs are expensive, both to buy and to rent.
● Most GPUs can hold only a relatively small amount of data in memory, and CPU-to-GPU data transfer is very slow.
  ○ Therefore, the training speed-up is small when the model does not fit in GPU memory.
● DistBelief is a framework for training deep neural networks that avoids a GPU-only approach (for the above reasons) and handles problems with large numbers of examples and dimensions (e.g., high-resolution images).
Google’s DistBelief
● It supports both Data and Model Parallelism:
  ○ Data Parallelism: the training data is partitioned across several machines, each holding its own replica of the model. Each replica trains on its own partition of the data in parallel.
  ○ Model Parallelism: the layers of each model replica are distributed across machines.
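As a toy illustration of data parallelism (not DistBelief’s actual code), the sketch below shards a tiny training set across two model replicas, computes each replica’s gradient locally, and averages the gradients the way a central parameter server would. The model, data, and learning rate are all invented for illustration:

```java
import java.util.Arrays;

// Toy data parallelism: each "replica" computes a gradient on its own
// partition of the data, and the gradients are averaged centrally,
// as a parameter server would do.
public class DataParallelSketch {

  // Gradient for a 1-D model y = w * x with squared error:
  // dL/dw = 2 * (w * x - y) * x, averaged over the partition.
  static double gradient(double w, double[][] partition) {
    double g = 0.0;
    for (double[] example : partition) {
      double x = example[0], y = example[1];
      g += 2 * (w * x - y) * x;
    }
    return g / partition.length;
  }

  // Average the per-replica gradients (the parameter server's job).
  static double average(double[] grads) {
    return Arrays.stream(grads).average().orElse(0.0);
  }

  public static void main(String[] args) {
    double w = 0.0;
    // Two partitions of the training data, one per model replica.
    // The underlying relation is y = 2x, so w converges toward 2.0.
    double[][][] partitions = {
      {{1, 2}, {2, 4}},  // replica 0's shard
      {{3, 6}, {4, 8}}   // replica 1's shard
    };
    for (int step = 0; step < 100; step++) {
      double[] grads = new double[partitions.length];
      for (int r = 0; r < partitions.length; r++) {
        grads[r] = gradient(w, partitions[r]); // computed in parallel in practice
      }
      w -= 0.05 * average(grads); // parameter server applies the averaged update
    }
    System.out.println("learned w = " + w);
  }
}
```

In a real system the two gradient computations run on different machines; here they are sequential only to keep the sketch small.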
DistBelief: Basic Architecture
Each worker group performs minibatches in the BSP paradigm, and interacts with the Parameter Server asynchronously.
What’s BSP?
● Bulk Synchronous Parallel
  ○ Developed by Leslie Valiant of Harvard University during the 1980s.
● Iteratively:
  a. Local Computation
  b. Communication (Message Passing)
  c. Global Barrier Synchronization
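The three phases above can be sketched with plain Java threads: a shared inbox per peer stands in for message passing, and a CyclicBarrier provides the global synchronization. This illustrates the BSP model only; it is not Apache Hama’s API:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CyclicBarrier;

// Toy BSP: in each superstep every peer (a) computes locally, (b) sends a
// message to its neighbor, and (c) waits at a global barrier; messages
// become visible only after the barrier, as in the BSP model.
public class BspSketch {
  static final int PEERS = 3;
  static final ConcurrentHashMap<Integer, ConcurrentLinkedQueue<Integer>> inbox =
      new ConcurrentHashMap<>();
  static final CyclicBarrier barrier = new CyclicBarrier(PEERS);

  public static void main(String[] args) throws InterruptedException {
    for (int p = 0; p < PEERS; p++) inbox.put(p, new ConcurrentLinkedQueue<>());
    Thread[] threads = new Thread[PEERS];
    for (int p = 0; p < PEERS; p++) {
      final int me = p;
      threads[p] = new Thread(() -> {
        try {
          for (int superstep = 0; superstep < 2; superstep++) {
            int local = me * 10 + superstep;        // (a) local computation
            inbox.get((me + 1) % PEERS).add(local); // (b) message passing
            barrier.await();                        // (c) global barrier sync
            Integer msg = inbox.get(me).poll();     // read last step's messages
            System.out.println("superstep " + superstep + ": peer " + me + " got " + msg);
            barrier.await(); // align peers before starting the next superstep
          }
        } catch (Exception e) { throw new RuntimeException(e); }
      });
      threads[p].start();
    }
    for (Thread t : threads) t.join();
  }
}
```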
DistBelief: Batch Optimization
The Coordinator:
1) finds stragglers (slow tasks) for better load balancing and resource usage, similar to Google MapReduce’s “Backup Tasks”;
2) reduces communication overheads between the central Parameter Server and the workers, acting something like Aggregators.
As a result:
● A CPU cluster can train deep networks significantly faster than a GPU, with no limit on the maximum model size.
  ○ The CPU cluster was 10x faster than a GPU.
● Trained a model with over 1 billion parameters, achieving better than state-of-the-art performance on the ImageNet challenge.
Nov 2012: IBM simulates 530 billion neurons and 100 trillion synapses, using 1,572,864 processor cores, 1.5 PB of memory, and 6,291,456 threads.
Wait, .. Why do we need this?
● Deep learning is likely to spur other applications beyond speech and image recognition in the near term.
  ○ e.g., medicine, manufacturing, and transportation.
and, it’s Closed Source Software
● We need to handle size matters (the size of the training set and of the neural networks), but many OSS projects such as Caffe, DeepDist, Spark MLlib, Deeplearning4j, and NeuralGiraph are data-parallel or model-parallel only.
● So, we started to clone Google’s DistBelief, called Apache Horn (Incubating).
The key idea of implementation
● .. is to use existing OSS distributed systems:
  ○ Apache Hadoop: Distributed File System, Resource Manager.
  ○ Apache Hama: a general-purpose BSP computing engine on top of Hadoop, which can be used for both data-parallel and graph-parallel workloads in a flexible way.
Apache Hama: BSP framework
[Diagram: Task 1 … Task N on the BSP framework (Hama or YARN), over Hadoop HDFS]
Like MapReduce, the Apache Hama BSP framework schedules tasks according to the distance between a task’s input data and the requesting nodes.
BSP tasks are globally synchronized after performing computation on local data and communication actions.
Global vs. Regional Synchronization
[Diagram: Tasks 1–6 split into groups on the BSP framework (Hama or YARN), over Hadoop HDFS]
All tasks within the same group are synchronized with each other. Each group works asynchronously as an independent BSP job.
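Regional synchronization can be sketched by giving each group its own barrier, so a task waits only for members of its own group while the groups run as independent BSP jobs. This is illustrative only (not Hama’s implementation); group sizes, superstep counts, and delays are made up:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Toy regional synchronization: tasks synchronize only within their group.
// Group 0 runs five fast supersteps while group 1 runs two slow ones;
// neither group ever waits for the other.
public class RegionalSyncSketch {
  // Completed supersteps per group (each barrier trip increments one).
  static final AtomicInteger[] SUPERSTEPS = { new AtomicInteger(), new AtomicInteger() };

  public static void main(String[] args) throws InterruptedException {
    int[] tasksPerGroup = {3, 3};
    int[] stepsPerGroup = {5, 2};
    long[] delayMs = {1, 20};
    CountDownLatch done = new CountDownLatch(6);

    for (int g = 0; g < 2; g++) {
      final int group = g;
      // Each group has its OWN barrier: regional, not global, synchronization.
      CyclicBarrier regional =
          new CyclicBarrier(tasksPerGroup[g], () -> SUPERSTEPS[group].incrementAndGet());
      for (int t = 0; t < tasksPerGroup[g]; t++) {
        new Thread(() -> {
          try {
            for (int s = 0; s < stepsPerGroup[group]; s++) {
              Thread.sleep(delayMs[group]); // this task's local computation
              regional.await();             // wait for group members only
            }
            done.countDown();
          } catch (Exception e) { throw new RuntimeException(e); }
        }).start();
      }
    }
    done.await();
    System.out.println("group 0 supersteps: " + SUPERSTEPS[0].get()); // 5
    System.out.println("group 1 supersteps: " + SUPERSTEPS[1].get()); // 2
  }
}
```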
Async mini-batches using Regional Synchronization
[Diagram: Tasks 1–6 split into groups on the BSP framework (Hama or YARN), over Hadoop HDFS; each group runs its own mini-batches]
Each group performs minibatches in the BSP paradigm, and interacts with the Parameter Server asynchronously.
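The asynchronous interaction with the Parameter Server can be sketched as follows: each worker group pulls the current weights, computes a mini-batch update, and pushes the delta back with no cross-group synchronization. This is a toy model of the pattern, not Horn’s code; the ParameterServer class, loss, and learning rate are invented for illustration:

```java
import java.util.concurrent.CountDownLatch;

// Toy asynchronous parameter server: each worker group pulls the current
// weight, computes a local update on its mini-batch, and pushes the delta
// back whenever it is ready; groups never wait for each other.
public class AsyncPsSketch {
  static class ParameterServer {
    private double weight = 0.0;
    synchronized double pull() { return weight; }
    synchronized void push(double delta) { weight += delta; }
  }

  // Runs `groups` workers for `minibatches` updates each; returns final weight.
  static double train(int groups, int minibatches) throws InterruptedException {
    ParameterServer ps = new ParameterServer();
    double target = 2.0, lr = 0.1;
    CountDownLatch done = new CountDownLatch(groups);
    for (int g = 0; g < groups; g++) {
      new Thread(() -> {
        for (int i = 0; i < minibatches; i++) {
          double w = ps.pull();     // pull the current parameters
          double grad = w - target; // toy gradient of (w - target)^2 / 2
          ps.push(-lr * grad);      // push the update, no synchronization
        }
        done.countDown();
      }).start();
    }
    done.await();
    return ps.pull();
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("weight after async training: " + train(4, 100));
  }
}
```

Updates may be computed from slightly stale weights, yet the weight still converges: this tolerance for staleness is what lets the groups run without waiting for each other.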
Parameter Swapping
[Diagram: Tasks 1–6 in groups on the BSP framework (Hama or YARN) over Hadoop HDFS, swapping parameters with two Parameter Servers]
One of the groups works as a Coordinator.
Neuron-centric Programming APIs
User-defined neuron-centric programming APIs:
The activation and cost functions compute the propagated information or error messages, and send their updates to the Parameter Server (the API is not fully designed yet).
Similar to Google’s Pregel.
Job Configuration APIs

/*
 * Sigmoid Activation Function
 */
public static class Sigmoid extends ActivationFunction {
  public double apply(double input) {
    return 1.0 / (1 + Math.exp(-input));
  }
}

...
public static void main(String[] args) {
  ANNJob ann = new ANNJob();

  // Initialize the topology of the model
  ann.addLayer(int featureDimension, Sigmoid.class, int numOfTasks);
  ann.addLayer(int featureDimension, Step.class, int numOfTasks);
  ann.addLayer(int featureDimension, Tanh.class, int numOfTasks);
  …
  ann.setCostFunction(CrossEntropy.class);
  ..
}
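Since the neuron-centric API was not fully designed yet, the following is only a hypothetical sketch of what a Pregel-style neuron interface could look like. Neuron, SigmoidNeuron, and upward() are invented names, not Apache Horn’s actual classes; only the sigmoid itself matches the slide’s example:

```java
import java.util.List;

// Hypothetical neuron-centric API in the Pregel style: each neuron reacts
// to incoming messages, like a Pregel vertex's compute().
public class NeuronCentricSketch {
  abstract static class Neuron {
    double output;
    abstract void upward(List<Double> inputs); // forward-propagation step
  }

  static class SigmoidNeuron extends Neuron {
    final double[] weights;
    SigmoidNeuron(double... weights) { this.weights = weights; }

    @Override
    void upward(List<Double> inputs) {
      double sum = 0.0;
      for (int i = 0; i < inputs.size(); i++) sum += weights[i] * inputs.get(i);
      output = 1.0 / (1 + Math.exp(-sum)); // same sigmoid as the slide's example
    }
  }

  public static void main(String[] args) {
    SigmoidNeuron n = new SigmoidNeuron(1.0, -1.0);
    n.upward(List.of(3.0, 3.0)); // weighted sum is 0, sigmoid(0) = 0.5
    System.out.println(n.output); // prints 0.5
  }
}
```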
Job Submission Flow
[Diagram: the user’s ANN Job is submitted via the Horn Client and Web UI to the BSP framework on Apache Hama or YARN clusters, over Hadoop HDFS. Tasks 1–9 are arranged so that worker groups partition the data (Data Parallelism) while each group’s tasks split the model’s layers (Model Parallelism); groups swap parameters with the Parameter Servers, and one worker group works as a Coordinator.]
Horn Community
● https://horn.incubator.apache.org/
● https://issues.apache.org/jira/browse/HORN
● Mailing lists