Mean-shift Clustering for Heterogeneous Architecturescerin/VILLETANEUSE2017/Foutse_project... ·...
Transcript of Mean-shift Clustering for Heterogeneous Architecturescerin/VILLETANEUSE2017/Foutse_project... ·...
Mean-shift Clustering forHeterogeneous Architectures
Foutse Yuehgoh
Journee MAGI Calcul scientifiqueVILLETANEUSE, 2017
July 5, 2017
Foutse Yuehgoh (AIMS) July 5, 2017 1 / 32
OUTLINE
Introduction
1 Generality
2 Heterogeneous Architecture
3 Our Approach
Conclusion
Foutse Yuehgoh (AIMS) July 5, 2017 2 / 32
Introduction
Context
Heterogeneous computing system.
Parallel programing paradigm of Machine learning (ML).
Need of specific construction languages to configure and reconfigurehardware according to the machine learning needs.
Foutse Yuehgoh (AIMS) July 5, 2017 3 / 32
Introduction
Problematic
Slow hardware designs.
Slow testing and evaluation time.
Slow and expensive fabrication.
Design cost dominate.
Foutse Yuehgoh (AIMS) July 5, 2017 4 / 32
Introduction
Objective
Performance optimizations for heterogeneous machine learningcomputing.
Foutse Yuehgoh (AIMS) July 5, 2017 5 / 32
Generality
Machine Learning
Building hardware or software that can achieve tasks such as:
Extracting features from data in order to explore or solve predictiveproblems,
Automatically learn to recognize complex patterns and makeintelligent decisions based on insight generated by learning fromexamples.
Foutse Yuehgoh (AIMS) July 5, 2017 6 / 32
Generality
Continues increase in data size
Consequences
Driven by the need for more explanatory power, tremendous pressure hasbeen on ML methods to scale beyond a single machine, due to both spaceand time bottlenecks.
Foutse Yuehgoh (AIMS) July 5, 2017 7 / 32
Generality
Complexity of algorithms
Their learning time grows as the size and complexity of the trainingdataset increases.
Fortunately, they have special properties which, if properly harnessed by awell-designed system, can yield a 10× or even 100× speed boost.
Foutse Yuehgoh (AIMS) July 5, 2017 8 / 32
Generality
Importance
Efficient computational methods and algorithms
On very large data sets,
in parallel to complete the machine learning tasks in reasonable time.
Processing vast amounts of data simultaneously, is important in neuralnetworks and machine learning algorithms, just as it is the case for thehuman brain.
Foutse Yuehgoh (AIMS) July 5, 2017 9 / 32
Generality
Review of results
Speeding up multidimensional clustering in statistical data analysis hasbeen the focal point of numerous papers that have employed differenttechniques to accelerate the clustering algorithms.
Example: Pipelined Euclidean distance hardware diagram
Foutse Yuehgoh (AIMS) July 5, 2017 10 / 32
Generality
Constructing Hardware in a Scala Embedded Language (Chisel)
Chisel eases the building of simple, reliable, and efficient software toharness the parallelism inherent in FPGA technology.
Foutse Yuehgoh (AIMS) July 5, 2017 11 / 32
Heterogeneous Architecture
Definition
A heterogeneous architecture is a novel architecture that incorporatesboth general-purpose and specialized cores on the same chip.
Takes care of generic control and computation;
Accelerates frequently used or heavy weight applications.
Challenge
Foutse Yuehgoh (AIMS) July 5, 2017 12 / 32
Heterogeneous Architecture
Example
(a) CBEA: Traditional CPU core and eight SIMD accelerator cores.
(b) GPU: 30 highly multi-threaded SIMD accelerator cores incombination with a standard multicore CPU.
(c) FPGA: An array of logic blocks in combination with a standardmulti-core CPU.
Foutse Yuehgoh (AIMS) July 5, 2017 13 / 32
Heterogeneous Architecture
Serial and Parallel Computing
Parallel computing enables us to solve problems that benefits from orneed faster solutions and require large amount of memory.
Foutse Yuehgoh (AIMS) July 5, 2017 14 / 32
Heterogeneous Architecture
Hardware difficulties
Foutse Yuehgoh (AIMS) July 5, 2017 15 / 32
Our Approach
Strategy
Direct implementation of the algorithm on FPGA development board.
How do we split and partition the ML program over a heterogeneousmachines, and which best language do we use?
Foutse Yuehgoh (AIMS) July 5, 2017 16 / 32
Our Approach
MEAN-SHIFT ALGORITHM
Goal: Produce clusters on input data.
Fixes a window around each data point;
Computes the mean of the data within the window;
Shifts the window to the mean and repeat until we have convergence.
Key Property
An exceptionally appealing and non-parametric clustering techniques thatdeal without prior notion on the number of clusters.
Foutse Yuehgoh (AIMS) July 5, 2017 17 / 32
Our Approach
Foutse Yuehgoh (AIMS) July 5, 2017 18 / 32
Field Programmable Gate Array (FPGA)
Altera Cyclone IV E FPGA DE2-115 Development kid
Foutse Yuehgoh (AIMS) July 5, 2017 19 / 32
Field Programmable Gate Array (FPGA) :
Key Properties
Foutse Yuehgoh (AIMS) July 5, 2017 20 / 32
Why FPGA?
Advantageous properties
Foutse Yuehgoh (AIMS) July 5, 2017 21 / 32
Important properties of Chisel
Foutse Yuehgoh (AIMS) July 5, 2017 22 / 32
Why Chisel?
Maximum of a vector
Foutse Yuehgoh (AIMS) July 5, 2017 23 / 32
Why Chisel?
Manual Code parallelization
Foutse Yuehgoh (AIMS) July 5, 2017 24 / 32
Why Chisel?
Chisel Code parallelization
Foutse Yuehgoh (AIMS) July 5, 2017 25 / 32
Running the Chisel Simulation
Generated Verilog
Foutse Yuehgoh (AIMS) July 5, 2017 26 / 32
Challenges encountered
Errors in Chisel
Type mismatches, syntax, (Scala compiler errors).
Errors found during low level transforms and verilog emission (Firrtlerrors).
Illegal chisel but legal scala (Chisel checks).
Underlying implementation crashed(Java Stack Trace).
Foutse Yuehgoh (AIMS) July 5, 2017 27 / 32
Advantages with Chisel for hardware implementation [1]
Important advantages of Chisel
Designer productivity with 3-stage 32-bit RISC processor reviles a 3×code reduction with Chisel compared to hand written Verilog.
Quality of results:
Foutse Yuehgoh (AIMS) July 5, 2017 28 / 32
Advantages with Chisel for hardware implementation[1]
Important advantages of Chisel
Simulator Speed with a (88, 291, 350 cycles total) Processor:
Foutse Yuehgoh (AIMS) July 5, 2017 29 / 32
Conclusion
Chisel breaks the barrier between hardware and software thus is agood tool for hardware design.
Rather than slowing-down data sender (during fast arrival of data inchunks), an FPGA allows you to add more processing ”blocks” toaccelerate the processing.
An FPGA, for example, can process 10 data streams in parallel (i.econcurrently), whereas in software each stream would have to bebuffered and processed one at a time.
Hence this approach is worth giving it a try!!!
Foutse Yuehgoh (AIMS) July 5, 2017 30 / 32
Some References
Bachan, John Constructing Hardware in a Scale Embedded Language,institution: Lawrence Berkeley National Laboratory, (2014)
Bailey, Donald G and Johnston, Christopher T Algorithmtransformation for FPGA implementation, booktitle: ElectronicDesign, Test and Application, 2010. DELTA’10. Fifth IEEEInternational Symposium on,pages:77–81, organization:IEEE(2010).
Han, Song and Kang,and others ESE: Efficient Speech RecognitionEngine with Sparse LSTM on FPGA, organization:ACM,(2017).
Liu, Shaoshan and RO, WON W and Liu, Chen and Cristobal-Salas,Alfredo and Cerin Christophe and Han, Jian-Jun and Gaudiot,Jean-Luc,INTRODUCING THE EXTREMELY HETEROGENEOUSARCHITECTURE,publisher: World Scientific(2012)
Foutse Yuehgoh (AIMS) July 5, 2017 31 / 32
Mean Shift, Heterogeneous Architecture (FPGA), Chisel
Foutse Yuehgoh (AIMS) July 5, 2017 32 / 32