Mean-shift Clustering for Heterogeneous Architectures

Foutse Yuehgoh

Journée MAGI Calcul scientifique, Villetaneuse, 2017

July 5, 2017

Foutse Yuehgoh (AIMS) July 5, 2017 1 / 32

OUTLINE

Introduction

1 Generality

2 Heterogeneous Architecture

3 Our Approach

Conclusion

Introduction

Context

Heterogeneous computing system.

Parallel programming paradigm for machine learning (ML).

Need for specific construction languages to configure and reconfigure hardware according to machine learning needs.

Introduction

Problem statement

Slow hardware designs.

Slow testing and evaluation time.

Slow and expensive fabrication.

Design costs dominate.

Introduction

Objective

Performance optimizations for heterogeneous machine learning computing.

Generality

Machine Learning

Building hardware or software that can achieve tasks such as:

Extracting features from data in order to explore or solve predictive problems,

Automatically learning to recognize complex patterns and make intelligent decisions based on insight generated by learning from examples.

Generality

Continuous increase in data size

Consequences

Driven by the need for more explanatory power, ML methods are under tremendous pressure to scale beyond a single machine because of both space and time bottlenecks.

Generality

Complexity of algorithms

The learning time of ML algorithms grows as the size and complexity of the training dataset increase.

Fortunately, these algorithms have special properties which, if properly harnessed by a well-designed system, can yield a 10× or even 100× speed boost.

Generality

Importance

Efficient computational methods and algorithms are needed that run:

on very large data sets,

in parallel, to complete the machine learning tasks in reasonable time.

Processing vast amounts of data simultaneously is important in neural networks and machine learning algorithms, just as it is for the human brain.

Generality

Review of results

Speeding up multidimensional clustering in statistical data analysis has been the focus of numerous papers, which have employed different techniques to accelerate clustering algorithms.

Example: Pipelined Euclidean distance hardware diagram
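
The hardware diagram itself is not reproduced in this transcript. As a rough software analogue, and only under the assumption that such a pipeline is staged as subtract, square, adder tree, and square root, the computation could be sketched as follows. The function name `euclidean_distance_staged` is hypothetical, and the input vectors are assumed to be non-empty and of equal length.

```python
import math

def euclidean_distance_staged(a, b):
    """Euclidean distance written as four stages, mirroring an assumed hardware pipeline."""
    # Stage 1: element-wise differences (independent of one another,
    # so hardware could compute them all at once).
    diffs = [x - y for x, y in zip(a, b)]
    # Stage 2: element-wise squares (also independent).
    squares = [d * d for d in diffs]
    # Stage 3: adder tree, summing neighbouring pairs level by level.
    level = squares
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an odd element passes through unchanged
        level = nxt
    # Stage 4: square root of the accumulated sum.
    return math.sqrt(level[0])

print(euclidean_distance_staged((0.0, 0.0), (3.0, 4.0)))
```

In a real pipeline each stage would be separated by registers so that a new pair of vectors can enter every clock cycle; the Python version only shows the data flow.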

Generality

Constructing Hardware in a Scala Embedded Language (Chisel)

Chisel eases the building of simple, reliable, and efficient software to harness the parallelism inherent in FPGA technology.

Heterogeneous Architecture

Definition

A heterogeneous architecture is a novel architecture that incorporates both general-purpose and specialized cores on the same chip.

General-purpose cores take care of generic control and computation;

Specialized cores accelerate frequently used or heavyweight applications.

Challenge

Heterogeneous Architecture

Example

(a) CBEA: Traditional CPU core and eight SIMD accelerator cores.

(b) GPU: 30 highly multi-threaded SIMD accelerator cores in combination with a standard multi-core CPU.

(c) FPGA: An array of logic blocks in combination with a standard multi-core CPU.

Heterogeneous Architecture

Serial and Parallel Computing

Parallel computing enables us to solve problems that benefit from or need faster solutions and that require large amounts of memory.

Heterogeneous Architecture

Hardware difficulties

Our Approach

Strategy

Direct implementation of the algorithm on an FPGA development board.

How do we split and partition the ML program over heterogeneous machines, and which language is best suited to the task?

Our Approach

MEAN-SHIFT ALGORITHM

Goal: Produce clusters on input data.

Fixes a window around each data point;

Computes the mean of the data within the window;

Shifts the window to the mean and repeats until convergence.

Key Property

An exceptionally appealing, non-parametric clustering technique that requires no prior knowledge of the number of clusters.
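
The three steps above can be sketched in plain Python. This is a minimal illustration assuming a flat (uniform) kernel, which the slides do not specify. The names `mean_shift`, `group_modes`, `bandwidth`, and `merge_radius` are hypothetical, and the sample points are made up for the example. It requires Python 3.8 or later for `math.dist`.

```python
import math

def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift: return the converged window centre for each point."""
    modes = []
    for start in points:
        centre = list(start)  # fix a window around the data point
        for _ in range(max_iter):
            # Data points that fall inside the current window.
            window = [p for p in points if math.dist(p, centre) <= bandwidth]
            # Mean of the data within the window, one coordinate at a time.
            mean = [sum(coord) / len(window) for coord in zip(*window)]
            shift = math.dist(mean, centre)
            centre = mean  # shift the window to the mean
            if shift < tol:  # converged: the window stopped moving
                break
        modes.append(tuple(centre))
    return modes

def group_modes(modes, merge_radius):
    """Give the same cluster label to points whose modes lie within merge_radius."""
    centres, labels = [], []
    for mode in modes:
        for index, centre in enumerate(centres):
            if math.dist(mode, centre) <= merge_radius:
                labels.append(index)
                break
        else:
            centres.append(mode)
            labels.append(len(centres) - 1)
    return centres, labels

# Illustrative data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
modes = mean_shift(points, bandwidth=1.0)
centres, labels = group_modes(modes, merge_radius=0.5)
print(len(centres), labels)
```

The number of clusters is never passed in; it falls out of how many distinct modes the windows converge to, which is the key property stated above. Each point's window is updated independently of the others, which is what makes the algorithm a candidate for parallel hardware.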

Our Approach

Field Programmable Gate Array (FPGA)

Altera Cyclone IV E FPGA DE2-115 Development Kit

Field Programmable Gate Array (FPGA):

Key Properties

Why FPGA?

Advantageous properties

Important properties of Chisel

Why Chisel?

Maximum of a vector
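
The code shown on this slide is not part of the transcript. As a hedged illustration of why the maximum of a vector parallelizes well, the sketch below computes it as a tree reduction: comparisons within one level are independent, so hardware could perform them simultaneously. This is plain Python, not Chisel, and the function name `tree_max` is hypothetical; the input is assumed to be non-empty.

```python
def tree_max(values):
    """Return (maximum, number of tree levels) using pairwise comparisons."""
    level = list(values)
    depth = 0
    while len(level) > 1:
        # Every comparison in this level is independent of the others.
        nxt = [max(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an odd element passes through unchanged
        level = nxt
        depth += 1
    return level[0], depth

print(tree_max([3, 9, 2, 7, 5, 1, 8, 4]))  # prints (9, 3)
```

A sequential scan over 8 values needs 7 comparisons one after another, whereas the tree needs only 3 levels when each level's comparisons run in parallel.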

Why Chisel?

Manual Code parallelization

Why Chisel?

Chisel Code parallelization

Running the Chisel Simulation

Generated Verilog

Challenges encountered

Errors in Chisel

Type mismatches and syntax errors (Scala compiler errors).

Errors found during low-level transforms and Verilog emission (FIRRTL errors).

Illegal Chisel but legal Scala (Chisel checks).

Underlying implementation crashed (Java stack trace).

Advantages with Chisel for hardware implementation [1]

Important advantages of Chisel

Designer productivity: a 3-stage 32-bit RISC processor reveals a 3× code reduction with Chisel compared to hand-written Verilog.

Quality of results:

Advantages with Chisel for hardware implementation [1]

Important advantages of Chisel

Simulator speed with a processor (88,291,350 cycles total):

Conclusion

Chisel breaks the barrier between hardware and software and is therefore a good tool for hardware design.

Rather than slowing down the data sender (when data arrives quickly in chunks), an FPGA allows you to add more processing "blocks" to accelerate the processing.

An FPGA, for example, can process 10 data streams in parallel (i.e., concurrently), whereas in software each stream would have to be buffered and processed one at a time.

Hence this approach is worth a try!

Some References

[1] Bachan, John. Constructing Hardware in a Scala Embedded Language. Lawrence Berkeley National Laboratory, 2014.

[2] Bailey, Donald G. and Johnston, Christopher T. Algorithm transformation for FPGA implementation. In: Fifth IEEE International Symposium on Electronic Design, Test and Application (DELTA'10), pages 77–81. IEEE, 2010.

[3] Han, Song and Kang, and others. ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA. ACM, 2017.

[4] Liu, Shaoshan and Ro, Won W. and Liu, Chen and Cristobal-Salas, Alfredo and Cérin, Christophe and Han, Jian-Jun and Gaudiot, Jean-Luc. Introducing the Extremely Heterogeneous Architecture. World Scientific, 2012.

Mean Shift, Heterogeneous Architecture (FPGA), Chisel
