Seminar_New -CESG


Transcript of Seminar_New -CESG

Page 1: Seminar_New -CESG


Hardware Implementation of Cascade Support Vector Machine

Qian Wang, Peng Li and Yongtae Kim, Texas A&M University

3/6/2015

Page 2: Seminar_New -CESG


Outline

Motivation

Support Vector Machine

– Basic Support Vector Machine

– Cascade Support Vector Machine

– Hardware Architecture of Cascade SVM

– Experimental results

Relevant Works in Our Group

– Memristor-based Neuromorphic Processor

– Liquid State Machine

Page 3: Seminar_New -CESG


Everything is becoming more and more data-intensive:

• Bioinformatics researchers often need to process tens of billions of data points.

• The world’s fastest radio telescope collects up to 360 TB of data per day.

• Wearable devices process the data collected from our bodies every day.

What can we do with “Big Data”?

• Machine learning from a large data set to reveal relationships and dependencies and to predict outcomes and behaviors;

• The resulting predictive model is then used to interpret and predict new data.

(Images: the Human Genome Project, astronomy research, smart healthcare devices, the Big Data market.)

Page 4: Seminar_New -CESG


(Images: the “Curiosity” rover on Mars, speech recognition, social networks, bioinformatics.)

Machine Learning (Mitchell 1997)
– Learn from past experience to improve the performance of a certain task.

– Applications of machine learning:

– Integrating human expertise into artificial-intelligence systems;

– Enabling “Mars rovers” to navigate autonomously;

– Speech recognition;

– Extracting hidden information from large, complex data sets;

– Social media analysis; bioinformatics.

Page 5: Seminar_New -CESG


Challenges

Machine Learning Applications on a General-purpose CPU:

• Take a huge amount of CPU time (e.g., several weeks or even months).

• Consume a very large amount of energy.

(Image: software simulation.)

Page 6: Seminar_New -CESG


A specific task: Y = A·X² + B·X + C, using 5-bit fixed-point numbers.

(Figure: the same program running on a CPU vs. dedicated hardware, assuming the same clock rate.)

Our Solutions
– A dedicated VLSI hardware design is usually much more time- and energy-efficient than a general-purpose CPU:

Not limited by an instruction set;

Only the functional logic necessary for the specific task;

No need for instruction memory (program code);

Fully exploits hardware parallelism (see the sketch below).
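As a toy illustration of the example task above (not the actual hardware, and with an assumed signed 5-bit range and saturation behavior), a dedicated datapath simply wires multipliers and adders together, whereas a CPU must step through an instruction sequence for the same arithmetic:

```python
# Toy model of the example task Y = A*X^2 + B*X + C on 5-bit fixed-point
# operands (assumed signed range -16..15 with saturation).
# In dedicated hardware the stages below are physical multipliers/adders
# operating in parallel; on a CPU each step is a fetched instruction.

def sat5(v):
    """Saturate an integer to the signed 5-bit range [-16, 15]."""
    return max(-16, min(15, v))

def poly_datapath(a, b, c, x):
    """One pass through the fixed datapath: X^2, A*X^2 and B*X, then the adder tree."""
    x2 = sat5(x * x)                      # squaring stage
    ax2 = sat5(a * x2)                    # multiplier 1
    bx = sat5(b * x)                      # multiplier 2 (parallel to multiplier 1 in HW)
    return sat5(sat5(ax2 + bx) + c)       # adder tree

if __name__ == "__main__":
    print(poly_datapath(a=2, b=-3, c=1, x=3))   # saturates at the A*X^2 stage; prints 7
```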

Page 7: Seminar_New -CESG


How do we design hardware?

(Figure: design considerations linking software algorithms to dedicated hardware designs, i.e., Application Specific Integrated Circuits (ASICs) and Field Programmable Gate Arrays (FPGAs): speed, power, area, reconfigurability, potential parallelism, reusability, scalability, hardware-friendly algorithms, binary arithmetic (precision), storage organization, analog-to-digital conversion, memory access styles, resilience, and the various features of the ML algorithm to be realized in HW.)

Page 8: Seminar_New -CESG


Publications

Support Vector Machine

– [TVLSI’14] Qian Wang, Peng Li and Yongtae Kim, “A parallel digital VLSI architecture for integrated support vector machine training and classification,” in IEEE Trans. on Very Large Scale Integration Systems.

Spiking Neural Network

– [IEEENano’14] *Qian Wang, *Yongtae Kim and Peng Li, “Architectural design exploration for neuromorphic processors with memristive synapses,” in Proc. of the 14th Intl. Conf. on Nanotechnology, August 2014.

– [IEEETNANO’14] *Qian Wang, *Yongtae Kim and Peng Li, “Neuromorphic Processors with Memristive Synapses: Synaptic Crossbar Interface and Architectural Exploration” (Under Review)

– [TVLSI’15] *Qian Wang, *Youjie Li, *Botang Shao, *Siddharta Dey and Peng Li, “Energy Efficient Parallel Neuromorphic Architectures with Approximate Arithmetic on FPGA” (Under Review)

Page 9: Seminar_New -CESG


Outline

Motivation

Support Vector Machine

– Basic Support Vector Machine

– Cascade Support Vector Machine

– Hardware Architecture of Cascade SVM

– Experimental results

Relevant Works in Our Group

– Memristor-based Neuromorphic Processor

– Liquid State Machine

Page 10: Seminar_New -CESG


Support Vector Machine (SVM)

(Figure: a 2-D feature space (x1, x2) showing the separating hyperplane wᵀx + b = 0, its margin, the support vectors, and how future input vectors are classified as “+” or “-”.)

Basic idea: construct a separating hyperplane such that the margin of separation between “+” and “-” samples is maximized.

Primal (soft-margin) problem:

$$\text{Minimize}\quad \Phi(w,\xi) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \qquad \text{s.t.}\quad y_i\big(w^{T}\phi(x_i)+b\big) \ge 1-\xi_i,\;\; \xi_i \ge 0$$

Applying the method of Lagrange multipliers yields the dual problem:

$$\text{Maximize}\quad \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i,x_j) \qquad \text{s.t.}\quad 0 \le \alpha_i \le C,\;\; \sum_{i=1}^{n}\alpha_i y_i = 0$$

where the kernel is $K(x_i,x_j) = \langle\phi(x_i),\phi(x_j)\rangle$.

A learning and classification algorithm successfully applied to a wide range of real-world pattern recognition problems
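As a concrete, purely illustrative example of the formulation above (the data set, kernel parameters, and library call are our own choices, not from the slides), a few lines of scikit-learn show that after training only a subset of the samples remains as support vectors with nonzero α:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)      # simple synthetic labeling

# Train a soft-margin SVM with a Gaussian (RBF) kernel.
clf = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(X, y)

# Only the support vectors (samples with alpha_i > 0) define the hyperplane.
print("support vectors per class:", clf.n_support_)
print("alpha_i * y_i coefficients:", clf.dual_coef_.shape)
```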


Page 11: Seminar_New -CESG


(Figure: SVM training takes labeled samples in the (x1, x2) space and acts as a “filtering process”, keeping only the support vectors that define the separating hyperplane wᵀx + b = 0 and its margin; SVM testing then classifies unlabeled samples, giving accurate predictions.)

Kernel method: a kernel value is evaluated between every pair of training samples. During SVM training, if there are n samples, the total number of kernel calculations is n²!
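To make the n² count concrete, here is a plain NumPy sketch (floating point with an assumed kernel width gamma, not the fixed-point hardware) that evaluates the full Gaussian kernel matrix:

```python
import numpy as np

def gaussian_kernel_matrix(X, gamma=1.0):
    """Evaluate K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2) for every pair.

    With n training samples this is exactly n*n kernel evaluations,
    which is what dominates SVM training time as n grows."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):              # n iterations ...
        for j in range(n):          # ... times n iterations = n^2 evaluations
            d = X[i] - X[j]
            K[i, j] = np.exp(-gamma * np.dot(d, d))
    return K

X = np.random.rand(400, 2)          # e.g. 400 2-D samples, as in the experiments
K = gaussian_kernel_matrix(X)       # 160,000 kernel evaluations
print(K.shape)
```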

Page 12: Seminar_New -CESG


Cascade SVM

(Figure: the cascade of SVMs [H. P. Graf, Proc. Adv. Neural Inf. Process. Syst., 2004]: the original large data set is split into D1–D4 (Di: i-th data set); four 1st-layer SVMs produce SV1–SV4 (SV: support vectors), which are merged pairwise through 2nd- and 3rd-layer SVMs into a single SV set.)

Training process of the basic SVM
– SVM training is time consuming: it is dominated by kernel evaluations and has O(n²) time complexity.

Parallel SVM (Cascade)
– Parallel processing of multiple smaller sub data sets;
– Partial results are combined in the 2nd and 3rd layers;
– The workload in the 2nd and 3rd layers is small.

Global convergence:
– Feed the 3rd-layer result back to the 1st layer to check the KKT conditions.
– The samples violating the KKT conditions join the next round of optimization.

Amdahl’s law:
– Significant speedup can be achieved if the runtime of the 1st layer dominates.
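A software-level sketch of this cascade flow, using scikit-learn's SVC as a stand-in for each SVM unit (the kernel settings and the simplified feedback test are our own placeholder choices, not the paper's):

```python
import numpy as np
from sklearn.svm import SVC

def train_svm(X, y, C=1.0, gamma=1.0):
    """One 'SVM unit': train on a subset and return its support vectors."""
    clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X, y)
    return X[clf.support_], y[clf.support_], clf

def cascade_round(X, y, n_subsets=4):
    """One 4-2-1 cascade pass: layer 1 in parallel, then pairwise merging."""
    parts = np.array_split(np.arange(len(y)), n_subsets)
    layer = [train_svm(X[idx], y[idx]) for idx in parts]     # D1..D4
    while len(layer) > 1:                                    # layers 2 and 3
        merged = []
        for a, b in zip(layer[0::2], layer[1::2]):
            Xm = np.vstack([a[0], b[0]])
            ym = np.concatenate([a[1], b[1]])
            merged.append(train_svm(Xm, ym))                 # retrain on SV unions
        layer = merged
    return layer[0][2]                                       # final 3rd-layer SVM

def feedback_violators(clf, X, y, tol=1e-3):
    """Simplified feedback test: samples with y*f(x) < 1 re-enter the next round."""
    return y * clf.decision_function(X) < 1.0 - tol

X = np.random.rand(400, 2)
y = np.where(X[:, 0] > X[:, 1], 1, -1)
final_clf = cascade_round(X, y)
print("samples fed back to round 2:", int(feedback_violators(final_clf, X, y).sum()))
```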

Page 13: Seminar_New -CESG


Array of basic SVM units;

Distributed Cache Memories;

Multi-layer System Bus;

Global FSM as Controller;

– Critical issues for the detailed implementation:

How to use a moderate number of SVM units to construct the HW architecture?

How to make efficient use of the on-chip memories?

Flexibility of each SVM unit in processing variable-sized data sets;

Configuring the design differently to trade off power, area, and throughput.

Overall HW Architecture

(Figure: overall HW architecture: an SVM array and distributed memories holding the binary operands yᵢ, αᵢ, xᵢ, connected through a multi-layer system bus (read/write interface, address-mapping control) under a global controller.)

Page 14: Seminar_New -CESG


How to use a moderate number of SVM units to construct the HW architecture?

(Figure: software data flow of a Cascade SVM: D1–D4 feed four 1st-layer SVMs producing SV1–SV4, which are merged into SV12 and SV34 and finally processed by a single SVM.)

The 7 SVMs are not working simultaneously, so we should fully exploit the concept of HW reusability!

• We implement 4 SVM units to perform the 1st-layer training:
• D1–D4 are stored in the distributed memories.
• The SVM units access their private memories in parallel.

• For the 2nd layer, we simply reuse 2 of the 4 SVM units. But how can they find SV1 ∪ SV2 or SV3 ∪ SV4?

• Since the support vectors are a subset of the data already stored in the memories, we simply need to enable each “reused SVM” to access multiple memory blocks.

Page 15: Seminar_New -CESG


(Figure: data flow of the HW architecture, shown next to the software data flow of the Cascade SVM: (a) 1st layer: four SVM units read D1–D4 (y, x(1), x(2)) from their own MEMs and store results through their MMUs; (b) 2nd layer: two reused SVM units each access two MEMs through the MMUs; (c) 3rd layer: one reused SVM unit accesses all four MEMs and produces the new α values.)

How to use a moderate number of SVM units to construct the HW architecture?

• D1–D4 are stored in MEM1–MEM4;

• The 1st-layer SVMs are implemented in HW and reused for the following layers;

• The training results are saved in the MMUs (explained next);

• The final data flow is illustrated in the figure above; a schematic sketch of the reuse schedule follows.
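A schematic sketch of the reuse schedule just described (the unit and memory indices are illustrative labels, not signal names from the design): which SVM units stay active at each layer and which memory blocks the multi-layer bus lets them read.

```python
# Illustrative reuse schedule for the 4-core cascade.
# layer -> {active SVM unit: memory blocks it may read over the multi-layer bus}
REUSE_SCHEDULE = {
    1: {0: [0], 1: [1], 2: [2], 3: [3]},   # D1..D4 processed fully in parallel
    2: {0: [0, 1], 2: [2, 3]},             # two reused units read SV1 U SV2 and SV3 U SV4
    3: {0: [0, 1, 2, 3]},                  # one reused unit reads all recorded SVs
}

for layer, units in REUSE_SCHEDULE.items():
    for unit, mems in units.items():
        print(f"layer {layer}: SVM unit {unit} reads MEM {mems}")
```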

Page 16: Seminar_New -CESG


(Figure: support-vector index tables inside the MMUs map a virtual address space of continuous addresses issued by one SVM unit onto the physical addresses of the SVs held in two separate SRAMs; in the example, MMU (a) records 5 SVs stored in SRAM (a) and MMU (b) records 3 SVs stored in SRAM (b).)

MMU (Memory Management Unit)
– Records the address of each SV;
– Performs the “address mapping” that helps the reused SVM locate the SVs.

How to make efficient use of the on-chip memories? The goal is to “identify” the SVs within the original data set, so we only need to record their locations in memory; we do not duplicate them into additional storage.

(Figure: 1st-layer parallel training on D1 and D2: each SVM unit reads y, x(1), x(2) and α from its own MEM and writes αnew back, while its MMU records the SV addresses; 2nd-layer partial-results combination: the MMUs perform the “address mapping” so that one reused SVM unit can read both MEMs.)
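A behavioral sketch of this MMU idea (class and function names are illustrative, not the RTL): rather than copying support vectors into extra storage, each MMU keeps an index table of SV addresses recorded during 1st-layer training, and later translates the reused SVM's continuous virtual addresses into physical addresses that may fall in different SRAMs.

```python
class MMUModel:
    """Behavioral model of one MMU: records SV addresses during 1st-layer
    training, then acts as a virtual-to-physical translator when its SVM
    unit is reused in a later layer."""

    def __init__(self, sram_id):
        self.sram_id = sram_id
        self.sv_addrs = []              # physical addresses of recorded SVs

    def record_sv(self, phys_addr):
        self.sv_addrs.append(phys_addr)

def translate(virt_addr, mmus):
    """Map one continuous virtual address to (SRAM id, physical address)
    by walking the per-MMU SV index tables in order."""
    for mmu in mmus:
        if virt_addr < len(mmu.sv_addrs):
            return mmu.sram_id, mmu.sv_addrs[virt_addr]
        virt_addr -= len(mmu.sv_addrs)
    raise IndexError("virtual address beyond the merged SV set")

# Made-up example: MMU (a) recorded 5 SVs in SRAM (a), MMU (b) recorded 3 in SRAM (b).
mmu_a, mmu_b = MMUModel("SRAM(a)"), MMUModel("SRAM(b)")
for addr in (0x0, 0x1, 0x3, 0x4, 0x6):
    mmu_a.record_sv(addr)
for addr in (0x4, 0x7, 0x8):
    mmu_b.record_sv(addr)

# The reused SVM simply issues continuous virtual addresses 0..7:
print([translate(v, [mmu_a, mmu_b]) for v in range(8)])
```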

Page 17: Seminar_New -CESG


Implementation of the Multi-layer System Bus
– According to the data flow explained earlier, we want:
– to reuse SVM units for the different layers of the Cascade SVM;
– to let a reused SVM access data stored in multiple memory blocks.
– A multi-layer system bus is required to support all the necessary data transfers.

Page 18: Seminar_New -CESG


Design of a Flexible SVM Unit
– A single SVM unit may be reused for different layers of the cascade tree;

– It should be capable of processing variable-sized data sets;

– A serial processing scheme is applied for the kernel calculation.

(Figure: datapath of one SVM unit: a memory and address generator stream the operands y, x(1), x(2); subtraction and squaring stages followed by an adder and an exponential LUT evaluate the Gaussian kernel kij; a 32-bit multiplier, adder and register accumulate the update; a comparator clips αi to the range {0, C}; a local FSM sequences the i, j indices and the SRAM address ranges 0…3N−1 and 3N…4N−1.)

Implementation details:
– Gaussian kernel;
– 32-bit fixed-point arithmetic (a behavioral sketch of the serial kernel evaluation follows).
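A behavioral sketch of the serial kernel path: per sample pair the unit subtracts and squares one feature at a time, accumulates the squared distance, and looks the exponential up in a table. The Q16.16 scaling, LUT depth, and input range below are assumptions for illustration, not the synthesized design's parameters.

```python
import math

FRAC_BITS = 16                       # assumed Q16.16 fixed-point format (32-bit values)
SCALE = 1 << FRAC_BITS
LUT_SIZE = 256                       # assumed exp(-u) lookup-table depth
LUT_MAX_U = 8.0                      # assumed input range covered by the LUT
EXP_LUT = [int(math.exp(-LUT_MAX_U * k / LUT_SIZE) * SCALE) for k in range(LUT_SIZE)]

def to_fix(v):
    return int(round(v * SCALE))

def gaussian_kernel_fixed(xi, xj, gamma_fix):
    """Serially process the features of xi, xj (sub -> square -> accumulate),
    then return exp(-gamma * dist^2) from the LUT, all in integer arithmetic."""
    acc = 0
    for a, b in zip(xi, xj):                 # one feature per 'cycle'
        d = a - b                            # Sub stage
        acc += (d * d) >> FRAC_BITS          # (.)^2 stage + accumulate
    u = (gamma_fix * acc) >> FRAC_BITS       # scaled squared distance
    idx = min(LUT_SIZE - 1, (u * LUT_SIZE) // to_fix(LUT_MAX_U))
    return EXP_LUT[idx]                      # fixed-point K(xi, xj)

xi = [to_fix(0.25), to_fix(0.75)]
xj = [to_fix(0.50), to_fix(0.10)]
k_fix = gaussian_kernel_fixed(xi, xj, gamma_fix=to_fix(2.0))
print(k_fix / SCALE, math.exp(-2.0 * ((0.25 - 0.50) ** 2 + (0.75 - 0.10) ** 2)))
```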

Page 19: Seminar_New -CESG


Classification & KKT check
– The formulas have a very similar form to those of the training algorithm;

– We can therefore reuse the logic in the SVM units to reduce area overhead.

(Figure: each SVM unit accesses the MEMs through an AMP; the indices of the support vectors and the indices of the KKT violators are recorded for the next round.)

$$\alpha_i = 0 \;\Rightarrow\; y_i\Big(\sum_{j=1}^{N}\alpha_j y_j K(\vec{x}_j,\vec{x}_i)\Big) \ge 1$$

$$0 < \alpha_i < C \;\Rightarrow\; y_i\Big(\sum_{j=1}^{N}\alpha_j y_j K(\vec{x}_j,\vec{x}_i)\Big) = 1$$

$$\alpha_i = C \;\Rightarrow\; y_i\Big(\sum_{j=1}^{N}\alpha_j y_j K(\vec{x}_j,\vec{x}_i)\Big) \le 1$$
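A plain-Python sketch of this check (floating point for clarity; the hardware reuses the fixed-point kernel and accumulate logic of the training path, and the tolerance here is an assumption):

```python
def f(i, alpha, y, K):
    """f(x_i) = sum_j alpha_j * y_j * K(x_j, x_i), reusing precomputed kernel values."""
    return sum(alpha[j] * y[j] * K[j][i] for j in range(len(alpha)))

def kkt_violators(alpha, y, K, C, tol=1e-3):
    """Return indices whose (alpha_i, y_i * f(x_i)) pair breaks one of the three
    KKT cases above; these samples join the next round of optimization."""
    bad = []
    for i in range(len(alpha)):
        yf = y[i] * f(i, alpha, y, K)
        if alpha[i] <= tol and yf < 1 - tol:                    # alpha_i = 0 case
            bad.append(i)
        elif alpha[i] >= C - tol and yf > 1 + tol:              # alpha_i = C case
            bad.append(i)
        elif tol < alpha[i] < C - tol and abs(yf - 1) > tol:    # 0 < alpha_i < C case
            bad.append(i)
    return bad
```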

400 Samples   Without Feedback        One Feedback
              Runtime     Accuracy    Runtime      Accuracy
Flat SVM      0.394 s     98%         unnecessary  -
2-Core        0.104 s     94.25%      0.120 s      98%
4-Core        32.8 ms     92.50%      37.55 ms     98%
8-Core        13.9 ms     89.75%      16.13 ms     98%

The KKT violators still have a chance to get back to the optimization !!!

$$f(\vec{x}) = \sum_{i=1}^{N_{sv}} \alpha_{i}\, y_{i}\, K(\vec{x}, \vec{x}_{i}) \quad \text{(sum over the support vectors)}$$

$$f(\vec{x}) > 0 \;\Rightarrow\; \text{“+”}, \qquad f(\vec{x}) < 0 \;\Rightarrow\; \text{“-”}$$

The address information of the KKT violators is recorded in the MMUs.

(Table above: impact of the feedback on training accuracy and runtime.)

Page 20: Seminar_New -CESG


Experimental Results
– Synthesized using a commercial 90 nm CMOS standard-cell library;

– On-chip memories generated by the corresponding SRAM compiler;

– Layout generated using the same library; area, power, and maximum clock frequency (178 MHz) measured from the layout.

(Figure: decision boundary obtained from training 400 2-D samples.)

(Figure: layout of the 8-core design, 6.68 mm² including I/O pads.)

Page 21: Seminar_New -CESG


200 Samples   Power (mW)   Area (µm²)   Speedup   Energy Reduction
Flat SVM      15.52        373,518      1x        1x
2-Core        27.74        727,946      3.67x     2.05x
4-Core        64.43        1,499,828    10.54x    2.54x
8-Core        126          3,143,700    28.79x    3.54x

Experimental Results

Energy = Runtime x Power

(Figure: runtime (s) and energy (J) versus the number of training samples (50–400), on log scales, for the 1-, 2-, 4-, and 8-core SVM designs.)

As the number of cores increases:
– Power and area increase roughly linearly;

– Speedup increases much faster.

(The plots use datasets of different sizes to evaluate the performance of each HW design; the table focuses on a fixed dataset.)

Page 22: Seminar_New -CESG


(Figure: core area (µm²), power (mW), and speedup for four configurations: Flat SVM (1-core), Temporal Reuse (1-core), Fully Parallel (2-core), and Hybrid (2-core).)

(Figure: (a) temporal reuse of one SVM unit, which processes subsets 1 and 2 from one memory in turn via MMU1/MMU2; (b) temporal reuse of two SVM units, which process subsets 1–4 from two memories via MMU1–MMU4 within the SVM1–SVM7 cascade.)

We can configure the flexible architecture in different ways:

1. Fully parallel processing: reuse SVM units across different layers;

2. Temporal reuse of an SVM unit: reuse SVM units within the same layer.

Because kernel evaluation is O(n²), splitting a data set in half and processing the two halves one after another on a single unit costs only about half as many kernel evaluations, so we can still get about a 2x speedup.

Integrating the “temporal reuse scheme” into the Cascade SVM HW introduces a small area/power overhead and yields a further speedup.

A new angle for the tradeoff between speed and hardware cost! (A back-of-the-envelope sketch follows.)
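A back-of-the-envelope sketch of the roughly 2x claim, counting only 1st-layer kernel evaluations (it ignores the 2nd/3rd-layer work and the feedback iterations):

```python
def first_layer_kernel_evals(n, n_subsets):
    """Kernel evaluations when n samples are split into equal subsets."""
    m = n // n_subsets
    return n_subsets * m * m            # each subset costs m^2 evaluations

n = 400
flat = first_layer_kernel_evals(n, 1)        # one SVM on the whole set: n^2
temporal = first_layer_kernel_evals(n, 2)    # 2 subsets, processed in turn on 1 core
parallel_per_core = temporal // 2            # 2 subsets, processed at once on 2 cores

print(flat / temporal)            # ~2x: fewer total evaluations, even on a single core
print(flat / parallel_per_core)   # ~4x: parallel hardware on top of the smaller subsets
```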

Page 23: Seminar_New -CESG


• Even though the Intel CPU runs at a higher clock frequency and uses a more advanced technology node, our ASIC designs still outperform it by a large margin!

(Figure: comparison of runtimes and energy consumption, software approach vs. hardware approach: a C++ SVM program on an Intel Pentium T4300 (2.1 GHz, 45 nm) vs. the ASIC designs of the Cascade SVMs (178 MHz, 90 nm).)

Page 24: Seminar_New -CESG


Thank you! Questions?