Improving Scalability of CMPs with Dense ACCs Coverage


Transcript of Improving Scalability of CMPs with Dense ACCs Coverage

Page 1: Improving Scalability of CMPs with Dense ACCs Coverage

Nasibeh Teimouri, Hamed Tabkhi and Gunar Schirner

Improving Scalability of CMPs with Dense ACCs Coverage

Embedded System Lab. (ESL)

Department of Electrical and Computer Engineering

Northeastern University, Boston (MA), USA

Page 2: Improving Scalability of CMPs with Dense ACCs Coverage

Context:

• Embedded performance-demanding streaming applications

• Vision, software-defined radio, multimedia

• Heterogeneous implementation to meet performance demands under stringent constraints

• Accelerator-based Chip Multi-Processors (ACMPs)


− Trends (among others):

− Increasing ACC coverage

− Increasing density

− Adjacent nodes in HW

[Figure: ACC-based CMP with a host processor, shared memory, and a control/streaming communication fabric (with interrupt line); DMAs serve ACC0–ACC4, each ACC paired with a local SPM]

Page 3: Improving Scalability of CMPs with Dense ACCs Coverage

Challenges with Denser ACCs Coverage

• Processor-centric view
– System orchestration by processor
– Processor becomes bottleneck

• High contention on shared resources

– Memory: local/shared data

– System Communication Fabric: ACC-to-ACC traffic

[Figure: ACC-to-ACC communication in an ACC-based CMP; data between ACC 0 and ACC 1 takes numbered hops through the DMAs, the communication fabric, shared memory, and the host processor]

• Unclear ACC communication semantics
– Relies on processor interaction
• Scalability severely limited with denser ACCs coverage [DAC’15]
→ System bottlenecks
→ ACCs underutilized

Page 4: Improving Scalability of CMPs with Dense ACCs Coverage

Problem Formulation and Contribution

1. Define semantics of ACC communication / interaction

• Foundation for direct ACC-to-ACC communication

2. Transparent Self-Synchronizing (TSS) Architecture Template

• Realizes the semantics
• Mitigates system bottlenecks
• Peer view between processor and ACCs

Page 5: Improving Scalability of CMPs with Dense ACCs Coverage

Outline

• Trend: Increasing ACC Coverage and Density

– Motivation and challenges

• Problem Definition

• ACC-to-ACC Communication Semantics

• Transparent Self-Synchronizing (TSS) Architecture Template

• Experimental Results

• Conclusions


Page 6: Improving Scalability of CMPs with Dense ACCs Coverage

ACC Communication Aspects

• Orchestration by processor

• Data preparation according to the ACC’s job size and data type

• Synchronization of the ACCs and DMA

• DMA data transfer from/to ACC’s memory through communication fabric


• Which aspects need to be defined? (a minimal orchestration sketch follows the list)

1. Granularity of processing: ACC job size?

2. Data access model: when and which memory region is accessible?

3. Marshaling / data representation: adjust the data type for ACC input

4. Synchronization: start, stop, flow control?
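To make the processor-centric flow concrete, here is a minimal C sketch of one ACC job as orchestrated by the host. All register names, addresses, the job size, and the 16-to-8-bit marshaling step are illustrative assumptions, not the actual platform interface.

    /* Minimal sketch (assumption-laden) of processor-centric ACC orchestration
       in the baseline ACMP. Register addresses, names, and the job size are
       illustrative, not the real platform's memory map. */
    #include <stdint.h>

    #define JOB_SIZE     4096u                             /* 1. granularity: elements per ACC job */
    #define ACC_SPM_IN   ((volatile uint8_t  *)0x40000000) /* 2. data access: ACC input buffer     */
    #define ACC_SPM_OUT  ((volatile uint8_t  *)0x40001000) /*    and output buffer in its SPM      */
    #define DMA_SRC      ((volatile uint32_t *)0x50000000)
    #define DMA_DST      ((volatile uint32_t *)0x50000004)
    #define DMA_LEN      ((volatile uint32_t *)0x50000008)
    #define DMA_GO       ((volatile uint32_t *)0x5000000C)
    #define DMA_DONE     ((volatile uint32_t *)0x50000010)
    #define ACC_CFG      ((volatile uint32_t *)0x60000000)
    #define ACC_START    ((volatile uint32_t *)0x60000004)
    #define ACC_DONE     ((volatile uint32_t *)0x60000008)

    static void dma_copy(uint32_t src, uint32_t dst, uint32_t len)
    {
        *DMA_SRC = src; *DMA_DST = dst; *DMA_LEN = len; *DMA_GO = 1;
        while (!*DMA_DONE) ;                   /* host waits for "data copy done"      */
    }

    void host_run_one_job(const int16_t *in, int16_t *out, uint8_t *scratch)
    {
        /* 3. Marshaling: adjust the data type for the ACC input (assumed 16-bit -> 8-bit). */
        for (uint32_t i = 0; i < JOB_SIZE; i++)
            scratch[i] = (uint8_t)(in[i] >> 8);

        /* Move one granularity-adjusted job from shared memory into the ACC's SPM.        */
        dma_copy((uint32_t)(uintptr_t)scratch, (uint32_t)(uintptr_t)ACC_SPM_IN, JOB_SIZE);

        /* 4. Synchronization: configure the ACC for one job, start it, wait for "done"    */
        /*    (interrupt-driven on the real platform; polled here for brevity).            */
        *ACC_CFG   = JOB_SIZE;
        *ACC_START = 1;
        while (!*ACC_DONE) ;

        /* Copy the result back and unmarshal; the processor mediates every single step.   */
        dma_copy((uint32_t)(uintptr_t)ACC_SPM_OUT, (uint32_t)(uintptr_t)scratch, JOB_SIZE);
        for (uint32_t i = 0; i < JOB_SIZE; i++)
            out[i] = (int16_t)((uint16_t)scratch[i] << 8);
    }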


[Figure: Processor-centric ACC communication; the host processor configures the DMA (size, addr) and the ACC over the streaming/control fabric, the DMA copies type/granularity-adjusted data between shared memory and the ACC’s I/O buffers (FIFO/random access), and “processing done” / “data copy done” return to the processor]

Page 7: Improving Scalability of CMPs with Dense ACCs Coverage

ACC Communication Semantic

• Synchronization / Control

– Initializing ACC for each computation and managing FIFO access

• Synchronization signals “IReady”, “ORead”, and “Finished”

• Data access model

– Double buffering

– More general: FIFO with head/tail Random Access (RA)

• Granularity and marshaling management
– Data type/size adjustment of input/output data (see the data-access sketch below)
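A minimal C sketch of the data access model follows: a FIFO of fixed-size buffers whose head and tail buffers are exposed for random access, with double buffering as the depth-2 case. The names, sizes, and the claim/finish pairs (standing in for the ORead/IReady and Finished handshakes) are illustrative assumptions, not the paper's API.

    /* Minimal C model of the data access semantic: a FIFO of fixed-size buffers whose
       head (consumer side) and tail (producer side) are exposed for random access.
       Double buffering is the special case DEPTH == 2. Names/sizes are illustrative. */
    #include <stdint.h>

    #define DEPTH     2                      /* double buffering                       */
    #define BUF_WORDS 256                    /* per-buffer job granularity (assumed)   */

    typedef struct {
        uint32_t buf[DEPTH][BUF_WORDS];
        unsigned head, tail, count;          /* FIFO bookkeeping                       */
    } ra_fifo_t;

    /* Producer side: "ORead"-style check, then random access within the tail buffer.  */
    static uint32_t *claim_output(ra_fifo_t *f)        /* NULL => no free buffer        */
    {
        return (f->count < DEPTH) ? f->buf[f->tail] : 0;
    }
    static void finish_output(ra_fifo_t *f)            /* "Finished": commit the buffer */
    {
        f->tail = (f->tail + 1) % DEPTH;
        f->count++;
    }

    /* Consumer side: "IReady"-style check, then random access within the head buffer. */
    static const uint32_t *claim_input(ra_fifo_t *f)   /* NULL => no data available     */
    {
        return (f->count > 0) ? f->buf[f->head] : 0;
    }
    static void finish_input(ra_fifo_t *f)             /* release the consumed buffer   */
    {
        f->head = (f->head + 1) % DEPTH;
        f->count--;
    }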


[Figure: Data-flow model ACCp → ACCc annotated with the ACC communication semantics: orchestration, synchronization (IReady/ORead/Finished), granularity, and marshaling between the producer’s output (RA/FIFO) and the consumer’s input (FIFO/RA)]

- All semantic aspects currently involve the processor!
- Even for ACC-to-ACC communication

Page 8: Improving Scalability of CMPs with Dense ACCs Coverage

Outline

• Trend: Increasing ACC Coverage and Density

– Motivation and challenges

• Problem Definition

• ACC-to-ACC Communication Semantics

• Transparent Self-Synchronizing (TSS) Architecture Template

• Experimental Results

• Conclusions


Page 9: Improving Scalability of CMPs with Dense ACCs Coverage

TSS: ACC-to-ACC Communication

• Separation of computation and communication

– Input Control Mgmt (ICM) and Output Control Mgmt (OCM)

• Efficient realization of the comm. semantics

– Data access (I/O buffer)

– Synchronization, data granularity management

– Data marshalling

• Local interconnect across the ACCs

– Hides ACC-to-ACC traffic from the system bus (see the handshake sketch below)
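The self-synchronization can be sketched behaviorally in C as below: whenever a producer's OCM has a finished buffer and a consumer's ICM has a free one, one job moves directly over the local interconnect. Types and function names are assumptions for illustration; in TSS this logic is implemented in hardware next to each ACC, not in software.

    /* Behavioral sketch of OCM -> ICM self-synchronization between a producer ACC (ACCp)
       and a consumer ACC (ACCc). Names are illustrative assumptions. */
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct { bool oready; uint32_t *out_buf; uint32_t words; } ocm_t; /* producer side */
    typedef struct { bool iready; uint32_t *in_buf;  uint32_t words; } icm_t; /* consumer side */

    /* One step of the local link: when the producer has a finished output buffer (OReady)
       and the consumer has a free input buffer (IReady), move one job directly, adjusting
       granularity on the way. No processor and no system bus are involved. */
    static bool tss_link_step(ocm_t *ocm, icm_t *icm)
    {
        if (!(ocm->oready && icm->iready))
            return false;                           /* nothing to transfer this cycle       */

        uint32_t n = ocm->words < icm->words ? ocm->words : icm->words; /* granularity adj.  */
        for (uint32_t i = 0; i < n; i++)
            icm->in_buf[i] = ocm->out_buf[i];       /* a marshaling step would hook in here  */

        ocm->oready = false;                        /* buffer drained: ACCp may refill it    */
        icm->iready = false;                        /* buffer filled: ACCc may process it    */
        return true;
    }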


[Figure: TSS ACC-to-ACC communication; ACCp’s OCM (synchronization, granularity, marshaling, output buffers O0/O1, OReady/ORead) feeds ACCc’s ICM (marshaling, synchronization, granularity, input buffers I0/I1, IReady/IRead) over the local interconnect]

Page 10: Improving Scalability of CMPs with Dense ACCs Coverage

TSS: Interconnect Network

• Interconnection network

– Many options: MUX, NoC, Bus

– Full connectivity not needed (only feasible connections)

– Depends on domain


• Current choice: MUX-based interconnect
– Simplicity
– Parallelism (see the routing sketch below)
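A minimal sketch of how the MUX selects could be programmed: each MUX's SEL register picks which OCM output feeds the ICM behind it, so independent flows route in parallel. The encoding and the ACC0 → ACC1 → ACC2 chain below are assumed examples, not a configuration from the paper.

    /* Sketch of the MUX-based local interconnect: one SEL register per MUX chooses which
       OCM output feeds the ICM behind it. Encoding and routing are assumed examples. */
    #include <stdint.h>

    #define N_MUX 6

    static uint8_t SEL[N_MUX];                        /* one select register per MUX        */

    /* Configure one flow: route OCM0 into ICM1 and OCM1 into ICM2 (hypothetical encoding:
       SEL[m] holds the index of the OCM selected by MUX m). */
    static void configure_chain(void)
    {
        SEL[0] = 0;   /* Mux0: OCM0 (ACC0 output) -> ICM1 (ACC1 input) */
        SEL[1] = 1;   /* Mux1: OCM1 (ACC1 output) -> ICM2 (ACC2 input) */
        /* remaining MUXes keep their defaults; infeasible connections are never selected  */
    }

    /* Combinational view of one MUX: forward the selected OCM word to its ICM.            */
    static uint32_t mux_route(unsigned m, const uint32_t *ocm_word)
    {
        return ocm_word[SEL[m]];
    }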

[Figure: MUX-based interconnect; ACC0–ACC8, each with an ICM/OCM pair, are chained through Mux0–Mux5 controlled by SEL0–SEL5, with two input data flows entering and two output data flows leaving the array]

Page 11: Improving Scalability of CMPs with Dense ACCs Coverage

TSS: System Integration and Benefits

• Gateway
• Interface to the system for each flow/stream (ACC chain)
• Configuration & control
• Granularity adjustment

[Figure: Gateway internals; bus interface with MMRs to shared memory and the processor (interrupt line), a control/configuration unit, per-flow ICM0/OCM0 instances (Flow 0–2), double-buffered input/output SPMs (I0/I1, O0/O1), and select signals SEL0–SEL2]

• Benefits
• Each ACC chain appears as one ACC to the processor (see the host-side sketch below)
• Hides all internals
• Much smaller internal granularity
• Minimal, as per each ACC’s algorithm
• Reduces on-chip memory
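From the host's perspective, driving a whole chain through the gateway could look like the C sketch below: one configuration, one start, one completion, exactly as for a single ACC. Register names, offsets, and the flow-selection scheme are illustrative assumptions, not the real gateway interface.

    /* Host-side sketch of using the TSS gateway: the processor configures a whole ACC chain
       (flow) through the gateway's memory-mapped registers and treats it like a single ACC.
       Register names, offsets, and the flow id encoding are assumptions. */
    #include <stdint.h>

    #define GW_BASE      0x70000000u
    #define GW_FLOW_SEL  ((volatile uint32_t *)(GW_BASE + 0x00)) /* which ACC chain / flow   */
    #define GW_JOB_BYTES ((volatile uint32_t *)(GW_BASE + 0x04)) /* external job size        */
    #define GW_SRC_ADDR  ((volatile uint32_t *)(GW_BASE + 0x08)) /* input in shared memory   */
    #define GW_DST_ADDR  ((volatile uint32_t *)(GW_BASE + 0x0C)) /* output in shared memory  */
    #define GW_START     ((volatile uint32_t *)(GW_BASE + 0x10))
    #define GW_DONE      ((volatile uint32_t *)(GW_BASE + 0x14)) /* completion / interrupt   */

    void host_run_flow(uint32_t flow, uint32_t src, uint32_t dst, uint32_t bytes)
    {
        *GW_FLOW_SEL  = flow;     /* select the chain; its ACCs, ICM/OCMs, and MUX routing   */
        *GW_JOB_BYTES = bytes;    /* stay completely hidden behind the gateway              */
        *GW_SRC_ADDR  = src;
        *GW_DST_ADDR  = dst;
        *GW_START     = 1;        /* the gateway splits the job into the much smaller        */
        while (!*GW_DONE) ;       /* internal granularity and self-synchronizes the chain    */
    }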


[Figure: TSS system integration; the host processor and shared memory sit on the control/streaming communication fabric (with interrupt line), and the TSS block contains the gateway (SPM, control unit) and ACC0–ACC8, each with an ICM/OCM pair, connected through Mux0–Mux8]

Page 12: Improving Scalability of CMPs with Dense ACCs Coverage

Outline

• Trend: Increasing ACC Coverage and Density

– Motivation and challenges

• Problem Definition

• ACC-to-ACC Communication Semantics

• Transparent Self-Synchronizing (TSS) Architecture Template

• Experimental Results

• Conclusions


Page 13: Improving Scalability of CMPs with Dense ACCs Coverage

Experimental Setup

• Compare: Processor-centric ACMP, TSS

– Same HW / SW Mapping

– Impact of architecture on performance?

• 8 streaming applications (SDF3)

– H263Dec, H263Enc, MP3dec, MP3PB, Sam.Rate, Modem, Synthetic, Satellite

• ISS-based (OVP) virtual platforms

– Automatically generated

– 2MB total on-chip mem


Virtual Platform Settings
• Processor: ARM9 @ 500 MHz, OS: µC/OS-II
• Communication fabric: multi-layer AMBA AHB (32-bit), 200 MHz, dedicated DMA per channel
• Memory: 2 MB
• ACCs: double-buffered, 200 MHz

Page 14: Improving Scalability of CMPs with Dense ACCs Coverage

TSS over ACMP: System Performance and Memory Saving

• Average speedup: 3 times

– Minimize interaction with the processor

• 1/7th of orchestration demand

– Self-synchronization (OCM/ICM)

– Reduces system load

• 1/7th (avg) of on-chip memory

– Smaller internal job size

• 1/10th of traffic on the system fabric

– ACC-to-ACC communication handled by the local fabric

• 1/8th energy consumption

– Fewer off-chip access

– Smaller on-chip mem.

Page 15: Improving Scalability of CMPs with Dense ACCs Coverage

Outline

• Trend: Increasing ACC Coverage and Density

– Motivation and challenges

• Problem Definition

• ACC-to-ACC Communication Semantics

• Transparent Self-Synchronizing (TSS) Architecture Template

• Experimental Results

• Conclusions


Page 16: Improving Scalability of CMPs with Dense ACCs Coverage

Conclusions

• Defined semantic aspects of ACC communication

– Synchronization

– Data access model

– Data granularity

– Data representation / marshalling

• Introduced an architecture template: Transparent Self-Synchronizing (TSS)

– Efficient realization of semantics

• Separation of computation and communication via ICM/OCM

• Internal interconnect network

• Adjustable internal granularity (through gateway)

– Each ACC chain, regardless of length, appears as one ACC

• Illustrated architecture benefits (processor-centric vs. TSS)

– 8 streaming apps (SDF3) mapped to ISS-based VPs

– 3x speedup (at 1/8th energy consumption) with same HW/SW mapping


Page 17: Improving Scalability of CMPs with Dense ACCs Coverage

Thank you!
