The Reconfigurable Streaming Vector Processor...

Post on 30-Mar-2020

4 views 0 download

Transcript of The Reconfigurable Streaming Vector Processor...

1

The Reconfigurable Streaming Vector Processor (RSVPTM)

Silviu Ciricescu, Ray Essick, Brian Lucas, Phil May,Kent Moat, Jim Norris, Michael Schuette, and Ali Saidi†

Motorola Labs, Motorola, Schaumburg, ILThe Mitre Corporation, Bedord, MA†

2

Introduction

CoprocessorA special-purpose processing unit that assists the CPU in performing certain types of operations

Vector ArchitectureTraditionally used in supercomputersHeavily pipelined architecture that operates on vectors and matricesVector processors are machines built primarily to handle large scientific and engineering calculations

3

Introduction (Cont.)Multimedia Functions

Computationally intensiveData streamingApplications: Image/Video capturing, Handwriting recognition, Voice recognitionMany more portable/embedded applications

Streaming DataData is produced/acquired as a stream of elementsRelevant for a short period of timeUndergoes same set of computationHigh degree of spatial localityRelatively poor temporal localityData access patterns allow prefetching of data ahead of computation

4

Introduction to RSVP

RSVP is a vector coprocessor architectureAccelerates streaming data operationsTargets multimedia functionsRSVP Programming model

Streaming data description – Vector shape and sizeComputation description – Data Flow GraphsAbove descriptions are intuitive and independent of each otherMachine Independent

5

Motivation for RSVPVector Processing Architectures

In recent years, targeting multimedia rather than SupercomputerMMX extensions to Intel IA32 architectureAltiVec extensions to PowerPC architecture“Wide Word SIMD”

RISC like Load-store Programming modelWide fixed-sized vector registersLevel of abstraction low

RSVP better Architecture than “Wide Word SIMD”

Performance Gap between memory and processing

6

Streaming Vector Data Processing using RSVP Architecture

A coprocessor to operate synchronously with an existing host CPUA programming model that separates the description of data from computation

Data described by the location and shape in memoryComputation described by Data Flow Graph

7

Streaming Computation ModelDecoupled Operand FetchDeep Pipelining (function unit chaining)SIMD Processing

8

Streaming Computation ModelRSVP architecture utilizes a stream-oriented approach to vector ProcessingDecouples and overlaps data access and data processingDecoupled Operand Fetch

Vector stream units – independent load/store unitVSUs communicate with processing unit through interlocked FIFO queues

Deep PipeliningProcessing unit split into N-stage pipeline by chaining multiple function units togetherAllows higher clock frequencies and increased resource utilization

SIMD processingMultiple taps on each of the VSUsParallelism limited by resource limitation, algorithm characteristics, vector size

9

Describing Data

Vector processing handled by Vector Streaming Unit (VSU)Vector description consists of pointer to first element in each vector in memory and description of vector shapeShape of vector consists of three scalar values: Stride, Skip, Span

10

Vector Shape

11

Describing Computation Using Data Flow Graphs

Synchronous DFG language expresses vector operations in a machine independent mannerDependencies explicitly stated to facilitate parallel executionDFG Node description

Input Operands – Reference to previous nodes rather than named registers (Data Dependence)Operation to be performedMinimum precision of its output values

Iteration-to-Iteration dependenceTunnel nodes – source and sink of data flow same

Order dependenceSequential execution (linear DFG) should be matched by any parallel execution

12

Scheduling and Binary Compatibility

Binary form of linear DFG executed by RSVPLinear DFG may not be best suited for direct execution on a particular RSVP implementationUniversal Fat Binaries (UFB) provided

More than one binary form of DFGDFG Compiler creates themLinked list with linear DFG appearing lastThe binary form that first executes will be used

13

Quant Programming ExampleCompresses video images through quantization

14

Quant Programming Example – Data Flow Graph

15

Architecture

16

ArchitectureInput and output VSUs

Minimum 1 and 3 respectivelyMaximum 64VSUs handle all issues related to loading/storing data

DFGNo more than 256 nodes“reach back” no more than 63 nodes

Accumulator and scalar/tunnel registersMinimum 2 and 16 respectivelyMaximum 64

SchedulerConverts DFG to machine dependent form

ControlImplements machine dependent form of DFG

Function unitsShould support all operations allowed by DFG

17

ImplementationFirst Implementation of RSVP

Low cost, low power solutionFabricated in TSMC 0.18um CMOS Technology

Tool setCompiler, assembler, linker for ARMCompiler for RSVP linear DFG

AreaComparable to ARM9 host processorIn effect, doubling area resulted in greater than 2x increase in performance

Power DissipationComparable to ARM9 host processorClock gating done to reduce system power dissipation

18

First Implementation

19

Reconfigurability

Fabric and QueueReconfigurable interconnect20 links, each link can transfer 16-bits of data from its source to its destinationLinks can be reconfigured every cycle

Function units64 bits wide, sliced on 16-bit boundariesFully pipelined with result latching

20

Results – Kernel Speedups

Magnitude of speedup results from large effective Instruction width of RSVP ImplementationIssue Rate: ARM9 – 0.78 IPC, RSVP – 9 IPC

WWSIMD – 4 Instructions per cycle

21

Results – Application Speedups

22

Conclusion

RSVP architecture vector coprocessor/accelerator architectureimproves general purpose CPU performance on streaming data applicationsImproves time to market because of ease of programmabilitySpeedups for kernels and applications range from 2 to over 20 times that of a host processor alone

23

Thank you

Questions ???