Post on 30-Mar-2020
1
The Reconfigurable Streaming Vector Processor (RSVPTM)
Silviu Ciricescu, Ray Essick, Brian Lucas, Phil May,Kent Moat, Jim Norris, Michael Schuette, and Ali Saidi†
Motorola Labs, Motorola, Schaumburg, ILThe Mitre Corporation, Bedord, MA†
2
Introduction
CoprocessorA special-purpose processing unit that assists the CPU in performing certain types of operations
Vector ArchitectureTraditionally used in supercomputersHeavily pipelined architecture that operates on vectors and matricesVector processors are machines built primarily to handle large scientific and engineering calculations
3
Introduction (Cont.)Multimedia Functions
Computationally intensiveData streamingApplications: Image/Video capturing, Handwriting recognition, Voice recognitionMany more portable/embedded applications
Streaming DataData is produced/acquired as a stream of elementsRelevant for a short period of timeUndergoes same set of computationHigh degree of spatial localityRelatively poor temporal localityData access patterns allow prefetching of data ahead of computation
4
Introduction to RSVP
RSVP is a vector coprocessor architectureAccelerates streaming data operationsTargets multimedia functionsRSVP Programming model
Streaming data description – Vector shape and sizeComputation description – Data Flow GraphsAbove descriptions are intuitive and independent of each otherMachine Independent
5
Motivation for RSVPVector Processing Architectures
In recent years, targeting multimedia rather than SupercomputerMMX extensions to Intel IA32 architectureAltiVec extensions to PowerPC architecture“Wide Word SIMD”
RISC like Load-store Programming modelWide fixed-sized vector registersLevel of abstraction low
RSVP better Architecture than “Wide Word SIMD”
Performance Gap between memory and processing
6
Streaming Vector Data Processing using RSVP Architecture
A coprocessor to operate synchronously with an existing host CPUA programming model that separates the description of data from computation
Data described by the location and shape in memoryComputation described by Data Flow Graph
7
Streaming Computation ModelDecoupled Operand FetchDeep Pipelining (function unit chaining)SIMD Processing
8
Streaming Computation ModelRSVP architecture utilizes a stream-oriented approach to vector ProcessingDecouples and overlaps data access and data processingDecoupled Operand Fetch
Vector stream units – independent load/store unitVSUs communicate with processing unit through interlocked FIFO queues
Deep PipeliningProcessing unit split into N-stage pipeline by chaining multiple function units togetherAllows higher clock frequencies and increased resource utilization
SIMD processingMultiple taps on each of the VSUsParallelism limited by resource limitation, algorithm characteristics, vector size
9
Describing Data
Vector processing handled by Vector Streaming Unit (VSU)Vector description consists of pointer to first element in each vector in memory and description of vector shapeShape of vector consists of three scalar values: Stride, Skip, Span
10
Vector Shape
11
Describing Computation Using Data Flow Graphs
Synchronous DFG language expresses vector operations in a machine independent mannerDependencies explicitly stated to facilitate parallel executionDFG Node description
Input Operands – Reference to previous nodes rather than named registers (Data Dependence)Operation to be performedMinimum precision of its output values
Iteration-to-Iteration dependenceTunnel nodes – source and sink of data flow same
Order dependenceSequential execution (linear DFG) should be matched by any parallel execution
12
Scheduling and Binary Compatibility
Binary form of linear DFG executed by RSVPLinear DFG may not be best suited for direct execution on a particular RSVP implementationUniversal Fat Binaries (UFB) provided
More than one binary form of DFGDFG Compiler creates themLinked list with linear DFG appearing lastThe binary form that first executes will be used
13
Quant Programming ExampleCompresses video images through quantization
14
Quant Programming Example – Data Flow Graph
15
Architecture
16
ArchitectureInput and output VSUs
Minimum 1 and 3 respectivelyMaximum 64VSUs handle all issues related to loading/storing data
DFGNo more than 256 nodes“reach back” no more than 63 nodes
Accumulator and scalar/tunnel registersMinimum 2 and 16 respectivelyMaximum 64
SchedulerConverts DFG to machine dependent form
ControlImplements machine dependent form of DFG
Function unitsShould support all operations allowed by DFG
17
ImplementationFirst Implementation of RSVP
Low cost, low power solutionFabricated in TSMC 0.18um CMOS Technology
Tool setCompiler, assembler, linker for ARMCompiler for RSVP linear DFG
AreaComparable to ARM9 host processorIn effect, doubling area resulted in greater than 2x increase in performance
Power DissipationComparable to ARM9 host processorClock gating done to reduce system power dissipation
18
First Implementation
19
Reconfigurability
Fabric and QueueReconfigurable interconnect20 links, each link can transfer 16-bits of data from its source to its destinationLinks can be reconfigured every cycle
Function units64 bits wide, sliced on 16-bit boundariesFully pipelined with result latching
20
Results – Kernel Speedups
Magnitude of speedup results from large effective Instruction width of RSVP ImplementationIssue Rate: ARM9 – 0.78 IPC, RSVP – 9 IPC
WWSIMD – 4 Instructions per cycle
21
Results – Application Speedups
22
Conclusion
RSVP architecture vector coprocessor/accelerator architectureimproves general purpose CPU performance on streaming data applicationsImproves time to market because of ease of programmabilitySpeedups for kernels and applications range from 2 to over 20 times that of a host processor alone
23
Thank you
Questions ???