The Reconfigurable Streaming Vector Processor...

The Reconfigurable Streaming Vector Processor (RSVPTM)

Silviu Ciricescu, Ray Essick, Brian Lucas, Phil May,Kent Moat, Jim Norris, Michael Schuette, and Ali Saidi†

Motorola Labs, Motorola, Schaumburg, ILThe Mitre Corporation, Bedord, MA†

Introduction

CoprocessorA special-purpose processing unit that assists the CPU in performing certain types of operations

Vector ArchitectureTraditionally used in supercomputersHeavily pipelined architecture that operates on vectors and matricesVector processors are machines built primarily to handle large scientific and engineering calculations

Introduction (Cont.)Multimedia Functions

Computationally intensiveData streamingApplications: Image/Video capturing, Handwriting recognition, Voice recognitionMany more portable/embedded applications

Streaming DataData is produced/acquired as a stream of elementsRelevant for a short period of timeUndergoes same set of computationHigh degree of spatial localityRelatively poor temporal localityData access patterns allow prefetching of data ahead of computation

Introduction to RSVP

RSVP is a vector coprocessor architectureAccelerates streaming data operationsTargets multimedia functionsRSVP Programming model

Streaming data description – Vector shape and sizeComputation description – Data Flow GraphsAbove descriptions are intuitive and independent of each otherMachine Independent

Motivation for RSVPVector Processing Architectures

In recent years, targeting multimedia rather than SupercomputerMMX extensions to Intel IA32 architectureAltiVec extensions to PowerPC architecture“Wide Word SIMD”

RISC like Load-store Programming modelWide fixed-sized vector registersLevel of abstraction low

RSVP better Architecture than “Wide Word SIMD”

Performance Gap between memory and processing

Streaming Vector Data Processing using RSVP Architecture

A coprocessor to operate synchronously with an existing host CPUA programming model that separates the description of data from computation

Data described by the location and shape in memoryComputation described by Data Flow Graph

Streaming Computation ModelDecoupled Operand FetchDeep Pipelining (function unit chaining)SIMD Processing

Streaming Computation ModelRSVP architecture utilizes a stream-oriented approach to vector ProcessingDecouples and overlaps data access and data processingDecoupled Operand Fetch

Vector stream units – independent load/store unitVSUs communicate with processing unit through interlocked FIFO queues

Deep PipeliningProcessing unit split into N-stage pipeline by chaining multiple function units togetherAllows higher clock frequencies and increased resource utilization

SIMD processingMultiple taps on each of the VSUsParallelism limited by resource limitation, algorithm characteristics, vector size

Describing Data

Vector processing handled by Vector Streaming Unit (VSU)Vector description consists of pointer to first element in each vector in memory and description of vector shapeShape of vector consists of three scalar values: Stride, Skip, Span

Vector Shape

Describing Computation Using Data Flow Graphs

Synchronous DFG language expresses vector operations in a machine independent mannerDependencies explicitly stated to facilitate parallel executionDFG Node description

Input Operands – Reference to previous nodes rather than named registers (Data Dependence)Operation to be performedMinimum precision of its output values

Iteration-to-Iteration dependenceTunnel nodes – source and sink of data flow same

Order dependenceSequential execution (linear DFG) should be matched by any parallel execution

Scheduling and Binary Compatibility

Binary form of linear DFG executed by RSVPLinear DFG may not be best suited for direct execution on a particular RSVP implementationUniversal Fat Binaries (UFB) provided

More than one binary form of DFGDFG Compiler creates themLinked list with linear DFG appearing lastThe binary form that first executes will be used

Quant Programming ExampleCompresses video images through quantization

Quant Programming Example – Data Flow Graph

Architecture

ArchitectureInput and output VSUs

Minimum 1 and 3 respectivelyMaximum 64VSUs handle all issues related to loading/storing data

DFGNo more than 256 nodes“reach back” no more than 63 nodes

Accumulator and scalar/tunnel registersMinimum 2 and 16 respectivelyMaximum 64

SchedulerConverts DFG to machine dependent form

ControlImplements machine dependent form of DFG

Function unitsShould support all operations allowed by DFG

ImplementationFirst Implementation of RSVP

Low cost, low power solutionFabricated in TSMC 0.18um CMOS Technology

Tool setCompiler, assembler, linker for ARMCompiler for RSVP linear DFG

AreaComparable to ARM9 host processorIn effect, doubling area resulted in greater than 2x increase in performance

Power DissipationComparable to ARM9 host processorClock gating done to reduce system power dissipation

First Implementation

Reconfigurability

Fabric and QueueReconfigurable interconnect20 links, each link can transfer 16-bits of data from its source to its destinationLinks can be reconfigured every cycle

Function units64 bits wide, sliced on 16-bit boundariesFully pipelined with result latching

Results – Kernel Speedups

Magnitude of speedup results from large effective Instruction width of RSVP ImplementationIssue Rate: ARM9 – 0.78 IPC, RSVP – 9 IPC

WWSIMD – 4 Instructions per cycle

Results – Application Speedups

Conclusion

RSVP architecture vector coprocessor/accelerator architectureimproves general purpose CPU performance on streaming data applicationsImproves time to market because of ease of programmabilitySpeedups for kernels and applications range from 2 to over 20 times that of a host processor alone

Thank you

Questions ???

The Reconfigurable Streaming Vector Processor...

Documents

Transcript of The Reconfigurable Streaming Vector Processor...

Extron SMP 111 User GuideUser Guide SMP 111 Streaming AV Product Streaming Media Processor 68-2850-01 Rev. C 11 17

Reconfigurable VLSI Communication Processor Architectures · Opportunities for Reconfigurable Accelerators for Communication Systems XCommonality of Algorithms, e.g. Viterbi Decoder,

Investigation of a Reconfigurable Processor for a SLAM System

CYPRIS An Application Specific Reconfigurable Processor Michael Stebnisky Lockheed Martin Advanced Technology Laboratories.

Amalgam: a Reconfigurable Processor for Future Fabrication Processes Nicholas P. Carter University of Illinois at Urbana-Champaign.

A fully reconfigurable photonic integrated signal processorjpyao/mprg/reprints/NPHOT-201… · · 2015-12-02A fully reconfigurable photonic integrated signal processor ... report

The SKA Project - The World's Largest Streaming Data Processor

High Performance, Low Power Reconfigurable Processor for Embedded Systems

1 Chapter 7 Processor-in-Memory, Reconfigurable, and Asynchronous Processors.

A VLIW processor with reconfigurable instruction set for ...users.ece.utexas.edu/~adnan/comm-08/embedded... · A VLIW Processor With Reconfigurable Instruction Set for Embedded Applications

Reconfigurable Silicon Photonic Processor Based on SCOW ... · Reconﬁgurable Silicon Photonic Processor Based on SCOW Resonant Structures Volume 11, Number 6, December 2019 Liangjun

Paper Review: XiSystem - A Reconfigurable Processor and System David Hermann ENGG*6090: Reconfigurable Computing Systems University of Guelph.

A Reconfigurable Streaming Processor for Real-Time Low ...

An Ultra Low Power Reconfigurable Task Processor for Space

Low Overhead Fault-Tolerance Technique for Dynamically Reconfigurable Softcore Processor

Amalgam: a Reconfigurable Processor for Future Fabrication Processes

Graph Streaming Processor - Hot Chips€¢Graph Streaming Processor –Intermediate buffers are small (~1% of the original size) –Data is more easily cached –Benefits of significantly

Exploiting Processor Heterogeneity Through Reconfigurable ...

Swinger: Processor Relocation on Dynamically Reconfigurable FPGAs

DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for ... · PDF file14.2: DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks © 2017 IEEE