A Low-Energy Heterogeneous Reconfigurable DSP IC


1 INTRODUCTION

The advent of the third generation of wireless

applications creates a need for digital signal processing

platforms that simultaneously display high computational

performance, ultra low-energy consumption and a high

degree of flexibility and adaptability. The flexibility and

adaptability are a necessity in the presence of multiple and evolving standards, and help to increase quality-of-service in the presence of dynamically evolving channel

conditions. (Re)configurable processors offer the

advantage of combining flexibility and low-energy [1][2] by providing a direct spatial mapping from algorithm to

architecture, hence reducing the control overhead

typically associated with instruction-set processors.

A low power reconfigurable DSP architecture template

(Pleiades) which encapsulates heterogeneous computing

elements has been proposed [2][3] to solve the problem of

meeting the requirement of flexibility, speed and energy

efficiency at the same time (Figure 1). The Pleiades

architecture style echoes the current trend in system-on-

a-chip design which includes a wide variety of

macromodules including core processors, DSPs,

programmable logic, embedded memory, and custom

modules [4]. The heterogeneous architecture style of

Pleiades allows better algorithm-architecture matching, giving better power/performance than many

heterogeneous reconfigurable processors which

incorporate only a microprocessor and fine-grained

FPGAs.

In this paper, we describe the design process and

implementation results of an instance of the Pleiades

architecture, Maia, targeting the speech coding domain.

In section 2, we give a description of the Pleiades

architecture template and the model of reconfiguration.

In sections 3 and 4, the methodology used to map algorithms to an architecture is given and the

implementation of the architecture is discussed. Section 5

reports the testing strategy for the design and results of

the final chip.

2 HETEROGENEOUS RECONFIGURABLE DSP

Reconfigurable architectures [5][6][7] have received

significant attention in recent years in both general-purpose computing and embedded processing. Mixing a processor with fine-grain reconfigurable elements

has been the main approach attempted by the above

systems. The Pleiades reconfigurable architecture

achieves low energy consumption by providing a

computational platform with mixed programming

granularity (i.e. microprocessor, reconfigurable dataflow,

FPGA) [8]. In this section, we explain our architecture in

concept, and provide a description of the reconfiguration

and computation models used in our design methodology.

2.1 Architecture Template

The Pleiades architecture (Figure 2) is composed of a

programmable microprocessor and heterogeneous

computing elements (referred to as satellites in the rest of

the paper). The architecture template fixes the

communication primitives between the microprocessor and the satellites, and between the satellites themselves. For each algorithm domain (communication, speech coding, video coding), an architecture instance can be created (with known satellite types and numbers).

Figure 1. Energy and Flexibility Spectrum for Different Architectures (axes: energy efficiency and flexibility; architectures shown: µP, DSP, ASIC, Pleiades)

Figure 2. Heterogeneous Architecture Template (a microprocessor connected through a reconfiguration bus to satellites SAT1, SAT2 and SAT3, each with a configuration register; satellite modules, built as ASIC or FPGA blocks, exchange data, flags and handshake signals over the reconfigurable interconnect)

To reduce overhead in terms of instruction fetch and global control, the architecture utilizes distributed control

and configuration. To achieve distributed control, each

satellite is equipped with an interface that enables it to

exchange data streams with other satellites efficiently,

without the help of a global controller. The

communication mechanism between each satellite is

dataflow driven [9].

The control means available to the programmer are basic

satellite configurations to specify the kind of operation to

be performed by the satellite, and configurations for the

reconfigurable interconnect to build a cluster of satellites.

2.2 Model of Computation and Reconfiguration

While multiple application threads can run on an instance of the Pleiades architecture template, compiling a single thread down to the reconfigurable components is the core task on which higher-level scheduling tools for multiple threads build. Therefore,

the design methodology described later in the paper aims

to support a smooth transition from a single thread

algorithm to an optimized implementation on Pleiades.

Figure 3 illustrates the flow of computation supported by this software methodology. As shown in the figure, a

sequential thread is first initialized on the

microprocessor. After configuration codes are executed

on the processor, the control is transferred to the Pleiades

reconfigurable satellites (the “split” point in Figure 3), and the computation returns to the processor after all satellite operations are finished (the “join” point). Multiple split points exist within a sequential

thread and the satellites and connections have to be

reconfigured for each split point.
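The configure, split and join sequence described above can be sketched in a few lines of Python. This is only an illustration of the control flow: the `Kernel` and `Trace` types, the configuration dictionaries and the two example kernels are hypothetical, and each satellite cluster is stood in for by an ordinary function call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Kernel:
    """One data-flow kernel: a configuration plus the work it performs."""
    name: str
    config: Dict[str, str]           # hypothetical satellite/interconnect settings
    run: Callable[[List[int]], int]  # stands in for the satellite cluster


@dataclass
class Trace:
    events: List[str] = field(default_factory=list)


def run_thread(kernels: List[Kernel], data: List[int], trace: Trace) -> List[int]:
    """Run a sequential thread that hands each kernel to the satellites."""
    results = []
    for k in kernels:
        # Processor side: write the configuration for this split point.
        trace.events.append(f"configure:{k.name}")
        # Split: control passes to the satellites.
        trace.events.append(f"split:{k.name}")
        value = k.run(data)
        # Join: the satellites are done and control returns to the processor.
        trace.events.append(f"join:{k.name}")
        results.append(value)
    return results


trace = Trace()
kernels = [
    Kernel("dot_product", {"MAC": "multiply-accumulate"},
           lambda xs: sum(x * x for x in xs)),
    Kernel("vector_sum", {"ALU": "add"}, lambda xs: sum(xs)),
]
results = run_thread(kernels, [1, 2, 3], trace)
print(results)
print(trace.events)
```

Each kernel contributes one configure, one split and one join event, which mirrors the statement that satellites and connections are reconfigured for every split point.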

The main idea behind reconfigurable computing that is

advocated by the Pleiades system is to build a computational engine through spatially-programmed connections

of processing elements (satellites). The interconnect

model that needs to support such a system is depicted in

Figure 4. On the time axis, t0, t1 and t2 indicate the times of reconfiguration. The bars (C1, C2, etc.) between two

reconfiguration times represent a set of inter-satellite

connections that has to be realized simultaneously by the

reconfigurable interconnect.
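One way to read this model is as a schedule that maps each interval between two reconfiguration times to the set of connections that must coexist in it. The sketch below assumes that reading; the satellite names and the two-interval schedule are made up for illustration and are not taken from Figure 4.

```python
from typing import Dict, Set, Tuple

Connection = Tuple[str, str]  # (source satellite, destination satellite)


def peak_simultaneous_connections(schedule: Dict[str, Set[Connection]]) -> int:
    """Largest number of connections the interconnect must hold at once."""
    return max((len(conns) for conns in schedule.values()), default=0)


def changed_connections(before: Set[Connection],
                        after: Set[Connection]) -> Set[Connection]:
    """Connections set up or torn down at a reconfiguration time."""
    return before ^ after  # symmetric difference


# Hypothetical schedule: interval label -> connections active in that interval.
schedule = {
    "t0-t1": {("AG1", "MEM1"), ("MEM1", "MAC"), ("AG2", "MEM2"), ("MEM2", "MAC")},
    "t1-t2": {("AG1", "MEM1"), ("MEM1", "ALU"), ("ALU", "MEM3")},
}
peak = peak_simultaneous_connections(schedule)
delta = changed_connections(schedule["t0-t1"], schedule["t1-t2"])
print(peak, sorted(delta))
```

The peak value bounds how many connections the interconnect has to realize at the same time, while the changed set gives a rough measure of the reconfiguration work at t1.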

3 OVERVIEW OF THE ARCHITECTURE DESIGN METHODOLOGY

There are two key issues to be resolved in order to make

the methodology practical to the designers. Firstly, the

architecture combines two very distinct models of

computation, control-driven computation on the general-

purpose microprocessor and data-driven computing on

the clusters of satellites. Therefore, the goal of the

architectural exploration process is to partition the

application over these two computing paradigms so that

performance and energy dissipation constraints are met

(during the compilation process). Secondly,

optimizations related to reconfigurability have to be

supported at both the architecture design and compilation stages. Both of these issues require careful

modeling of the algorithm and the underlying

heterogeneous architectures.

The basic flow of the design exploration methodology

[10] is presented in Figure 5. After the introduction of

terminology, a short overview of the overall flow is given in section 3.1.1. A more detailed description of this methodology and the tools developed can be found in [11].

Figure 3. Flow of Computation on Pleiades (an application with threads 1 to 3; each thread alternates between execution on the programmable processor and execution on the satellites, with “split” and “join” points marking the transitions)

Figure 4. Model of Reconfiguration (architecture instance: 3 Address Generators, 3 Memories, 1 MAC/MUL and 1 ALU; the interconnect pattern is plotted against time, with connection sets such as C1 to C5 and C’1 to C’8 active between the reconfiguration times t0, t1 and t2)

Definition of Kernel - A computationally intensive part of

the algorithm that often resides in nested loops.

3.1.1 Basic Methodology Flow

The methodology flow takes DSP or communication

algorithms specified in a high-level language (e.g. C) as

input. The initiation of the design process requires the

establishment of a first-order baseline model of the

algorithm complexity and bottlenecks. Such a model

allows for the selection and execution of architecture-

independent optimizations (stage 1). As architectural

choices have yet to be made, this model assumes the

presence of a “virtual architecture” with some generic

operator costs attached to it. Optimizations at this stage

only address either win-only situations or order-of-magnitude improvements, so that absolute accuracy is not that important.

After a satisfactory algorithm formulation is obtained, the

architectural mapping and partitioning process can be

entered. To be meaningful, the partitioning process

should be based on realistic bottom-up information

regarding the cost of implementing functions and

operations on the different architectural choices. Our

design-exploration methodology relies extensively on the

availability of power-delay models for all components in

its architectural library (stage 2). The estimation methods

employed in each of these models vary depending upon

the type of the module and the desired accuracy. While

the absolute accuracy of these characterizations is not

crucial, it is important that bounds on the prediction

accuracy are known. Only “improvements” that exceed the noise level of the estimations are accepted.

The architecture partitioning and mapping process is

started by establishing an initial solution. Given the

implementation simplicity of a pure software

implementation, we have adopted a “software-centric”

approach that assumes that the whole algorithm is

initially mapped onto the microprocessor (stage 3). This establishes how closely such a solution adheres to the

design specifications and helps to establish the design

bottlenecks. A rank ordering of the dominant compute

kernels is established. In stage 4, dominant kernels are

evaluated in order of importance and mapped to satellites

for better power and performance [12]. If a hardware

implementation is deemed worthwhile, a repartitioning of

the design is established.
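Stages 3 and 4 can be pictured as a rank-then-remap loop. The sketch below is a minimal illustration under stated assumptions: the energy numbers are invented, in arbitrary units, and the decision rule (move a kernel only when its satellite estimate is lower than its software estimate) is a simplification of the evaluation described above.

```python
from typing import Dict, List, Tuple


def rank_kernels(software_energy: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order kernels by their energy when everything runs on the processor."""
    return sorted(software_energy.items(), key=lambda kv: kv[1], reverse=True)


def partition(software_energy: Dict[str, float],
              satellite_energy: Dict[str, float]) -> Dict[str, str]:
    """Move a kernel to satellites only if its estimated energy goes down."""
    mapping = {}
    for name, sw in rank_kernels(software_energy):
        hw = satellite_energy.get(name)  # None when no satellite mapping exists
        if hw is not None and hw < sw:
            mapping[name] = "satellites"
        else:
            mapping[name] = "microprocessor"
    return mapping


# Hypothetical estimates in arbitrary units (not taken from the paper).
software_energy = {"dot_product": 50.0, "fir_filter": 12.0, "program_control": 5.0}
satellite_energy = {"dot_product": 4.0, "fir_filter": 1.5}

ranking = rank_kernels(software_energy)
mapping = partition(software_energy, satellite_energy)
print(ranking)
print(mapping)
```

Anything without a satellite estimate, such as program control in this example, stays on the microprocessor.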

After all costly kernels are mapped to accelerators, a final

partition of the algorithm across different architectures is

obtained (stage 5) and memory assignment and allocation

is performed to minimize memory transfers. While the rest of the algorithm remains in a high-level language, the

portions of the algorithm to be implemented by satellites

are specified in an intermediate form that is capable of

modeling the structure of the reconfigurable satellite

operations (i.e. as a netlist). Based on this conceptual

netlist, implementation optimizations (stage 6) [13] are

invoked to choose a good reconfigurable interconnect

architecture (during architecture design path) and to

generate efficient configuration and interface code

(during compilation and test vector generation path) [14].

Figure 5. The Software Methodology Flow (inputs: applications, architecture specification, timing and power constraints; stages 1 to 6: algorithm refinement, characterization of microprocessor(s) and satellites into PDA models, mapping to core and kernel ranking, mapping to accelerators, partitioning, and implementation optimization covering interconnect optimization, reconfigurable hardware and compilation/code generation)



small local instruction memory, and can be programmed

to support various types of addressing patterns and nested

loops with loop counters and stride counters. It behaves

as the local controller of data-flow kernels by initiating

the data-flow threads, and by signaling the end of the

data-flow threads to the ARM8.
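The nested-loop addressing described here can be illustrated with a short generator. This is a behavioral sketch only: the `(count, stride)` loop description and the example pattern are assumptions, not the programming model of the actual address generator.

```python
from typing import Iterator, List, Tuple


def generate_addresses(base: int, loops: List[Tuple[int, int]]) -> Iterator[int]:
    """Yield addresses for nested loops given as (count, stride), outermost first."""
    if not loops:
        yield base
        return
    (count, stride), inner = loops[0], loops[1:]
    for i in range(count):
        # Each loop level adds its own stride on top of the enclosing levels.
        yield from generate_addresses(base + i * stride, inner)


# Hypothetical pattern: a 2x3 block of a row-major matrix with 8 words per row.
addresses = list(generate_addresses(base=16, loops=[(2, 8), (3, 1)]))
print(addresses)
```

The outer loop steps by one matrix row (stride 8) and the inner loop by one word, so the generator walks the block row by row.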

4.1.3 Embedded FPGA

Commercial FPGAs are often notorious for their energy

consumption and most of them cannot be embedded in a

system-on-a-chip. Therefore, we make use of an in-house

low-energy embedded FPGA [15].

The embedded FPGA contains a 4×8 array of 5-input 3-

output CLBs, optimized for arithmetic operations and

data-flow control functions. Its energy-efficiency has

been measured to be 70 times higher than equivalent commercial solutions. This energy efficient FPGA design

is realized by combining both architectural and circuit

level modifications which are outlined below.

Logic block The logic block is designed to improve

the interconnect utilization, and hence the interconnect

energy. It is made up of a cluster of 3-input look-up tables. It can be used to implement 5-input random logic, or 2-bit arithmetic operations.

Low-swing circuit Low-swing interconnect circuit

improves the energy by a factor of 2 as compared to a full

swing circuit. The logic blocks operate on 1.5V while the

low-swing signal lines have a 0.8V swing.

Interconnect Architecture The interconnect is made

up of 3 levels of connectivity. Each level is targeted at

providing low energy connections for specific path

lengths. The Level0 structure is targeted at connections

between nearest neighbors. Each logic block can connect

to 8 of its immediate neighbors. The Level1 structure is

the traditional symmetric mesh architecture, and is good

for intermediate length wires. The Level2 structure is

used for implementing connections that span a

significant fraction of the chip. The connectivity of each

of these structures has been optimized using architecture

evaluation tools to obtain energy efficiency.

Clock Distribution More than 80% of the clock

energy is dissipated in the clock distribution network.

Double-edge-triggered flip-flops are used to reduce the clock activity by a factor of 2, giving a proportional reduction in energy. The clock distribution network also

uses the low-swing technique for energy reduction.

4.2 Communication Interface Description

4.2.1 Inter-satellite Communication Interface

The data-flow driven synchronization between the

processing elements employs a 2-phase self-timed

handshaking scheme with REQUEST and

ACKNOWLEDGE signals (Figure 7a), realized in a

globally asynchronous locally synchronous

implementation fashion. This approach not only reduces

power consumption by ensuring that a module is only

activated when data is ready, but also allows various modules to operate at different and dynamically varying

rates. Data links combine 16-bit fixed-width data words

with 2-bit control tokens that serve as tags for different

data structures (scalar, vector, or matrix) that are

supported by the network (Figure 7b). Each module

includes a network interface controller to coordinate

communication and synchronization.
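The pairing of a 16-bit data word with a 2-bit control token can be sketched as follows. The token values and the 18-bit packing are hypothetical, since the text does not give the actual encoding; the point is only to show how a tag on the last word can delineate a vector in a data stream.

```python
from enum import IntEnum
from typing import List, Tuple

DATA_BITS = 16
DATA_MASK = (1 << DATA_BITS) - 1


class Token(IntEnum):
    """Hypothetical 2-bit tag values; the paper does not give the encoding."""
    DATA = 0           # regular data word
    END_OF_VECTOR = 1  # last word of a vector
    END_OF_MATRIX = 2  # last word of a matrix


def pack(word: int, token: Token) -> int:
    """Combine a 16-bit data word and a 2-bit token into one 18-bit link value."""
    return (int(token) << DATA_BITS) | (word & DATA_MASK)


def unpack(link_value: int) -> Tuple[int, Token]:
    """Split an 18-bit link value back into its data word and token."""
    return link_value & DATA_MASK, Token(link_value >> DATA_BITS)


def tag_vector(words: List[int]) -> List[int]:
    """Tag every word as DATA except the last, which closes the vector."""
    tags = [Token.DATA] * (len(words) - 1) + [Token.END_OF_VECTOR]
    return [pack(w, t) for w, t in zip(words, tags)]


stream = tag_vector([10, 20, 30])
decoded = [unpack(v) for v in stream]
print(decoded)
```

A receiving module such as a MAC could use the end-of-vector tag to know when to emit its accumulated result, without any global controller counting elements for it.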

4.2.2 Communication Interface between the Microprocessor and Satellites

This interface control unit coordinates synchronization

and communication between the synchronous ARM8 core

and the asynchronous reconfigurable data-paths, most

importantly helping the core perform the reconfiguration

of satellites by mapping all the configuration memories to

the ARM8 memory space.

Figure 7. Data-flow driven globally asynchronous locally synchronous communication protocol. (a) Globally asynchronous, locally synchronous signaling: each processor module receives In and Reqin, derives its local Clk and Enable/Done signals from the handshake, and drives Out and Reqout into the reconfigurable network. (b) Control tokens differentiate and delineate data streams and data structures (scalar, vector, matrix): regular data is distinguished from data associated with an end-of-vector token.



The interface logic controls the strobe generation for

configuration reads/writes, handshakes, network reset, and start requests for the address generators and IO ports. The acknowledge signals for the address generators and IO ports are used to detect the end of a kernel, at which point the ARM8 core is interrupted. Interrupt mask registers and control registers are used to synchronize the ARM8 with the asynchronous satellite array.

The system supports two modes of operation: TEST and

SYSTEM modes. As part of the test strategy, the TEST

mode allows us to bypass the ARM8 processor and

execute individual kernels through the interface. In the

SYSTEM mode, instead of an on-chip cache for the

embedded ARM8, an external SRAM (with zero bus

turnaround) serves as the memory for the processor. In

order to meet the 40MHz performance for the

application, the off-chip memory is clocked twice as fast

as the core. The interface is designed to meet this

bandwidth.

4.3 Reconfigurable Interconnect Architecture

Keeping the energy of the reconfigurable interconnect

network as low as possible while still meeting the

flexibility requirement is crucial to the success of our

approach of heterogeneous reconfigurable architecture.

This is realized by a combination of architecture and

circuit optimizations.

4.3.1 Hierarchical Interconnect Network Architecture

An energy-efficient architecture must take advantage of the locality and regularity of computation. Exploiting locality by identifying naturally isolated clusters of operations can guide hardware partitioning, minimizing global buses and thus reducing the interconnect power. Although the underlying system is

heterogeneous, the DSP algorithms usually have

inherently repetitive computation patterns. Partitioning

the hardware so as to preserve such regularity leads to a simpler interconnect structure with reduced fan-ins and fan-outs. For reconfigurable architectures in particular, more regular interconnect architectures achieve better routability and less reconfiguration overhead. There is a trade-off between flexibility and energy efficiency. For

instance, the crossbar network has the most flexibility,

but also the least energy efficiency. In stage 6 of the

design methodology, cross-bar, mesh and hierarchical

mesh structures are evaluated, and a 2-level hierarchical mesh was chosen for this implementation.

The implemented hierarchical mesh interconnect network provides high energy efficiency with the right degree of flexibility within the application domain of interest. Several clusters of tightly connected modules

are formed based on the communication locality. Each

cluster has a local mesh with 2 buses per channel, and a

universal switchbox at every intersection point (Figure

6). Global interconnections are supported by a second-level, larger-granularity mesh (implemented on the higher

metal layers) with 2 buses per channel and hierarchical

switchboxes, located at the key connection points. The

hierarchical switchbox (Figure 6) contains a universal

switchbox for each mesh-level, as well as a number of

cross-level interconnect switches. This hierarchical

network architecture requires only a limited number of

buses to achieve sufficient connection flexibility for our

target applications, and cuts the interconnect energy cost

by a factor of 7 compared to a straightforward crossbar

network implementation.

4.3.2 Low-swing Interconnect Interface Circuits

Communication energy is further reduced by employing a

low-swing (0.4V) pseudo-differential signaling scheme

(Figure 8). The wire capacitance loads are also reduced

by simplifying the switch network with NMOS-only

switches. The circuit employs an NMOS-only push-pull

driver with a very low voltage supply. The receiver is a

clocked sense amplifier with low input-offset and good

sensitivity followed by a static flip-flop. It contains

double pairs of input transistors, with the gates of P1 and P3 connected to d, while the gates of P4 and P2 are biased at GND and REF respectively. Figure 8 shows the signaling waveforms. Based on our asynchronous clocking protocol, the clock signal is generated from the handshaking signals. The low-swing signaling reduces

the interconnect energy by a factor of 3.4 compared to a

full-swing CMOS implementation [17].
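As a rough cross-check of this figure, a first-order estimate can be computed under two assumptions that the text does not state: the energy drawn per transition is C x V_swing x V_supply, and the full-swing reference runs from the 1 V main supply while the low-swing driver is supplied at 0.4 V. The wire capacitance below is a hypothetical placeholder that cancels out of the ratio.

```python
def switching_energy(capacitance_f: float, swing_v: float, supply_v: float) -> float:
    """Energy drawn from the supply to charge a wire by swing_v: C * V_swing * V_supply."""
    return capacitance_f * swing_v * supply_v


WIRE_CAP_F = 1e-12  # hypothetical 1 pF wire; cancels out in the ratio

# Assumed reference: full 1 V swing from a 1 V supply.
full_swing = switching_energy(WIRE_CAP_F, swing_v=1.0, supply_v=1.0)
# Assumed low-swing case: 0.4 V swing from a 0.4 V driver supply.
low_swing = switching_energy(WIRE_CAP_F, swing_v=0.4, supply_v=0.4)

ideal_ratio = full_swing / low_swing
print(f"ideal wire-energy ratio: {ideal_ratio:.2f}")
```

Under these assumptions the wire energy alone would shrink by roughly 6.25 times. The reported factor of 3.4 is smaller, which would be consistent with overheads that this simple model leaves out, such as the receiver and its clocking.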

5 RESULTS AND STATISTICS

Figure 8: Pseudo-differential low-swing interconnect circuit (schematic with transistors P1 to P7 and N1 to N4, internal nodes n1, n2, A and B, and signals in, d, out, clk and REF; the waveforms show a 0.4V swing on the interconnect against the 1V logic levels)



Maia is a 210-pin chip that contains 1.2 million transistors and measures 5.2×6.7 mm² in a 0.25 µm 6-metal CMOS technology. Figure 9 shows the die photo of the Maia chip and Table 1 summarizes the implementation statistics of the chip.

Technology                   0.25 µm 6-level metal CMOS
Main Supply Voltage          1 V
Additional Voltages          0.4 V, 1.5 V
Die Size                     5.2 mm x 6.7 mm
Transistor Count             1.2 million transistors
Average Cycle Speed          40 MHz
Average Power Dissipation    1.5 - 2 mW

Table 1: Chip Characteristics

Hardware module         Pipeline speed (ns)   Energy consumption per operation (pJ)   Area (mm²)
MAC                     24                    21                                      0.25
ALU                     20                    8                                       0.09
Memory (1K x 16)        14                    8                                       0.32
Memory (512 x 16)       11                    7                                       0.16
Address generator       20                    6                                       0.12
Interconnect network    10                    1*                                      NA
FPGA                    25                    18**                                    2.76

Table 2: Performances of hardware modules

*This number is the average energy consumption per connection

**This number is the average energy consumption across various arithmetic functions

Table 2 shows the performance of different chip

components (based on a per-block analysis) from

PowerMill simulation.
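The per-module figures in Table 2 can be combined into a back-of-the-envelope estimate for a kernel. The sketch below copies the pipeline speeds and energies from the table, but the module mix for one dot-product iteration (one MAC, two 512 x 16 memory reads, two address generations and four interconnect connections) is an assumption made here for illustration.

```python
from typing import Dict, Tuple

# Figures copied from Table 2: (pipeline speed in ns, energy per operation in pJ).
MODULES: Dict[str, Tuple[int, int]] = {
    "MAC": (24, 21),
    "ALU": (20, 8),
    "MEM_1Kx16": (14, 8),
    "MEM_512x16": (11, 7),
    "AG": (20, 6),
    "INTERCONNECT": (10, 1),
}


def kernel_cost(usage: Dict[str, int]) -> Tuple[int, int]:
    """Return (energy in pJ per iteration, slowest stage in ns) for a module mix."""
    energy = sum(MODULES[m][1] * n for m, n in usage.items())
    slowest = max(MODULES[m][0] for m in usage)
    return energy, slowest


# Assumed mix for one dot-product iteration (not stated in the paper).
dot_product = {"MAC": 1, "MEM_512x16": 2, "AG": 2, "INTERCONNECT": 4}
energy_pj, cycle_ns = kernel_cost(dot_product)
print(energy_pj, cycle_ns)
```

With this mix the estimate is 51 pJ per iteration, and the 24 ns MAC is the slowest stage, close to the 25 ns period of a 40 MHz clock.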

Figure 10 (see the end of the paper, after references)

illustrates the signals that are available at the I/O pins.

During the TEST mode, all satellites and the

reconfigurable interconnect can be configured by writing

to Taddr and Tdata pins (to the ConfigAdd and

ConfigData buses) and the result of the computation can

be read on the Tdata and FIQ pins (from ReadData and

ACK buses). In addition, simple programs can also be

fed to ARM8 via Tdata pins to test satellite configuration

reading and writing. The current test set-up supports the

test mode described above and a board to verify the

SYSTEM mode is being designed. The HP 16702A logic

analysis system was used for generating the test vectors

(derived from Timemill simulations) for the TEST mode.

Pattern acquisition was used for verifying the results of

the computations after detecting end of kernel using an

external interrupt signal.

Energy and performance of all kernels are tested in the

TEST mode. Based on this information, the estimated

energy dissipation of the processor when programmed for

a VCELP voice coder (with 1.8mW total power

consumption) is presented in Table 3, including a

breakdown of the energy over the major functions.

Dominant kernels are directly mapped onto hardware

satellites, and their run-time reconfiguration is performed by the ARM core. Therefore, the kernel energies presented in the table incorporate contributions from both satellite

and ARM8 configuration. The program control part of

the algorithm is completely mapped to the software. The

total energy efficiency is a factor of 8 better than the best

reported in literature [18].



Functionality                          Energy consumption (mJ) for 1 sec of VCELP speech processing
Kernels
  Dot product                          0.738
  FIR filter                           0.131
  IIR filter                           0.021
  Vector sum with scalar multiply      0.042
  Compute code                         0.011
  Covariance matrix compute            0.006
Program control                        0.838
Total                                  1.787

Table 3. VCELP energy breakdown
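The entries of Table 3 can be checked against the quoted total and the 1.8 mW figure with a few lines of arithmetic. The values below are copied from the table; only the variable names are introduced here.

```python
# Energy in mJ for 1 s of speech processing, copied from Table 3.
KERNELS_MJ = {
    "Dot product": 0.738,
    "FIR filter": 0.131,
    "IIR filter": 0.021,
    "Vector sum with scalar multiply": 0.042,
    "Compute code": 0.011,
    "Covariance matrix compute": 0.006,
}
PROGRAM_CONTROL_MJ = 0.838

kernel_total_mj = sum(KERNELS_MJ.values())
total_mj = kernel_total_mj + PROGRAM_CONTROL_MJ
average_power_mw = total_mj / 1.0  # mJ spent over 1 s of speech equals mW
control_share = PROGRAM_CONTROL_MJ / total_mj

print(f"kernels: {kernel_total_mj:.3f} mJ, total: {total_mj:.3f} mJ")
print(f"average power: {average_power_mw:.3f} mW, "
      f"program control share: {control_share:.1%}")
```

The rows add up to the stated total of 1.787 mJ, which corresponds to about 1.8 mW, and program control running in software accounts for a little under half of it.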

6 CONCLUSION

In this paper, Pleiades, a heterogeneous reconfigurable architecture template, is introduced and a design methodology to map algorithms to architectures is summarized. The details of the design and implementation of an instance of the Pleiades architecture are presented. The implementation echoes the

current trend in system-on-a-chip design which contains

embedded components of various flexibility and

reconfigurability (microprocessor, ASICs, FPGA). The

heterogeneity and reconfigurability of the architecture prove to be very energy efficient when compared to

state-of-the-art programmable processors.

Figure 9. Maia die photo (labeled blocks: ARM8 core, interface, FPGA, ALU, MAC, MEM, AGU and the interconnect network)



7 ACKNOWLEDGEMENTS

We would like to acknowledge DARPA’s support for the

Pleiades project (DABT-63-96-C-0026). The authors

would like to thank Seno Katsunori and Yuji Ichikawa

for their early work on the Pleiades prototype and

evaluation. We would like to acknowledge other members on the Maia design team.

8 REFERENCES

[1] M. Goel and N. R. Shanbhag, “Low-power

equalizers for 51.84 Mb/s very high-speed digital

subscriber loop [VDSL] modems”, Proceedings of

IEEE Workshop on Signal Processing Systems, Oct.

1998, Boston.

[2] A. Abnous and J. Rabaey, “Ultra-Low-Power

Domain-Specific Multimedia Processors”, Proceedings of the IEEE VLSI Signal Processing

Workshop, San Francisco, California, USA, October

1996.

[3] A. Abnous et al., “Evaluation of a Low-Power

Reconfigurable DSP Architecture”, Proceedings of

the Reconfigurable Architectures Workshop, Orlando,

Florida, USA, March 1998.

[4] J. Borel, “Technologies for multimedia systems on a

chip”, 1997 IEEE International Solid-State Circuits

Conference, pages 18-21.

[5] G. R. Goslin, “A Guide to Using Field Programmable Gate Arrays for Application-Specific

Digital Signal Processing Performance”, Proceedings

of SPIE, vol. 2914, pp. 321-331.

[6] J. Hauser and J. Wawrzynek. GARP: A MIPS

processor with a reconfigurable coprocessor. In J.

Arnold and K. L. Pocek, editors, Proceedings of

IEEE Workshop on FPGA for Custom Computing

Machines, Napa, CA, April 1997.

[7] T. Garverick et al, NAPA1000, http://www.national.com/appinfo/milaero/napa1000

[8] J. M. Rabaey, “Reconfigurable Computing: the Solution to Low Power Programmable DSP”, Proc. to

1997 ICASSP Conference, Munich, April 1997.

[9] M. Benes, “Design and Implementation of

Communication and Switching Techniques for the

Pleiades Family of Processors”, Master’s Thesis, UC

Berkeley, 1999.

[10] M. Wan, D. Lidsky, Y. Ichikawa and J. Rabaey. “An

Energy-Conscious Methodology for Early Exploration

of Heterogeneous DSPs”, Proceedings of CICC 1998.

[11] M. Wan, H. Zhang, V. George, M. Benes, A. Abnous and J. Rabaey, "Design Methodology of a

Low-Energy Reconfigurable Single-Chip DSP

System", Journal of VLSI Signal Processing 2000.

[12] M. Wan, H. Zhang, M. Benes and J. Rabaey, “A

Low-Power Reconfigurable Data-Driven DSP

System”, Proceedings of the SiPS99

[13] H. Zhang, M. Wan, V. George, J. Rabaey, “Interconnect Architecture Exploration for Low Energy Reconfigurable Single-Chip DSPs”, Proceedings of the WVLSI, Orlando, FL, USA, April 1999.

[14] S. Li, M. Wan and J. Rabaey, “Configuration Code Generation and Optimizations for Heterogeneous Reconfigurable DSPs”, Proceedings of SiPS, 1999.

[15] V. George, H. Zhang, J. Rabaey, “Low Energy

FPGA Design”, Proceedings of ISLPED 1999.

[16] T. Burd, T. Pering, A. Stratakos, R. Brodersen, “A

Dynamic Voltage-Scaled Microprocessor System”,

Proceedings of ISSCC 2000.

[17] Hui Zhang et al, “Low-Swing Interconnect Interface

Circuits”, Proceedings of ISLPED 1997.

[18] Wai Lee et al, “A 1V DSP for Wireless

Communication”, Digest of Technical Papers of ISSCC 97



Figure 10. Maia chip testing strategy (in TEST mode, a logic analyzer drives the Taddr<15:0>, Tdata<31:0> and control pins such as Test, TRwn, TClk and FIQ; in SYSTEM mode, the ARM8 core accesses off-chip SRAM through Addr<31:0> and Dq<31:0>; the interface connects the core and the IO pins to the satellites through the ConfigAdd, ConfigData, ReadData, Strobe, Start and ACKs buses)
