A Low-Energy Heterogeneous Reconfigurable DSP IC
8/8/2019 A Low-Energy Heterogeneous Reconfigurable DSP IC
http://slidepdf.com/reader/full/a-low-energy-heterogeneous-reconfigurable-dsp-ic 1/10
1 INTRODUCTION
The advent of the third generation of wireless
applications creates a need for digital signal processing
platforms that simultaneously display high computational
performance, ultra low-energy consumption and a high
degree of flexibility and adaptability. The flexibility and
adaptability is a necessity in the presence of multiple and
evolving standards, and helps to increase quality-of-
service in the presence of dynamically evolving channel
conditions. (Re)configurable processors offer the
advantage of combining flexibility and low energy [1][2]
by providing a direct spatial mapping from algorithm to
architecture, hence reducing the control overhead
typically associated with instruction-set processors.
A low power reconfigurable DSP architecture template
(Pleiades) which encapsulates heterogeneous computing
elements has been proposed [2][3] to solve the problem of
meeting the requirement of flexibility, speed and energy
efficiency at the same time (Figure. 1). The Pleiades
architecture style echoes the current trend in system-on-
a-chip design which includes a wide variety of
macromodules including core processors, DSPs,
programmable logic, embedded memory, and custom
modules [4]. The heterogeneous architecture style of
Pleiades allows better algorithm-architecture matching,
giving better power/performance than many
heterogeneous reconfigurable processors which
incorporate only a microprocessor and fine-grained
FPGAs.
In this paper, we describe the design process and
implementation results of an instance of the Pleiades
architecture, Maia, targeting the speech coding domain.
In Section 2, we give a description of the Pleiades
architecture template and the model of reconfiguration.
In Sections 3 and 4, the methodology used to map
algorithms to an architecture is given and the
implementation of the architecture is discussed. Section 5
reports the testing strategy for the design and results of
the final chip.
2 HETEROGENEOUS
RECONFIGURABLE DSP
Reconfigurable architectures [5][6][7] have received
significant attention in recent years in both general-purpose
computing and embedded processing.
Mixing a processor with fine-grain reconfigurable elements
has been the main approach attempted by the above
systems. The Pleiades reconfigurable architecture
achieves low energy consumption by providing a
computational platform with mixed programming
granularity (i.e. microprocessor, reconfigurable dataflow,
FPGA) [8]. In this section, we explain our architecture in
concept, and provide a description of the reconfiguration
and computation models used in our design methodology.
2.1 Architecture Template
The Pleiades architecture (Figure. 2) is composed of a
programmable microprocessor and heterogeneous
computing elements (referred to as satellites in the rest of
the paper). The architecture template fixes the
communication primitives between the microprocessor
and satellites and between each satellite. For each
algorithm domain (communication, speech coding, video
coding), an architecture instance can be created (with
known satellite types and numbers).
[Figure 1. Energy and Flexibility Spectrum for Different Architectures (axes: energy efficiency vs. flexibility; points: µP, DSP, ASIC, Pleiades)]
[Figure 2. Heterogeneous Architecture Template (microprocessor, reconfiguration bus, satellites SAT1 to SAT3 with configuration registers, reconfigurable interconnect)]
To reduce overhead in terms of instruction fetch and global
control, the architecture utilizes distributed control
and configuration. To achieve distributed control, each
satellite is equipped with an interface that enables it to
exchange data streams with other satellites efficiently,
without the help of a global controller. The
communication mechanism between each satellite is
dataflow driven [9].
The control means available to the programmer are basic
satellite configurations to specify the kind of operation to
be performed by the satellite, and configurations for the
reconfigurable interconnect to build a cluster of satellites.
2.2 Model of Computation and
Reconfiguration
While multiple application threads can run on an
instance of the Pleiades architecture template,
compiling a single thread down to the reconfigurable
components is the core task on which higher-level,
multi-threaded scheduling tools build. Therefore,
the design methodology described later in the paper aims
to support a smooth transition from a single-thread
algorithm to an optimized implementation on Pleiades.
Figure 3 illustrates the flow of computation supported by
this software methodology. As shown in the figure, a
sequential thread is first initialized on the
microprocessor. After the configuration code is executed
on the processor, control is transferred to the Pleiades
reconfigurable satellites (the “split” point in Figure 3)
and the computation returns to the processor
after all satellite operations are finished (the “join”
point). Multiple split points may exist within a sequential
thread, and the satellites and connections have to be
reconfigured for each split point.
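The split/join flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names Kernel, run_thread and the two example kernels are hypothetical and do not come from the Pleiades tools, and the satellite computation is modeled as an ordinary function call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Kernel:
    """One split/join region (hypothetical model, not the Pleiades tool format)."""
    name: str
    config: dict                               # assumed satellite/interconnect settings
    run_on_satellites: Callable[[list], list]  # stands in for the data-flow computation


def run_thread(kernels: List[Kernel], data: list, events: List[str]) -> list:
    """Run one sequential thread, reconfiguring before every split point."""
    for k in kernels:
        events.append(f"configure:{k.name}")  # processor writes the configuration
        events.append(f"split:{k.name}")      # control passes to the satellites
        data = k.run_on_satellites(data)
        events.append(f"join:{k.name}")       # control returns to the processor
    return data


kernels = [
    Kernel("scale", {"alu": "mul2"}, lambda xs: [2 * x for x in xs]),
    Kernel("sum", {"mac": "accumulate"}, lambda xs: [sum(xs)]),
]
events: List[str] = []
result = run_thread(kernels, [1, 2, 3], events)
print(result, events)
```

Each kernel gets its own configure/split/join triple, which mirrors the statement that satellites and connections are reconfigured for each split point.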
The main idea behind reconfigurable computing that is
advocated by the Pleiades system is to build a computational
engine through spatially-programmed connections
of processing elements (satellites). The interconnect
model that needs to support such a system is depicted in
Figure 4. On the time axis, t0, t1 and t2 indicate the times
of reconfiguration. The bars (C1, C2, etc.) between two
reconfiguration times represent a set of inter-satellite
connections that has to be realized simultaneously by the
reconfigurable interconnect.
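One way to picture this model is as a sequence of connection sets, one per interval between reconfiguration times. The sketch below is an assumption-laden illustration: the port names (AG0, MEM0, MAC.a and so on) and the single-driver conflict rule are hypothetical, chosen only to show how the connections that change at a reconfiguration time can be derived.

```python
def reconfiguration_delta(before, after):
    """Return (connections to tear down, connections to set up) at one reconfiguration time."""
    return before - after, after - before


def has_conflict(connections):
    """True if two connections in the same interval drive the same destination port."""
    dests = [dst for _, dst in connections]
    return len(dests) != len(set(dests))


# Hypothetical connection sets for the intervals [t0, t1) and [t1, t2).
pattern_t0 = {("AG0", "MEM0"), ("MEM0", "MAC.a"), ("AG1", "MEM1"), ("MEM1", "MAC.b")}
pattern_t1 = {("AG0", "MEM0"), ("MEM0", "ALU.a"), ("AG2", "MEM2"), ("MEM2", "ALU.b")}

tear_down, set_up = reconfiguration_delta(pattern_t0, pattern_t1)
print(sorted(tear_down), sorted(set_up))
```

Connections that appear in both intervals, such as AG0 to MEM0 here, need no rewriting at t1, so only the delta has to be configured.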
3 OVERVIEW OF THE
ARCHITECTURE DESIGN
METHODOLOGY
There are two key issues to be resolved in order to make
the methodology practical to the designers. Firstly, the
architecture combines two very distinct models of
computation, control-driven computation on the general-
purpose microprocessor and data-driven computing on
the clusters of satellites. Therefore, the goal of the
architectural exploration process is to partition the
application over these two computing paradigms so that
performance and energy dissipation constraints are met
(during the compilation process). Secondly,
optimizations related to reconfigurability have to be
supported at both the architecture design and the
compilation stage. Both of these issues require careful
modeling of the algorithm and the underlying
heterogeneous architectures.
The basic flow of the design exploration methodology
[10] is presented in Figure 5. After the introduction of
terminology, a short overview of the overall flow is given
in Section 3.1.1. A more detailed description of this
methodology and the tools developed can be found in [11].
[Figure 3. Flow of Computation on Pleiades (application threads alternate between the programmable processor and the satellites at split and join points)]
[Figure 4. Model of Reconfiguration (architecture instance: 3 address generators, 3 memories, 1 MAC/MUL and 1 ALU; interconnect patterns C1 to C5 and C’1, C’4, C’8 over reconfiguration times t0, t1, t2)]
Definition of Kernel - A computationally intensive part of
the algorithm that often resides in nested loops.
3.1.1 Basic Methodology Flow
The methodology flow takes DSP or communication
algorithms specified in a high-level language (e.g. C) as
input. The initiation of the design process requires the
establishment of a first-order baseline model of the
algorithm complexity and bottlenecks. Such a model
allows for the selection and execution of architecture-
independent optimizations (stage 1). As architectural
choices have yet to be made, this model assumes the
presence of a “virtual architecture” with some generic
operator costs attached to it. Optimizations at this stage
only address either win-only situations or order-of-magnitude
improvements, so that absolute accuracy is not
that important.
After a satisfactory algorithm formulation is obtained, the
architectural mapping and partitioning process can be
entered. To be meaningful, the partitioning process
should be based on realistic bottom-up information
regarding the cost of implementing functions and
operations on the different architectural choices. Our
design-exploration methodology relies extensively on the
availability of power-delay models for all components in
its architectural library (stage 2). The estimation methods
employed in each of these models vary depending upon
the type of the module and the desired accuracy. While
the absolute accuracy of these characterizations is not
crucial, it is important that bounds on the prediction
accuracy are known. Only “improvements” that fall
outside the noise level of the estimations are accepted.
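On the reading that a change only counts when its estimated gain exceeds the estimation error, the acceptance test can be sketched as below. The function name, the symmetric relative error bound and the example numbers are all assumptions made for illustration; the paper does not specify how the bounds are applied.

```python
def is_significant(baseline: float, candidate: float, rel_error: float) -> bool:
    """Accept a candidate only if its worst case still beats the baseline's best case.

    baseline, candidate: estimated costs (e.g. energy) from the macro-models.
    rel_error: assumed symmetric relative bound on the estimation error.
    """
    worst_candidate = candidate * (1 + rel_error)
    best_baseline = baseline * (1 - rel_error)
    return worst_candidate < best_baseline


# Hypothetical estimates with a 10% error bound.
print(is_significant(100.0, 90.0, 0.10))  # gain is inside the error band
print(is_significant(100.0, 70.0, 0.10))  # gain exceeds the error band
```

With a 10% bound, a 10% estimated gain is indistinguishable from estimation error, while a 30% gain is not.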
The architecture partitioning and mapping process is
started by establishing an initial solution. Given the
implementation simplicity of a pure software
implementation, we have adopted a “software-centric”
approach that assumes that the whole algorithm is
initially mapped onto the microprocessor (stage 3). This
establishes how closely such a solution adheres to the
bottlenecks. A rank ordering of the dominant compute
kernels is established. In stage 4, dominant kernels are
evaluated in order of importance and mapped to satellites
for better power and performance [12]. If a hardware
implementation is deemed worthwhile, a repartitioning of
the design is established.
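Stages 3 and 4 can be sketched as a ranking followed by a greedy repartitioning. This is a minimal illustration, not the authors' tool: the function names and the energy figures are hypothetical, and the decision rule (move a kernel when its satellite estimate is lower) is a simplifying assumption.

```python
def rank_kernels(energy_on_cpu: dict) -> list:
    """Order kernels by their estimated cost on the microprocessor, largest first."""
    return sorted(energy_on_cpu, key=energy_on_cpu.get, reverse=True)


def partition(energy_on_cpu: dict, energy_on_satellites: dict) -> dict:
    """Visit kernels in rank order and move one to satellites when that estimate is lower."""
    mapping = {}
    for kernel in rank_kernels(energy_on_cpu):
        hw = energy_on_satellites.get(kernel)  # None: no satellite mapping available
        moved = hw is not None and hw < energy_on_cpu[kernel]
        mapping[kernel] = "satellites" if moved else "microprocessor"
    return mapping


# Hypothetical per-kernel energy estimates in arbitrary units.
cpu = {"dot_product": 50.0, "fir": 12.0, "control": 5.0}
sat = {"dot_product": 4.0, "fir": 1.5}

print(rank_kernels(cpu))
print(partition(cpu, sat))
```

Starting from the all-software solution keeps every intermediate partition executable, which matches the software-centric approach described above.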
After all costly kernels are mapped to accelerators, a final
partition of the algorithm across different architectures is
obtained (stage 5) and memory assignment and allocation
is performed to minimize memory transfers. While the
rest of the algorithm remains in the high-level language, the
portions of the algorithm to be implemented by satellites
are specified in an intermediate form that is capable of
modeling the structure of the reconfigurable satellite
operations (i.e. as a netlist). Based on this conceptual
netlist, implementation optimizations (stage 6) [13] are
invoked to choose a good reconfigurable interconnect
architecture (during architecture design path) and to
generate efficient configuration and interface code
(during compilation and test vector generation path) [14].
[Figure 5. The Software Methodology Flow (stages 1 to 6, from algorithm refinement and architecture characterization through kernel ranking, mapping and partitioning to implementation optimization)]
small local instruction memory, and can be programmed
to support various types of addressing patterns and nested
loops with loop counters and stride counters. It behaves
as the local controller of data-flow kernels by initiating
the data-flow threads, and by signaling the end of the
data-flow threads to the ARM8.
4.1.3 Embedded FPGA
Commercial FPGAs are often notorious for their energy
consumption and most of them cannot be embedded in a
system-on-a-chip. Therefore, we make use of an in-house
low-energy embedded FPGA [15].
The embedded FPGA contains a 4×8 array of 5-input, 3-output
CLBs, optimized for arithmetic operations and
data-flow control functions. Its energy efficiency has
been measured to be 70 times higher than equivalent
commercial solutions. This energy-efficient FPGA design
is realized by combining both architectural and circuit-level
modifications, which are outlined below.
Logic block: The logic block is designed to improve
the interconnect utilization, and hence the interconnect
energy. It is made up of a cluster of 3-input look-up
tables. It can be used to implement 5-input random logic
or 2-bit arithmetic operations.
Low-swing circuit: The low-swing interconnect circuit
improves the energy by a factor of 2 compared to a full-swing
circuit. The logic blocks operate at 1.5V while the
low-swing signal lines have a 0.8V swing.
Interconnect architecture: The interconnect is made
up of 3 levels of connectivity. Each level is targeted at
providing low-energy connections for specific path
lengths. The Level0 structure is targeted at connections
between nearest neighbors. Each logic block can connect
to 8 of its immediate neighbors. The Level1 structure is
the traditional symmetric mesh architecture, and is good
for intermediate-length wires. The Level2 structure is
used for implementing connections that span a
significant fraction of the chip. The connectivity of each
of these structures has been optimized using architecture
evaluation tools to obtain energy efficiency.
Clock distribution: More than 80% of the clock
energy is dissipated in the clock distribution network.
Double-edge-triggered flip-flops are used to reduce the
clock activity by a factor of 2, giving a proportional
reduction in energy. The clock distribution network also
uses the low-swing technique for energy reduction.
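The two energy factors quoted in this section can be checked against a first-order switching model. The sketch assumes that the low-swing driver draws its charge from the 1.5V logic supply, so that the energy per transition is C·Vdd·Vswing; this supply arrangement, the 1 pF wire capacitance and the 40 MHz reference rate are assumptions for illustration and are not stated in the text.

```python
def switching_energy(c_farads: float, v_supply: float, v_swing: float) -> float:
    """First-order energy drawn from the supply per charging transition: C * Vdd * Vswing."""
    return c_farads * v_supply * v_swing


def clock_power(c_farads: float, v_supply: float, v_swing: float, f_clk: float) -> float:
    """First-order clock network power: one charging transition per clock cycle."""
    return switching_energy(c_farads, v_supply, v_swing) * f_clk


C_WIRE = 1e-12  # hypothetical 1 pF wire load

# Low-swing (0.8V) versus full-swing (1.5V) signaling on the same wire.
full_swing = switching_energy(C_WIRE, 1.5, 1.5)
low_swing = switching_energy(C_WIRE, 1.5, 0.8)
swing_ratio = full_swing / low_swing

# Double-edge-triggered flip-flops need half the clock frequency for the same data rate.
single_edge = clock_power(C_WIRE, 1.5, 1.5, 40e6)
double_edge = clock_power(C_WIRE, 1.5, 1.5, 20e6)
clock_ratio = single_edge / double_edge

print(round(swing_ratio, 3), round(clock_ratio, 3))
```

Under these assumptions the swing ratio comes out near 1.9, roughly in line with the factor of 2 quoted for the low-swing circuit, and the clock ratio is exactly 2 because the model is linear in frequency.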
4.2 Communication Interface Description
4.2.1 Inter-Satellite Communication Interface
The data-flow driven synchronization between the
processing elements employs a 2-phase self-timed
handshaking scheme with REQUEST and
ACKNOWLEDGE signals (Figure 7a), realized in a
globally asynchronous locally synchronous
implementation fashion. This approach not only reduces
power consumption by ensuring that a module is only
activated when data is ready, but also allows various
modules to operate at different and dynamically varying
rates. Data links combine 16-bit fixed-width data words
with 2-bit control tokens that serve as tags for different
data structures (scalar, vector, or matrix) that are
supported by the network (Figure 7b). Each module
includes a network interface controller to coordinate
communication and synchronization.
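The link protocol can be modeled in software as a two-phase (transition-signaling) handshake carrying an 18-bit value. The sketch below is a behavioral illustration only: the class and function names are hypothetical, and the numeric token encoding is an assumption, since the text only says that 2-bit tokens tag the data structures.

```python
DATA, END_OF_VECTOR, END_OF_MATRIX = 0, 1, 2  # hypothetical 2-bit token encoding


def pack(word: int, token: int) -> int:
    """Pack a 16-bit data word and a 2-bit control token into one 18-bit link value."""
    assert 0 <= word < (1 << 16) and 0 <= token < 4
    return (token << 16) | word


def unpack(value: int):
    """Split an 18-bit link value back into (word, token)."""
    return value & 0xFFFF, value >> 16


class Link:
    """Two-phase handshake: each toggle of req or ack is one event."""

    def __init__(self):
        self.req = 0
        self.ack = 0
        self.value = 0

    def send(self, word: int, token: int = DATA) -> bool:
        if self.req != self.ack:   # previous word not yet acknowledged
            return False
        self.value = pack(word, token)
        self.req ^= 1              # toggle REQUEST: data is valid
        return True

    def receive(self):
        if self.req == self.ack:   # nothing pending
            return None
        word, token = unpack(self.value)
        self.ack ^= 1              # toggle ACKNOWLEDGE: data consumed
        return word, token


def transfer_vector(link: Link, vector: list) -> list:
    """Send a vector word by word, tagging the last word with END_OF_VECTOR."""
    received = []
    for i, x in enumerate(vector):
        token = END_OF_VECTOR if i == len(vector) - 1 else DATA
        assert link.send(x, token)
        received.append(link.receive())
    return received


link = Link()
out = transfer_vector(link, [3, 5, 7])
print(out)
```

Because a sender can only proceed once the previous word is acknowledged, a slow receiver throttles the sender without any shared clock, which is the property that lets modules run at different rates.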
4.2.2 Communication Interface between the
Microprocessor and Satellites
This interface control unit coordinates synchronization
and communication between the synchronous ARM8 core
and the asynchronous reconfigurable data-paths, most
importantly helping the core perform the reconfiguration
of satellites by mapping all the configuration memories to
the ARM8 memory space.
[Figure 7. Data-flow driven globally asynchronous locally synchronous communication protocol: (a) globally asynchronous, locally synchronous signaling; (b) control tokens differentiate and delineate data streams and data structures (scalar, vector, matrix)]
The interface logic controls the strobe generation for
configuration reads/writes, handshakes, network reset,
start requests for the address generators and IO ports.
The acknowledge signals for the address generators and
IO ports are used to detect the end of a kernel, at which point the
ARM8 core is interrupted. Interrupt mask registers and
control registers are used to synchronize the ARM8 with the
asynchronous satellite array.
The system supports two modes of operation: TEST and
SYSTEM modes. As part of the test strategy, the TEST
mode allows us to bypass the ARM8 processor and
execute individual kernels through the interface. In the
SYSTEM mode, instead of an on-chip cache for the
embedded ARM8, an external SRAM (with zero bus
turnaround) serves as the memory for the processor. In
order to meet the 40MHz performance for the
application, the off-chip memory is clocked twice as fast
as the core. The interface is designed to meet this
bandwidth.
4.3 Reconfigurable Interconnect
Architecture
Keeping the energy of the reconfigurable interconnect
network as low as possible while still meeting the
flexibility requirement is crucial to the success of our
approach to heterogeneous reconfigurable architecture.
This is realized by a combination of architecture and
circuit optimizations.
4.3.1 Hierarchical Interconnect Network Architecture
Energy-efficient architectures must take advantage of the
locality and regularity of computation. Exploiting locality
by identifying naturally isolated clusters of operations can
guide hardware partitioning so as to minimize
global busses, thus reducing the
interconnect power. Although the underlying system is
heterogeneous, DSP algorithms usually have
inherently repetitive computation patterns. Partitioning
the hardware so as to preserve such regularity leads to a
simpler interconnect structure with reduced fan-ins and
fan-outs. Especially for reconfigurable architectures,
a more regular interconnect architecture achieves better
routability and less reconfiguration overhead. There is a
trade-off between flexibility and energy efficiency. For
instance, the crossbar network has the most flexibility,
but also the least energy efficiency. In stage 6 of the
design methodology, crossbar, mesh and hierarchical
mesh structures were evaluated, and a 2-level hierarchical
mesh was chosen for this implementation.
The implemented hierarchical interconnect mesh
network can provide the optimum energy efficiency with
the right degree of flexibility within the application domain
of interest. Several clusters of tightly connected modules
are formed based on the communication locality. Each
cluster has a local mesh with 2 buses per channel, and a
universal switchbox at every intersection point (Figure
6). Global interconnections are supported by a second-level,
larger-granularity mesh (implemented on the higher
metal layers) with 2 buses per channel and hierarchical
switchboxes, located at the key connection points. The
hierarchical switchbox (Figure 6) contains a universal
switchbox for each mesh-level, as well as a number of
cross-level interconnect switches. This hierarchical
network architecture requires only a limited number of
buses to achieve sufficient connection flexibility for our
target applications, and cuts the interconnect energy cost
by a factor of 7 compared to a straightforward crossbar
network implementation.
4.3.2 Low-swing Interconnect Interface Circuits
Communication energy is further reduced by employing a
low-swing (0.4V) pseudo-differential signaling scheme
(Figure 8). The wire capacitance loads are also reduced
by simplifying the switch network with NMOS-only
switches. The circuit employs an NMOS-only push-pull
driver with a very low voltage supply. The receiver is a
clocked sense amplifier with low input-offset and good
sensitivity followed by a static flip-flop. It contains
double pairs of input transistors, with the gates of P1 and
P3 connected to d, while the gates of P4 and P2 are biased at
GND and REF respectively. Figure 8 shows the
signaling waveforms. Based on our asynchronous
clocking protocol, the clock signal is generated from the
handshaking signals. The low-swing signaling reduces
the interconnect energy by a factor of 3.4 compared to a
full-swing CMOS implementation [17].
5 RESULTS AND STATISTICS
[Figure 8. Pseudo-differential low-swing interconnect circuit (driver and sense-amplifier receiver schematic with waveforms for clk, in, d and out; 0.4V and 1V levels)]
Maia is a 210-pin chip that contains 1.2 million
transistors and measures 5.2×6.7 mm² in a 0.25 µm 6-metal
CMOS technology. Figure 9 shows the die photo
of the Maia chip and Table 1 summarizes the
implementation statistics of the chip.
Technology                   0.25 µm 6-level metal CMOS
Main Supply Voltage          1 V
Additional Voltages          0.4 V, 1.5 V
Die Size                     5.2 mm x 6.7 mm
Transistor Count             1.2 million transistors
Average Cycle Speed          40 MHz
Average Power Dissipation    1.5 - 2 mW
Table 1: Chip Characteristics
Hardware module        Pipeline speed (ns)   Energy consumption per operation (pJ)   Area (mm²)
MAC                    24                    21                                      0.25
ALU                    20                    8                                       0.09
Memory (1K x 16)       14                    8                                       0.32
Memory (512 x 16)      11                    7                                       0.16
Address generator      20                    6                                       0.12
Interconnect network   10                    1*                                      NA
FPGA                   25                    18**                                    2.76
Table 2: Performance of hardware modules
* This number is the average energy consumption per connection.
** This number is the average energy consumption across various arithmetic functions.
Table 2 shows the performance of the different chip
components (based on a per-block analysis) from
PowerMill simulation.
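The per-operation figures in Table 2 lend themselves to quick bottom-up estimates. The sketch below sums them for one iteration of a dot-product-like kernel. The module counts per iteration (two address generators, two 512 x 16 memories, one MAC and four interconnect connections) are an assumed mapping chosen for illustration, not one reported in the paper, and the dictionary keys are hypothetical labels.

```python
# Energy per operation in pJ, taken from Table 2.
ENERGY_PJ = {
    "MAC": 21,
    "ALU": 8,
    "MEM_1K": 8,
    "MEM_512": 7,
    "AG": 6,
    "LINK": 1,   # average per interconnect connection
    "FPGA": 18,  # average across arithmetic functions
}


def energy_per_iteration(usage: dict) -> int:
    """Sum module energies for one kernel iteration; usage maps module name to operation count."""
    return sum(ENERGY_PJ[module] * count for module, count in usage.items())


# Assumed mapping: two operand streams, each AG -> MEM -> MAC input.
dot_product_usage = {"AG": 2, "MEM_512": 2, "MAC": 1, "LINK": 4}
per_iteration_pj = energy_per_iteration(dot_product_usage)
print(per_iteration_pj, "pJ per multiply-accumulate")
```

In this assumed mapping the MAC itself accounts for less than half of the energy per iteration, which illustrates why memory and address generation costs are modeled alongside the arithmetic units.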
Figure 10 (see the end of the paper, after references)
illustrates the signals that are available at the I/O pins.
During the TEST mode, all satellites and the
reconfigurable interconnect can be configured by writing
to Taddr and Tdata pins (to the ConfigAdd and
ConfigData buses) and the result of the computation can
be read on the Tdata and FIQ pins (from ReadData and
ACK buses). In addition, simple programs can also be
fed to ARM8 via Tdata pins to test satellite configuration
reading and writing. The current test set-up supports the
test mode described above and a board to verify the
SYSTEM mode is being designed. The HP 16702A logic
analysis system was used for generating the test vectors
(derived from Timemill simulations) for the TEST mode.
Pattern acquisition was used for verifying the results of
the computations after detecting end of kernel using an
external interrupt signal.
Energy and performance of all kernels are tested in the
TEST mode. Based on this information, the estimated
energy dissipation of the processor when programmed for
a VSELP voice coder (with 1.8mW total power
consumption) is presented in Table 3, including a
breakdown of the energy over the major functions.
Dominant kernels are directly mapped onto hardware
satellites, and their run-time reconfiguration is performed
by the ARM8 core. Therefore, the kernel energy presented
in the table incorporates contributions from both the
satellites and ARM8 configuration. The program control part of
the algorithm is mapped entirely to software. The
total energy efficiency is a factor of 8 better than the best
reported in literature [18].
8/8/2019 A Low-Energy Heterogeneous Reconfigurable DSP IC
http://slidepdf.com/reader/full/a-low-energy-heterogeneous-reconfigurable-dsp-ic 8/10
Functionality                        Energy consumption (mJ) for 1 sec of VSELP speech processing
Kernels:
  Dot product                        0.738
  FIR filter                         0.131
  IIR filter                         0.021
  Vector sum with scalar multiply    0.042
  Compute code                       0.011
  Covariance matrix compute          0.006
Program control                      0.838
Total                                1.787
Table 3. VSELP energy breakdown
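The entries of Table 3 can be cross-checked with a few lines of arithmetic. The values below are copied from the table; the variable names are only labels for this check.

```python
# Energy in mJ for 1 second of speech processing, from Table 3.
kernels_mj = {
    "dot product": 0.738,
    "FIR filter": 0.131,
    "IIR filter": 0.021,
    "vector sum with scalar multiply": 0.042,
    "compute code": 0.011,
    "covariance matrix compute": 0.006,
}
program_control_mj = 0.838

kernel_total_mj = sum(kernels_mj.values())
total_mj = kernel_total_mj + program_control_mj

# Energy in mJ over a 1 s interval is numerically the average power in mW.
average_power_mw = total_mj / 1.0
control_share = program_control_mj / total_mj

print(f"kernels: {kernel_total_mj:.3f} mJ, total: {total_mj:.3f} mJ")
print(f"average power: {average_power_mw:.3f} mW, program control share: {control_share:.1%}")
```

The kernel entries sum to 0.949 mJ, and adding program control reproduces the 1.787 mJ total, which corresponds to the roughly 1.8mW quoted in the text. Program control on the microprocessor accounts for about 47% of the total.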
6 CONCLUSION
In this paper, Pleiades, a heterogeneous reconfigurable
architecture template, is introduced and a design
methodology to map algorithms to architectures is
summarized. The details of the design and
implementation of an instance of the Pleiades
architecture are presented. The implementation echoes the
current trend in system-on-a-chip design, which combines
embedded components of various flexibility and
reconfigurability (microprocessor, ASICs, FPGA). The
heterogeneity and reconfigurability of the architecture
prove to be very energy efficient when compared to
state-of-the-art programmable processors.
[Figure 9. Maia die photo (labeled blocks: ARM8 core, interface, FPGA, ALUs, MACs, memories (MEM), address generators (AGU), interconnect network)]
7 ACKNOWLEDGEMENTS
We would like to acknowledge DARPA’s support for the
Pleiades project (DABT-63-96-C-0026). The authors
would like to thank Seno Katsunori and Yuji Ichikawa
for their early work on the Pleiades prototype and
evaluation. We would also like to acknowledge the other
members of the Maia design team.
8 REFERENCES
[1] M. Goel and N. R. Shanbhag, “Low-power
equalizers for 51.84 Mb/s very high-speed digital
subscriber loop [VDSL] modems”, Proceedings of
IEEE Workshop on Signal Processing Systems, Oct.
1998, Boston.
[2] A. Abnous and J. Rabaey, “Ultra-Low-Power
Domain-Specific Multimedia Processors”,
Proceedings of the IEEE VLSI Signal Processing
Workshop, San Francisco, California, USA, October
1996.
[3] A. Abnous et al., “Evaluation of a Low-Power
Reconfigurable DSP Architecture”, Proceedings of
the Reconfigurable Architectures Workshop, Orlando,
Florida, USA, March 1998.
[4] J. Borel, “Technologies for multimedia systems on a
chip”, 1997 IEEE International Solid-State Circuits
Conference. pages. 18-21.
[5] G. R. Goslin, “A Guide to Using Field
Programmable Gate Arrays for Application-Specific
Digital Signal Processing Performance”, Proceedings
of SPIE, vol. 2914, pp. 321-331.
[6] J. Hauser and J. Wawrzynek, “GARP: A MIPS
processor with a reconfigurable coprocessor”, in J.
Arnold and K. L. Pocek, editors, Proceedings of
IEEE Workshop on FPGA for Custom Computing
Machines, Napa, CA, April 1997.
[7] T. Garverick et al., NAPA1000,
http://www.national.com/appinfo/milaero/napa1000
[8] J. M. Rabaey, “Reconfigurable Computing: the
Solution to Low Power Programmable DSP”, Proc.
1997 ICASSP Conference, Munich, April 1997.
[9] M. Benes, “Design and Implementation of
Communication and Switching Techniques for the
Pleiades Family of Processors”, Master’s Thesis, UC
Berkeley, 1999.
[10] M. Wan, D. Lidsky, Y. Ichikawa and J. Rabaey. “An
Energy-Conscious Methodology for Early Exploration
of Heterogeneous DSPs”, Proceedings of CICC 1998.
[11] M. Wan, H. Zhang, V. George, M. Benes, A.
Abnous and J. Rabaey, "Design Methodology of a
Low-Energy Reconfigurable Single-Chip DSP
System", Journal of VLSI Signal Processing 2000.
[12] M. Wan, H. Zhang, M. Benes and J. Rabaey, “A
Low-Power Reconfigurable Data-Driven DSP
System”, Proceedings of the SiPS99
[13] H. Zhang, M. Wan, V. George, J. Rabaey, “Interconnect
Architecture Exploration for Low Energy Reconfigurable
Single-Chip DSPs”, Proceedings of the
WVLSI, Orlando, FL, USA, April 1999.
[14] S. Li, M. Wan and J. Rabaey, “Configuration Code
Generation and Optimizations for Heterogeneous
Reconfigurable DSPs”, Proceedings of SiPS, 1999.
[15] V. George, H. Zhang, J. Rabaey, “Low Energy
FPGA Design”, Proceedings of ISLPED 1999.
[16] T. Burd, T. Pering, A. Stratakos, R. Brodersen, “A
Dynamic Voltage-Scaled Microprocessor System”,
Proceedings of ISSCC 2000.
[17] Hui Zhang et al., “Low-Swing Interconnect Interface
Circuits”, Proceedings of ISLPED 1997.
[18] Wai Lee et al., “A 1V DSP for Wireless
Communication”, Digest of Technical Papers of ISSCC 97.
[Figure 10. Maia chip testing strategy (ARM8 core, interface, satellites and off-chip SRAM in SYSTEM mode; in TEST mode a logic analyzer drives the IO pins Taddr<15:0>, Tdata<31:0>, Test, TRwn, TClk, FIQ, etc.)]