TMS320C4X Digital Signal Processing


[Figure: TMS320C4x block diagram]

The TMS320C4x devices are 32-bit floating-point digital signal processors optimized

for parallel processing. The ’C4x family combines a high performance CPU and DMA

controller with up to six communication ports to meet the needs of multiprocessor and

I/O-intensive applications. Each ’C4x device contains an on-chip analysis module, which

supports hardware breakpoints for parallel processing development and debugging. The


’C4x family is source-code compatible with the TMS320C3x family of floating-point

DSPs.

The TMS320C40 is the original member of the ’C4x family. It features a CPU that can

deliver up to 30 MIPS/60 MFLOPS with a maximum I/O bandwidth of 384M bytes/s.

The ’C40 has 2K words of on-chip RAM, 128 words of program cache and a bootloader.

Two external buses provide an address reach of 4 gigawords of unified memory space.

The ’C40 is available in a 325-pin CPGA package.

The TMS320C44

The TMS320C44 is a lower cost version of the ’C40, for parallel processing applications

that are more price sensitive. The ’C44 features four communication ports and has an

external address reach of 32M words over two external buses. To further reduce cost, the

’C44 comes in a 304-pin PQFP package. The TMS320C44 can deliver up to 30 MIPS/60

MFLOPS performance with a maximum I/O bandwidth of 384M bytes/s. The ’C44 is

source-code compatible with the ’C40.

1. Key Features of the TMS320C4x

The TMS320C4x has several key features:

_ Up to 40 MIPS/80 MFLOPS performance with 488-Mbytes/s I/O capability

_ IEEE floating-point conversion for ease of use

_ Register-based CPU

_ Single-cycle byte and half-word manipulation capabilities

_ Divide and square root support for improved performance

_ On-chip memory includes 2K words of SRAM, 128 words of program

cache, and bootloader

_ Two external buses providing an address reach of up to 4 gigawords

_ Two memory-mapped 32-bit timers

_ 6- and 12-channel DMA

_ Up to six communication ports for multiprocessor communication

_ Idle mode for reduced power consumption


Central Processing Unit (CPU)

The ’C4x’s CPU has a register-based architecture. The CPU consists of the

following components:

1. Floating-point/integer multiplier

2. Arithmetic logic unit (ALU)

3. 32-bit barrel shifter

4. Internal buses (CPU1/CPU2 and REG1/REG2)

5. Auxiliary register arithmetic units (ARAUs)

6. CPU register file

Floating-Point/Integer Multiplier

The multiplier performs single-cycle multiplications on 32-bit integer and 40-bit floating-

point values. The ’C4x implementation of floating-point arithmetic allows for floating-

point operations at fixed-point speeds via a 25-ns instruction cycle and a high degree of

parallelism. To gain even higher throughput, you can use parallel instructions to perform

a multiply and ALU operation in a single cycle.
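
The 25-ns cycle time quoted above lines up with the 40 MIPS/80 MFLOPS peak figure in the key features list. The arithmetic can be sketched as follows; the function names are illustrative, and the factor of two assumes one parallel multiply-plus-ALU instruction completes every cycle.

```python
# Back-of-the-envelope check of the peak rates quoted in this document.
CYCLE_NS = 25  # instruction cycle time quoted in the text

def peak_mips(cycle_ns):
    """One instruction per cycle: instructions per microsecond equals MIPS."""
    return 1000 / cycle_ns

def peak_mflops(cycle_ns, flops_per_cycle=2):
    """Assumes a parallel multiply + ALU instruction on every cycle."""
    return flops_per_cycle * peak_mips(cycle_ns)

print(peak_mips(CYCLE_NS), peak_mflops(CYCLE_NS))
```

These are peak numbers only; sustained throughput depends on memory access patterns and pipeline conflicts, which this sketch does not model.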

When the multiplier performs floating-point multiplication, the inputs are 40-bit floating-

point numbers, and the result is a 40-bit floating-point number. When the multiplier

performs integer multiplication, the input data is 32 bits and yields either the 32 most-

significant bits or the 32 least-significant bits of the resulting 64-bit product.
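
The choice between the upper and lower halves of the 64-bit integer product can be illustrated with a small model. This is a sketch, not the device's instruction semantics: the helper names are hypothetical, and it assumes both operands are treated as signed two's-complement values.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def multiply_halves(a, b):
    """Return (high32, low32) of the 64-bit product of two signed 32-bit values."""
    product = to_signed32(a) * to_signed32(b)
    product &= (1 << 64) - 1          # wrap to 64 bits, two's complement
    return (product >> 32) & MASK32, product & MASK32

high, low = multiply_halves(0x00010000, 0x00010000)
print(hex(high), hex(low))
```

Here 0x10000 squared is 2**32, so the low half is zero and the entire result sits in the high half, which is the case where keeping only the 32 least-significant bits would lose the answer.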

Arithmetic Logic Unit (ALU) and Internal Buses

The ALU performs single-cycle operations on 32-bit integer, 32-bit logical, and

40-bit floating-point data, including single-cycle integer and floating-point conversions.

Results of the ALU are always maintained in 32-bit integer or 40-bit floating-point

formats. The barrel shifter is used to shift up to 32 bits left or right in a single cycle.

Four internal buses, CPU1, CPU2, REG1, and REG2, carry two operands from

memory and two operands from the register file, thus allowing parallel multiplies

and adds/subtracts on four integer or floating-point operands in a single cycle.

Auxiliary Register Arithmetic Units (ARAUs)

The two auxiliary register arithmetic units (ARAU0 and ARAU1) can generate

two addresses in a single cycle. The ARAUs operate in parallel with the multiplier


and ALU. They support addressing with displacements, index registers (IR0 and IR1),

and circular and bit-reversed addressing.
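
Bit-reversed addressing is typically used to reorder FFT data. A minimal sketch of the index sequence it produces is shown below, modeled as a reverse-carry addition of a step equal to half the FFT size. The function names are hypothetical and the model works on buffer indices rather than real ARAU register contents.

```python
def reverse_bits(value, width):
    """Reverse the lowest `width` bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def bit_reversed_sequence(n_points):
    """Index order for an n_points buffer (n_points must be a power of two)."""
    width = n_points.bit_length() - 1
    step = n_points // 2              # assumed index-register value: half the FFT size
    index, order = 0, []
    for _ in range(n_points):
        order.append(index)
        # Reverse-carry add: the carry propagates from the MSB toward the LSB.
        total = reverse_bits(index, width) + reverse_bits(step, width)
        index = reverse_bits(total % n_points, width)
    return order

print(bit_reversed_sequence(8))
```

For an 8-point buffer this yields the order 0, 4, 2, 6, 1, 5, 3, 7.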

CPU Primary Register File

The ’C4x primary register file provides 32 registers in a multiport register file that is

tightly coupled to the CPU. Table 2–1 lists register names and functions, followed by the

section number and page of each description. All of the primary register file registers can

be operated upon by the multiplier and ALU and can be used as general-purpose

registers. However, the registers also have some special functions. For example, the 12

extended-precision registers are especially suited for maintaining floating-point results.

The eight auxiliary registers support a variety of indirect addressing modes and can

be used as general-purpose 32-bit integer and logical registers. The remaining registers

provide system functions such as addressing, stack management, processor status,

interrupts, and block repeat.

The extended-precision registers (R0–R11) are capable of storing and supporting

operations on 32-bit integer and 40-bit floating-point numbers. Any instruction

that assumes that the operands are floating-point numbers uses bits 39–0. If the operands

are either signed or unsigned integers, only bits 31–0 are used, and bits 39–32 remain

unchanged. This is true for all shift operations.
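
The way integer results leave bits 39–32 untouched can be sketched with a 40-bit register modeled as a Python integer. The function names are illustrative, not part of any TI tool.

```python
MASK32 = (1 << 32) - 1
MASK40 = (1 << 40) - 1

def write_integer(reg40, value):
    """Integer result: replace bits 31-0 and keep bits 39-32 as they were."""
    upper = reg40 & MASK40 & ~MASK32
    return upper | (value & MASK32)

def write_float(reg40, value40):
    """Floating-point result: all 40 bits (39-0) are replaced."""
    return value40 & MASK40

r0 = 0xAB12345678                    # arbitrary 40-bit register contents
print(hex(write_integer(r0, 0xDEADBEEF)))
```

After the integer write the register holds 0xABDEADBEEF: the top byte 0xAB survives, which is why stale upper bits can remain in a register that is switched between integer and floating-point use.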

The 32-bit auxiliary registers (AR0–AR7) can be accessed by the CPU and

modified by the two auxiliary register arithmetic units (ARAUs). The primary

function of the auxiliary registers is the generation of 32-bit addresses. They

can also be used as loop counters or as 32-bit general-purpose registers that

can be modified by the multiplier and ALU.

The data page pointer (DP) is a 32-bit register. The 16 LSBs of the data page

pointer are used by the direct addressing mode as a pointer to the page of data

being addressed. The ’C4x can address up to 64K pages, each page containing

64K words.
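
The direct-addressing calculation described above (the 16 LSBs of DP concatenated with a 16-bit operand) can be sketched as follows. The function name is hypothetical.

```python
def direct_address(dp, operand16):
    """Form a 32-bit address: 16 LSBs of DP as the page, operand as the offset."""
    page = dp & 0xFFFF
    offset = operand16 & 0xFFFF
    return (page << 16) | offset

# 64K pages of 64K words each cover the full 4-gigaword space.
assert (1 << 16) * (1 << 16) == 1 << 32

print(hex(direct_address(0x002F, 0xF800)))
```

With DP set to page 002Fh and an operand of F800h, the result is 002F F800h, the start of the on-chip RAM range given in the memory map.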

The 32-bit index registers contain the value used by the auxiliary register

arithmetic unit (ARAU) to compute an indexed address.

The ARAU uses the 32-bit block size register (BK) in circular addressing to

specify the data block size.
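
Circular addressing with a block size taken from BK can be modeled roughly as below. This is a simplified sketch under stated assumptions: the buffer is assumed to start at an address whose low k bits are zero, where 2**k is greater than the block size, and the function name is hypothetical.

```python
def circular_next(address, step, block_size):
    """Advance an address by step, wrapping inside a block_size-word buffer."""
    k = block_size.bit_length()       # smallest k with 2**k > block_size
    index = address & ((1 << k) - 1)  # position inside the buffer
    base = address - index            # assumed aligned start of the buffer
    return base + (index + step) % block_size

addr = 0x100
for _ in range(8):                    # walk a 6-word buffer past its end
    print(hex(addr))
    addr = circular_next(addr, 1, 6)
```

The walk visits 0x100 through 0x105 and then wraps back to 0x100, which is the behavior a delay line or FIR filter tap buffer relies on.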


The system stack pointer (SP) is a 32-bit register that contains the address

of the top of the system stack. The SP always points to the last element pushed

onto the stack. A push performs a pre-increment, and a pop performs a post-decrement

of the system stack pointer. The SP is manipulated by interrupts, traps, calls, returns, and

the PUSH/PUSHF and POP/POPF instructions.
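
The pre-increment push and post-decrement pop described above imply a stack that grows toward higher addresses. A minimal model, with a hypothetical class name and a dictionary standing in for memory:

```python
class SystemStack:
    """Toy model of the SP behavior: SP always points at the last pushed word."""

    def __init__(self, sp):
        self.sp = sp
        self.memory = {}              # address -> word, stands in for RAM

    def push(self, value):
        self.sp += 1                  # pre-increment
        self.memory[self.sp] = value

    def pop(self):
        value = self.memory[self.sp]
        self.sp -= 1                  # post-decrement
        return value

stack = SystemStack(sp=0x002FF800)
stack.push(0x11)
stack.push(0x22)
print(hex(stack.sp), hex(stack.pop()), hex(stack.pop()), hex(stack.sp))
```

One consequence is that the word at the initial SP value is never written by a push, so SP is normally initialized to one address below the first intended stack location.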

The status register (ST) contains global information related to the state of the CPU.

Typically, operations set the condition flags of the status register according to whether

the result is zero, negative, etc. This includes register load and store operations as well as

arithmetic and logical functions. When the status register is loaded, however, a bit-for-bit

replacement is performed with the contents of the source operand, regardless of the state

of any bits in the source operand. Therefore, following a load, the contents of the status

register are identically equal to the contents of the source operand.

The DMA coprocessor interrupt enable register (DIE) is a 32-bit register

containing 2- and 3-bit fields to designate the interrupt synchronization

scheme for each of the six DMA channels. It allows each DMA channel to service

a corresponding input communication port and output communication

port. Also, each DMA channel can be synchronized with external interrupts or

the on-chip timers.

The CPU internal interrupt enable register (IIE) is a 32-bit register that enables/

disables interrupts for the six communication ports, both timers, and the

six DMA coprocessor channels.

The IIOF flag register (IIF) controls the function (general-purpose I/O or interrupt)

of the four external pins (IIOF0 to IIOF3). It also contains timer/DMA interrupt

flags.

The 32-bit repeat counter (RC) register specifies the number of times a block

of code is to be repeated when a block repeat is performed. When the processor

is operating in the repeat mode, the 32-bit repeat start address register

(RS) contains the starting address of the block of program memory to be repeated,

and the 32-bit repeat end address register (RE) contains the ending

address of the block to be repeated.


The program counter (PC) is a 32-bit register containing the address of the

next instruction to be fetched. Although the PC is not part of the CPU register

file, it is a register that can be modified by instructions that modify the program

flow.

CPU Expansion Register File

Besides the CPU primary register file, the expansion register file contains two

special registers that act as pointers:

_ The IVTP register points to the interrupt-vector table (IVT), which defines

vectors for all interrupts.

_ The TVTP register points to the trap vector table (TVT), which defines vectors

for 512 traps.

Memory Organization

The total memory reach of the ’C4x is 4G 32-bit words. Program memory (on-chip

RAM or ROM and external memory) as well as registers affecting timers,

communication ports, and DMA channels are contained within this space. This

allows tables, coefficients, program code, and data to be stored in either RAM

or ROM. Thus, memory usage is maximized, and memory space allocated as

desired.

By manipulating one external pin (ROMEN), you can configure the first one-megaword

area of memory (0000 0000h to 000F FFFFh) to address the local

address bus or to address the on-chip ROM when you use the bootloader (with

remaining space reserved).

2.1 RAM, ROM, and Cache

The ROM block is reserved and contains a bootloader. Each RAM and ROM block is

capable of supporting two accesses in a single cycle. The separate program buses, data

buses, and DMA buses allow for parallel program fetches, data reads and writes, and

DMA operations. For example: the CPU can access two data values in one RAM block

and perform an external program fetch in parallel with the DMA coprocessor loading

another RAM block, all within a single cycle. The bootloader in the reserved ROM block

supports loading of program and data at reset time.

Loading is from 8-, 16-, or 32-bit wide memories or any one of the six communication


ports. A 128-word (128 x 32-bit) instruction cache is provided to store often-repeated sections of

code, thus greatly reducing the number of needed off-chip accesses. This allows for code

to be stored off-chip in slower, lower-cost memories. By using the cache to execute your

program, the external buses are freed for use by the DMA controller or CPU.

Memory Maps

For each processor, the level at the external pin ROMEN determines whether or not the

first megaword of memory addresses the internal ROM or external memory. The maps

illustrate the entire address space of the ’C40 and ’C44.

The value of ROMEN affects only the first megaword of memory:

_ A 1 at external pin ROMEN causes internal ROM to be enabled at 0000 0000h

with the one-megaword space reserved (0000 0000h – 000F FFFFh).

This is shown in the right side of the figure.

_ A 0 at ROMEN causes addresses 0000 0000h – 000F FFFFh to be accessible

on the local bus. This is shown in the left side of the figure.

The rest of the memory map is the same for either level of ROMEN:

_ The second megaword of memory is devoted to peripherals

_ The third megaword of memory contains the two 1K-word (4K-byte) blocks

of RAM (BLK0 and BLK1 as shown at 002F F800h – 002F FFFFh).

_ The rest of the first 2 gigawords (0030 0000h – 7FFF FFFFh) is on the local

bus (external).

_ The second 2 gigawords (8000 0000h – FFFF FFFFh) are on the global

bus (external).
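
The memory map above can be summarized as an address classifier. This is a sketch built only from the ranges listed in the text; the function name and region labels are hypothetical, and the split of the RAM range into BLK0 first and BLK1 second is an assumption.

```python
def classify_address(address, romen):
    """Return the region of a 32-bit word address for a given ROMEN level."""
    if not 0 <= address <= 0xFFFFFFFF:
        raise ValueError("address must fit in 32 bits")
    if address <= 0x000FFFFF:                      # first megaword
        return "boot ROM / reserved" if romen else "local bus (external)"
    if address <= 0x001FFFFF:                      # second megaword
        return "peripherals"
    if address <= 0x002FFFFF:                      # third megaword
        if address >= 0x002FFC00:
            return "on-chip RAM BLK1"              # assumed upper 1K words
        if address >= 0x002FF800:
            return "on-chip RAM BLK0"              # assumed lower 1K words
        return "third megaword (below the RAM blocks)"
    if address <= 0x7FFFFFFF:
        return "local bus (external)"
    return "global bus (external)"

for addr in (0x00000000, 0x00100020, 0x002FF800, 0x00300000, 0x80000000):
    print(hex(addr), classify_address(addr, romen=1))
```

Only the first megaword depends on ROMEN; every other range classifies the same way for either level.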

Caution

Any access to a reserved area in the address space produces unpredictable results.

Do not attempt to access reserved areas.

Memory Aliasing (’C44 only)

Memory aliasing occurs in the ’C44, since both the global and local ports on

that device have 24 pins, instead of the 31 pins on each port in the ’C40.

Memory aliasing causes the first 16 M of each address space to be repeated

in the memory map. Memory on the local bus occupies, and is aliased, in the

first 2 G of address space, and memory on the global bus occupies, and is


aliased, in the second 2 G of address space. Figure 2–7 shows the alias regions

on the local and global buses.
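
Because only 24 address pins are driven, two CPU addresses that differ only above bit 23 reach the same external location. A rough model follows; the function name is hypothetical, and the on-chip regions in the first three megawords are ignored.

```python
C44_ADDRESS_PINS = 24

def c44_external_access(address):
    """Return (bus, address seen on the pins) for a 32-bit CPU word address."""
    bus = "global" if address & 0x80000000 else "local"
    pins = address & ((1 << C44_ADDRESS_PINS) - 1)   # only 24 bits leave the chip
    return bus, pins

# Two different CPU addresses that alias to the same local-bus location.
print(c44_external_access(0x00300000))
print(c44_external_access(0x01300000))
print(1 << C44_ADDRESS_PINS)                          # words per alias region
```

The last line gives 16,777,216 words, the 16M region that repeats throughout each 2-gigaword half of the map.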

Memory Addressing Modes

The ’C4x supports a base set of general-purpose instructions as well as arithmetic-

intensive instructions that are particularly suited for digital signal processing

and other numeric-intensive applications. Refer to Chapter 6, Addressing

Modes, for detailed information on addressing.

Four groups of addressing modes are provided on the ’C4x. Each group uses

two or more of several different addressing types. The following list shows the

addressing modes with their addressing types.

_ General addressing modes:

_ Register. The operand is a CPU register.

_ Immediate. The operand is a 16-bit immediate value.

_ Direct. The operand is the contents of a 32-bit address

(concatenation of 16 bits of the data page pointer and a 16-bit

operand).

_ Indirect. A 32-bit auxiliary register indicates the address of the

operand.

_ Three-operand addressing modes:

_ Register. (same as for general addressing mode).

_ Indirect. (same as for general addressing mode).

_ Immediate. The operand is an 8-bit immediate value.

_ Parallel addressing modes:

_ Register. The operand is an extended-precision register.

_ Indirect. (same as for general addressing mode).

_ Branch addressing modes:

_ Register. (same as for general addressing mode).

_ PC-relative. A signed 16-bit displacement or a 24-bit displacement is

added to the PC.
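
For the PC-relative mode, the displacement field must be sign-extended before it is added. The sketch below shows that step for both field widths. The function names are hypothetical, and the exact PC value the hardware uses at the moment of the add (pipeline offsets, delayed branches) is deliberately not modeled.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(field, width):
    """Interpret the low `width` bits of field as a two's-complement number."""
    field &= (1 << width) - 1
    sign_bit = 1 << (width - 1)
    return field - (1 << width) if field & sign_bit else field

def branch_target(pc, field, width):
    """Add a signed 16- or 24-bit displacement to a PC value, wrapping at 32 bits."""
    return (pc + sign_extend(field, width)) & MASK32

print(hex(branch_target(0x1000, 0xFFFE, 16)))     # backward branch, displacement -2
print(hex(branch_target(0x1000, 0x000010, 24)))   # forward branch, displacement +16
```

The 16-bit form reaches roughly 32K words in either direction, while the 24-bit form reaches roughly 8M words.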

Internal Bus Operation

A large portion of the ’C4x’s high performance is due to internal busing and parallelism.


Separate buses allow for parallel program fetches, data accesses,

and DMA accesses:

_ Program buses PADDR and PDATA

_ Data buses DADDR1, DADDR2, and DDATA

_ DMA buses DMAADDR and DMADATA

These buses connect all of the physical spaces (on-chip memory, off-chip

memory, and on-chip peripherals) supported by the ’C4x. Figure 2–3 shows

these internal buses and their connections to on-chip and off-chip memory

blocks.

The program counter (PC) is connected to the 32-bit program address bus

(PADDR). The instruction register (IR) is connected to the 32-bit program data

bus (PDATA). In this configuration, the buses can fetch a single instruction

word every machine cycle.

The 32-bit data address buses (DADDR1 and DADDR2) and the 32-bit data

data bus (DDATA) support two data memory accesses every machine cycle.

The DDATA bus carries data to the CPU over the CPU1 and CPU2 buses. The

CPU1 and CPU2 buses can carry two data memory operands to the multiplier,

ALU, and register file every machine cycle. Also internal to the CPU are register

buses REG1 and REG2, which can carry two data values from the register

file to the multiplier and ALU every machine cycle. Figure 2–2 shows the buses

that are internal to the CPU section of the processor.

The DMA controller is supported with a 32-bit address bus (DMAADDR) and

a 32-bit data bus (DMADATA). These buses allow the DMA to perform memory

accesses in parallel with the memory accesses occurring from the data and

program buses.

External Bus Operation

The ’C4x provides two identical external interfaces: the global memory interface

and the local memory interface. Each consists of a 32-bit data bus, a

31-bit (’C40) or 24-bit (’C44) address bus, and two sets of control signals. Both

buses can be used to address external program/data memory or I/O space.

The buses also have external RDY signals for wait-state generation with wait


states inserted under software control. Chapter 9, External Bus Operation,

covers external bus operation.

For multiple processors to access global memory and share data in a coherent

manner, arbitration is necessary. This arbitration (handshaking) is the purpose

of the ’C4x’s interlocked operations, handled through interlocked instructions.

Interrupts

The ’C4x supports four external interrupts (IIOF3–0), a number of internal interrupts,

a nonmaskable external NMI interrupt, and a nonmaskable external

RESET signal, which sets the processor to a known state. The DMA and communication

ports have their own internal interrupts. When the CPU responds

to the interrupt, the IACK pin can be used to signal an external interrupt acknowledge.

Peripherals

All ’C4x on-chip peripherals are controlled through memory-mapped registers

on a dedicated peripheral bus. This peripheral bus is composed of a 32-bit data

bus and a 32-bit address bus. This peripheral bus permits straightforward

communication to the peripherals. The ’C4x peripherals include two timers

and six (’C40) or four (’C44) communication ports.

Communication Ports

Six (’C40) or four (’C44) high-speed communication ports provide rapid processor-

to-processor communication through each port’s dedicated communication

interfaces. Coupled with the ’C4x’s two memory interfaces (global and

local), this allows you to construct a parallel processor system that attains optimum

system performance by distributing tasks among several processors.

Each ’C4x can pass the results of its work to another ’C4x through a communication

port, enabling each ’C4x to continue working. Chapter 12, Communication

Ports, explains communication port operation in detail.

The communication ports offer several features:

_ 160-megabits/s (20-Mbytes or 5-Mwords per second) bidirectional data

transfer operations (at 40-ns cycle time)

_ Simple processor-to-processor communication via eight data lines and

four control lines


_ Buffering of all data transfers, both input and output

_ Automatic arbitration to ensure communication synchronization

_ Synchronization between the CPU or the direct-memory access (DMA)

coprocessor and the six communication ports via internal interrupts and

internal ready signals.

_ Port direction pin (CDIR) to ease interfacing (’C44 only)
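
The three throughput figures quoted in the feature list are the same rate in different units, as this short conversion shows. The function names are illustrative, and the transfer-time estimate ignores arbitration and buffering overhead.

```python
PORT_MBITS_PER_S = 160    # per-port rate quoted in the text (at 40-ns cycle time)
BITS_PER_BYTE = 8
BYTES_PER_WORD = 4        # 32-bit words

def port_rates(mbits_per_s=PORT_MBITS_PER_S):
    """Return (Mbytes/s, Mwords/s) for a given rate in Mbits/s."""
    mbytes = mbits_per_s / BITS_PER_BYTE
    mwords = mbytes / BYTES_PER_WORD
    return mbytes, mwords

def transfer_time_us(n_words, mbits_per_s=PORT_MBITS_PER_S):
    """Ideal time in microseconds to move n_words over one port."""
    _, mwords = port_rates(mbits_per_s)
    return n_words / mwords   # Mwords/s equals words per microsecond

print(port_rates())
print(transfer_time_us(1024))
```

At this rate a 1K-word block takes a little over 200 microseconds on one port, which gives a feel for how much computation a processor must have available to overlap with a transfer.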

Direct Memory Access (DMA) Coprocessor

The six channels of the on-chip DMA coprocessor can read from or write to any

location in the memory map without interfering with the operation of the CPU.

This allows interfacing to slow external memories and peripherals without reducing

throughput to the CPU. The DMA coprocessor contains its own address

generators, source and destination registers, and transfer counter. Dedicated

DMA address and data buses allow for minimization of conflicts between

the CPU and the DMA coprocessor. A DMA operation consists of a

block or single-word transfer to or from memory. A key feature of the DMA

coprocessor is its ability to automatically reinitialize each channel following a

data transfer.

Timers

The two timer modules are general-purpose 32-bit timer/event counters with

two signaling modes and internal or external clocking. They can signal internally

to the ’C4x or externally to the outside world at specified intervals, or they

can count external events. Each timer has an I/O pin that can be used as an

input clock to the timer, as an output signal driven by the timer, or as a general-purpose

I/O pin.