
DAAD Program „Akademischer Neuaufbau Südosteuropa“

Embedded System Design Course

SOC Design: From System to Transistor

Zoran Stamenković

DAAD Program „Akademischer Neuaufbau Südosteuropa“, Embedded System Design Course at University of Nis, Faculty of Electronic Engineering, 29.06.-03.07.2009

Outline

• Modeling Systems

• Simulation and Verification

• Analog Integrated Circuits

• Digital Integrated Circuits

• Embedded Memories

• Logic Synthesis

• Design for Testability

• Layout Generation

• Design for Manufacturability

• SOC Example


• Domains and Levels

• ESL Design

• Basics of HDL

• Gate Modeling

• Delay Modeling

• Power Modeling

• Effects of Parasitics

• Logic Optimization

Modeling Systems


Domains and Levels

• Open Systems Interconnection (OSI) model of network communication

Local area network (LAN) technologies are defined by standards that describe unique functions at both the Physical and the Data Link layers


• 802.11 Wireless LAN modem

Modulates outgoing digital signals from a computer or other digital device to an analogue (radio) signal

Demodulates the incoming analogue (radio) signal and converts it to a digital signal for the digital device

Domains and Levels


MIMO and MIMAX WLAN Modems

Signal processing performed in the analogue RF domain

Number of the digital basebands reduced to a single one

Signal processing performed in the digital baseband

Domains and Levels


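[Figure: the three design domains (behavioral, structural, physical). Transitions between domains: analysis, synthesis, generation, extraction. Movement between levels: refinement, abstraction.]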

Domains and Levels



[Figure: levels of the behavioral domain: algorithm (behavioral), register-transfer language, Boolean equations, differential equations]

Behavioral Domain



[Figure: levels of the structural domain: processor-memory system, registers, gates, transistors]

Structural Domain



[Figure: levels of the physical domain: polygons, sticks, standard cells, floor plan]

Physical Domain


• The point of a system level model is to capture the intent of the design

Design does exactly what it is defined to do, and the model is the definition of what the design does

It allows software developers to test their code on a working model

• The value of system level modeling is in helping us to understand the implications of our intent

To explore responses to the stimulus in a useful way

• ESL Languages

UML, SystemC, SystemVerilog

• ESL Verification

“No amount of experimentation can ever prove me right; a single experiment can prove me wrong” – Albert Einstein

The system level testbench languages and methodologies that exist today are woefully inadequate

If one tries to capture enough information in ESL to verify RTL, then one might as well write RTL

Electronic System Level Design


• The environment that provides models of memories, connectors, and queues that can be interconnected with configured processors into an overall system model

Processor and device interfaces are at the transaction level

Transaction-level modeling calls for SOC architecture assembly and simulation tools

If RTL IP blocks are present, HW/SW co-verification tools are needed


[Figure: Xtensa Processor Generator flow]

Processor Configuration: 1. Select from menu; 2. Add instruction description (TIE); 3. Automatic instruction discovery (XPRES)

ANSI C/C++ Code, XPRES Compiler

int main( ){ int i; short c[100]; for (i=0;i<N/2;i++) {

Processor Extensions

Complete Hardware Design: source pre-verified RTL, EDA scripts, test suite

Customized Software Tools: C/C++ compiler, debuggers, simulators, RTOSes

Use standard ASIC/COT design techniques and libraries for any IC fabrication process



Auto-generated XTMP model based on memory maps

• Specify chip-level memory maps for shared/private memories

• Place interrupt and reset vectors

• Assign code/data to distributed memories


• Motivation for HDL

Increased hardware complexity

Design space exploration

Inexpensive alternative to prototyping

• General features

Support for describing circuit connectivity

High-level programming language support for describing behavior

Support for timing information (constraints, etc.)

Support for concurrency

• VHDL

IEEE Standard 1076-1987


Extension VHDL-AMS-1999

• Verilog



Hardware Description Languages


• Entity (VHDL) or Module (Verilog) declaration

Describes the input/output ports of a module

VHDL

entity reg3 is
  port ( d0, d1, d2, en, clk : in bit;
         q0, q1, q2 : out bit );
end;

Annotated elements: name, port names, port mode (direction), port type (VHDL only), reserved words, punctuation

Verilog

module reg3 ( d0, d1, d2, en, clk, q0, q1, q2 );
  input d0, d1, d2, en, clk;
  output q0, q1, q2;
endmodule

Modeling Interfaces


• Architecture Body (VHDL)

Describes an implementation of an entity

May be several per entity

• Module (Verilog)

Is unique

• Behavioral Architecture

Describes the algorithm performed by the module

Contains

Procedural Statements, each containing

Sequential Statements, including

Assignment Statements and

Wait Statements

Modeling Behavior



VHDL

entity reg3 is
  port ( d0, d1, d2, en, clk : in bit;
         q0, q1, q2 : out bit );
end;

architecture behav of reg3 is
begin
  -- Level-sensitive storage: outputs follow inputs while en and clk are both '1'
  process ( d0, d1, d2, en, clk )
  begin
    if en = '1' and clk = '1' then
      q0 <= d0 after 5 ns;
      q1 <= d1 after 5 ns;
      q2 <= d2 after 5 ns;
    end if;
  end process;
end;

Verilog

`timescale 1ns/10ps
module reg3 ( d0, d1, d2, en, clk, q0, q1, q2 );
  input d0, d1, d2, en, clk;
  output q0, q1, q2;
  reg q0, q1, q2;

  // Same behavior as the VHDL process: update after 5 time units while en and clk are high
  always @ ( d0 or d1 or d2 or en or clk )
    if ( en & clk ) begin
      q0 <= #5 d0;
      q1 <= #5 d1;
      q2 <= #5 d2;
    end
endmodule

Behavior Example


• Structural Architecture

Implements the module as a composition of components

Contains

Signal Declarations (entity ports are also signals)

Declare internal connections

Component Instances

Instantiate previously declared entity/architecture pairs

Port Maps in component instances

Connect signals to component ports

Wait Statements

Suspend a process or procedure

Modeling Structure


Structure Example

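[Figure: structural schematic of reg3. Three d_latch instances (bit0, bit1, bit2), each with ports d, clk, q, connect inputs d0, d1, d2 to outputs q0, q1, q2. An and2 instance (gate) with ports a, b, y combines en and clk into the internal signal int_clk.]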


VHDL

entity d_latch is
  port ( d, clk : in bit; q : out bit );
end;

architecture basic of d_latch is
begin
  process ( d, clk )
  begin
    if clk = '1' then
      q <= d after 5 ns;
    end if;
  end process;
end;

VHDL

entity and2 is
  port ( a, b : in bit; y : out bit );
end;

architecture basic of and2 is
begin
  process ( a, b )
  begin
    y <= a and b after 5 ns;
  end process;
end;

Verilog

`timescale 1ns/10ps
module d_latch ( d, clk, q );
  input d, clk;
  output q;
  reg q;

  always @ ( d or clk )
    if ( clk ) begin
      q <= #5 d;
    end
endmodule

Verilog

`timescale 1ns/10ps
module and2 ( a, b, y );
  input a, b;
  output y;
  reg y;

  always @ ( a or b ) begin
    y <= #5 ( a & b );
  end
endmodule

Structure Example



VHDL

entity reg3 is
  port ( d0, d1, d2, en, clk : in bit;
         q0, q1, q2 : out bit );
end;

architecture struct of reg3 is

  component d_latch
    port ( d, clk : in bit; q : out bit );
  end component;

  component and2
    port ( a, b : in bit; y : out bit );
  end component;

  signal int_clk : bit;

begin
  bit0 : d_latch port map ( d0, int_clk, q0 );
  bit1 : d_latch port map ( d1, int_clk, q1 );
  bit2 : d_latch port map ( d2, int_clk, q2 );
  gate : and2 port map ( en, clk, int_clk );
end;

Verilog

module reg3 ( d0, d1, d2, en, clk, q0, q1, q2 );
  input d0, d1, d2, en, clk;
  output q0, q1, q2;

  wire int_clk;
  d_latch bit0 ( d0, int_clk, q0 );
  d_latch bit1 ( d1, int_clk, q1 );
  d_latch bit2 ( d2, int_clk, q2 );
  and2 gate ( en, clk, int_clk );
endmodule

Structure Example


• An architecture can contain both behavioral and structural parts

Process Statements and Component Instances

Collectively called Concurrent Statements

Processes can read and assign to signals

• Example: Register-transfer-language model

Data-path described structurally

Control section described behaviorally

Mixing Behavior and Structure


[Figure: block diagram of the multiplier with blocks shift_reg, reg, shift_adder, and control_section; inputs multiplier and multiplicand; output product]

Mixed Example


entity multiplier is
  port ( clk, reset : in bit;
         multiplicand, multiplier : in integer;
         product : out integer );
end;

architecture mixed of multiplier is

  signal partial_product, full_product : integer;
  signal arith_control, result_en, mult_bit, mult_load : bit;

begin

  -- Data path: described structurally
  arith_unit : entity work.shift_adder(behavior)
    port map ( addend => multiplicand, augend => full_product,
               sum => partial_product,
               add_control => arith_control );

  result : entity work.reg(behavior)
    port map ( d => partial_product, q => full_product,
               en => result_en, reset => reset );

  multiplier_sr : entity work.shift_reg(behavior)
    port map ( d => multiplier, q => mult_bit,
               load => mult_load, clk => clk );

  product <= full_product;

  -- Control section: described behaviorally
  control_section : process is
    -- variable declarations for control_section
    -- …
  begin
    -- sequential statements to assign values to control signals
    -- …
    wait on clk, reset;
  end process control_section;

end;

Mixed Example


• Function

f = a’b + ab’: a is a variable, a and a’ are literals, ab’ is a term

• Irredundant Function

No literal can be removed without changing its value

• Implementing logic functions is non-trivial

The library does not contain a gate for every logic expression

A logic expression may map into gates that consume a lot of area, time, or power

• A set of functions f1, f2, ... is complete if every Boolean function can be generated by a combination of the functions from the set

NAND is a complete set

NOR is a complete set

AND and OR are not complete

Transmission gates are not complete

• Incomplete set of logic gates

No way to design arbitrary logic

Logic Functions
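The completeness claim for NAND on the slide above can be checked by construction. The following is a minimal sketch in Python (the helper names `not_`, `and_`, `or_`, and `xor_` are illustrative, not from the slides): it builds NOT, AND, and OR from NAND only, then builds the slide's function f = a’b + ab’ from those and compares all results against the truth tables.

```python
from itertools import product

def nand(a: int, b: int) -> int:
    """Two-input NAND on 0/1 values."""
    return 1 - (a & b)

# Every other operator below is built from nand() only.
def not_(a: int) -> int:
    return nand(a, a)

def and_(a: int, b: int) -> int:
    return not_(nand(a, b))

def or_(a: int, b: int) -> int:
    return nand(not_(a), not_(b))

def xor_(a: int, b: int) -> int:
    # f = a'b + ab' from the slide, expressed with NAND-derived gates
    return or_(and_(not_(a), b), and_(a, not_(b)))

nand_only_ok = all(
    not_(a) == 1 - a
    and and_(a, b) == (a & b)
    and or_(a, b) == (a | b)
    and xor_(a, b) == (a ^ b)
    for a, b in product((0, 1), repeat=2)
)
print("NAND-only constructions match truth tables:", nand_only_ok)
```

Since NOT, AND, and OR suffice to write any sum-of-products expression, this is the usual argument for why a NAND-only library can implement arbitrary logic.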


[Figure: inverter schematic with input a and output out]

Inverter


Inverter

[Figure: CMOS inverter in three views (a), (b), (c). Labels: P-type substrate (bulk), N well, SiO2, metal (Al), N+ source, N+ drain, P+ source, P+ drain, active areas, polysilicon, P+ (N+) doping, contacts, metal. Terminals: INPUT, OUTPUT, V+, GND.]


• Complementary switch produces full-supply voltages for both logic 0 and logic 1

n-type transistor conducts logic 0

p-type transistor conducts logic 1

Switches


[Figure: NAND gate schematic and layout with inputs a and b, output out, VDD, GND, and tub ties]

NAND Gate


[Figure: NOR gate schematic and layout with inputs a and b, output out, VDD, GND, and tub ties]

NOR Gate


• AOI = and/or/invert

• OAI = or/and/invert

• Implement larger functions

• Pull-up and pull-down networks are compact

Smaller area, higher speed than NAND/NOR network equivalents

• AOI312

And 3 inputs

And 1 input (dummy)

And 2 inputs

Or together these terms

Invert

AOI/OAI Gates

[Figure: AOI gate drawn as and, or, and invert stages]

out = [ab+c]’


• Solid logic 0/1 defined by VSS/VDD

• Inner bounds of logic values VL/VH are not directly determined by circuit properties, as in some other logic families

[Figure: voltage axis from VSS to VDD with thresholds VL and VH, divided into logic 0, unknown, and logic 1 regions]

• Levels at output of one gate must be sufficient to drive next gate

Logic Levels


• Choose threshold voltages at points where slope of transfer curve is -1

• Inverter has

High gain between VIL and VIH points

Low gain at outer regions of transfer curve

• Note that logic 0 and 1 regions are not equally sized

In this case, high pull-up resistance leads to smaller logic 1 range

• Noise margins are VDD-VIH and VIL-VSS

Noise must exceed noise margin to make second gate produce wrong output

Inverter Transfer Curve


• Resistor model of transistor

Ignores saturation region

Mischaracterizes linear region

Gives acceptable results

Inverter Delay

• Only one transistor is on at a time

Rise time (pull-up on)

Fall time (pull-up off)


• Delay

Time required for gate’s output to reach 50% of final value

• Transition time

Time required for gate’s output to reach 10% (logic 0) or 90% (logic 1) of final value

• Gate delay based on RC time constant

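$V_{out}(t) = V_{DD} \exp\{-t/[(R_n + R_L) C_L]\}$

$t_d = 0.69\, R_n C_L$

$t_f = 2.3\, R_n C_L$

• 0.5 µm process

$R_n = 3.9\ \mathrm{k\Omega}$

$C_L = 0.68\ \mathrm{fF}$

$t_d = 0.69 \times 3.9 \times 10^{3} \times 0.68 \times 10^{-15} = 1.8\ \mathrm{ps}$

$t_f = 2.3 \times 3.9 \times 10^{3} \times 0.68 \times 10^{-15} = 6.1\ \mathrm{ps}$

• For pull-up time, use pull-up resistance

• Current source model (in power/delay studies)

$t_f = C_L (V_{DD} - V_{SS}) / [0.5\, k' (W/L) (V_{DD} - V_{SS} - V_t)^2]$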

• Fitted model

Fit curve to measured circuit characteristics

RC Model for Delay
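The 0.69 and 2.3 factors in the RC model above are the values of -ln(0.5) and -ln(0.1) for an exponential discharge. The sketch below recomputes the slide's 0.5 µm example in Python; the function name `rc_time_to_fraction` is illustrative, and the wire resistance RL is assumed to be zero, as in the slide's simplified formulas.

```python
import math

def rc_time_to_fraction(r_ohm: float, c_farad: float, fraction: float) -> float:
    """Time for V(t) = VDD * exp(-t / (R*C)) to fall to `fraction` of VDD."""
    return -r_ohm * c_farad * math.log(fraction)

R_N = 3.9e3      # pull-down resistance from the slide, in ohms
C_L = 0.68e-15   # load capacitance from the slide, in farads

t_d = rc_time_to_fraction(R_N, C_L, 0.5)   # delay: output reaches 50 %
t_f = rc_time_to_fraction(R_N, C_L, 0.1)   # fall time: output reaches 10 %

print(f"t_d = {t_d * 1e12:.2f} ps, t_f = {t_f * 1e12:.2f} ps")
```

The results agree with the slide's rounded figures of about 1.8 ps and 6.1 ps.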


Step Input (VGS = VDD) Approximation


• Source voltage of gates in middle of network may not equal substrate voltage

Difference between source and substrate voltages causes body effect

[Figure: series transistor network with an internal source node above VSS; the early arriving signal is marked]

• To minimize body effect

Put early arriving signals at transistors closest to power supply

Body Effect


• Clock frequency

f = 1/t

• Energy

$E = C_L (V_{DD} - V_{SS})^2$

• Power

$P = E \times f = f\, C_L (V_{DD} - V_{SS})^2$

• Almost all power consumption comes from switching behavior

A single cycle requires one charge and one discharge of capacitor

• Static power dissipation

Comes from leakage currents

• Surprising result

Resistance of the pull-up/pull-down transistor drops out of energy calculation

Power consumption is independent of the sizes of the pull-up and pull-down transistors

• Static CMOS power-delay product is independent of frequency

Voltage scaling depends on this fact

Power Consumption
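A short numerical sketch of the switching-power relation above, in Python. The load capacitance is reused from the RC delay example; the 3.3 V supply and 100 MHz clock are assumed values chosen only for illustration, and the function names are hypothetical.

```python
def switching_energy(c_load: float, v_dd: float, v_ss: float = 0.0) -> float:
    """Energy of one charge/discharge cycle: E = C_L * (VDD - VSS)^2."""
    return c_load * (v_dd - v_ss) ** 2

def dynamic_power(c_load: float, v_dd: float, freq: float, v_ss: float = 0.0) -> float:
    """Switching power: P = f * C_L * (VDD - VSS)^2."""
    return freq * switching_energy(c_load, v_dd, v_ss)

C_L = 0.68e-15   # farads, from the RC delay example
V_DD = 3.3       # volts, assumed for illustration
FREQ = 100e6     # hertz, assumed for illustration

energy = switching_energy(C_L, V_DD)
power = dynamic_power(C_L, V_DD, FREQ)
power_half_vdd = dynamic_power(C_L, V_DD / 2, FREQ)

print(f"E = {energy:.3e} J per cycle, P = {power:.3e} W")
print(f"P at half the supply voltage = {power_half_vdd:.3e} W")
```

No transistor resistance appears in either function, which mirrors the slide's point that transistor sizing drops out of the energy calculation. Halving the supply voltage cuts the switching power to a quarter, which is why voltage scaling is attractive.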


• Capacitance on power supply is not bad

Can be good in absence of inductance

• Resistance slows down static gates

May cause pseudo-nMOS circuits to fail

• Increasing capacitance/resistance

Reduces input slope

• Resistance near source is more damaging

It must charge more capacitance

Effects of Parasitics


• Sometimes, large loads must be driven

Off-chip or by long wires on-chip

• Sizing up the driver transistors only pushes back the problem

Driver now presents larger capacitance to earlier stage

• Use a chain of inverters

Each stage has transistors larger than previous stage

$\alpha$ is the driver size ratio, $C_{big}/C_d = \alpha^n$, $\ln(C_{big}/C_d) = n \ln \alpha$

• Minimize total delay through the driver chain

$t_{tot} = \ln(C_{big}/C_d)\,(\alpha/\ln \alpha)\,t_d$

• Optimal driver size ratio is $\alpha_{opt} = e$

• Optimal number of stages is $n_{opt} = \ln(C_{big}/C_d)$

Optimal Sizing
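The sizing rule above can be explored numerically. This Python sketch evaluates the total-delay expression for several driver size ratios (written `alpha` here) and picks an integer stage count near ln(Cbig/Cd). The load ratio of 1000 and the rounding to the nearest whole stage are assumptions made for illustration, and the function names are hypothetical.

```python
import math

def chain_delay(c_ratio: float, alpha: float, t_d: float = 1.0) -> float:
    """Total delay t_tot = ln(Cbig/Cd) * (alpha / ln(alpha)) * t_d."""
    return math.log(c_ratio) * (alpha / math.log(alpha)) * t_d

def practical_chain(c_ratio: float) -> tuple[int, float]:
    """Round the ideal stage count to an integer and return (n, alpha) with alpha**n = c_ratio."""
    n = max(1, round(math.log(c_ratio)))
    alpha = c_ratio ** (1.0 / n)
    return n, alpha

C_RATIO = 1000.0                      # assumed Cbig/Cd, illustration only
stages, ratio = practical_chain(C_RATIO)

for alpha in (2.0, math.e, 4.0):
    print(f"alpha = {alpha:.3f}: t_tot = {chain_delay(C_RATIO, alpha):.2f} t_d")
print(f"integer stage count = {stages}, per-stage ratio = {ratio:.3f}")
```

The delay curve is quite flat around alpha = e, so a somewhat larger ratio with fewer stages costs little delay.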


Driving Large Fan-Out

• Fan-out adds capacitance

• Increase sizes of driver transistors

Must take into account rules for driving large loads

• Add intermediate buffers

This may require/allow restructuring of the logic


• Network delay is measured over paths through network

Can trace a causality chain from inputs to worst-case output

• Critical path creates longest delay

Can trace transitions which cause delays that are elements of the critical path delay

• To reduce circuit delay, speed up the critical path

Reducing delay off the path doesn’t help

• There may be more than one path of the same delay

Must speed up all equivalent paths to speed up circuit

Path Delay
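Finding the critical path described above amounts to a longest-path search through the gate network. The following Python sketch assumes a hypothetical four-gate network with made-up delays; `critical_path`, the gate names, and the delay values are all illustrative. False paths are not considered, so the result is the topological worst case.

```python
def critical_path(fanin: dict, delay: dict) -> tuple[float, list]:
    """Return (worst arrival time, path) for a combinational network.

    fanin maps each gate to the nodes driving it; nodes that never appear
    as keys are primary inputs with arrival time 0.
    """
    arrival: dict = {}
    latest_pred: dict = {}

    def visit(node):
        if node in arrival:
            return arrival[node]
        latest, best = 0.0, None
        for pred in fanin.get(node, []):
            t = visit(pred)
            if best is None or t > latest:
                latest, best = t, pred
        arrival[node] = latest + delay.get(node, 0.0)
        latest_pred[node] = best
        return arrival[node]

    for node in fanin:
        visit(node)

    end = max(arrival, key=arrival.get)
    path = [end]
    while latest_pred[path[-1]] is not None:
        path.append(latest_pred[path[-1]])
    return arrival[end], path[::-1]

# Hypothetical network: g4 is the output; delays in arbitrary time units.
FANIN = {"g1": ["a", "b"], "g2": ["b", "c"], "g3": ["g1", "g2"], "g4": ["g3", "c"]}
DELAY = {"g1": 2.0, "g2": 3.0, "g3": 1.0, "g4": 2.0}

worst_delay, worst_path = critical_path(FANIN, DELAY)
print(worst_delay, worst_path)
```

Speeding up g1 in this network would not help, because g1 is off the critical path; only changes to g2, g3, or g4 reduce the worst-case delay.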


• Logic gates are not simple nodes

Some input changes don’t cause output changes

• A false path is a path which cannot be exercised due to Boolean gate conditions

False paths cause pessimistic delay estimates

False Paths


• Rewrite by using sub-expressions

Logic rewrites may affect gate placement

• Flattening logic

Increases gate fan-in

• Logic synthesis programs

Transform Boolean expressions into logic gate networks in a particular library

[Figure: deep logic vs. shallow logic]

Logic Transformations


• Optimization goals

Minimize area, meet delay constraint

• Technology-independent optimization

Works on Boolean expression equivalent

Estimates size based on number of literals

Uses factorization, resubstitution, minimization, etc.

Uses simple delay models

• Technology-dependent optimization

Maps Boolean expressions into a particular cell library

May perform some optimizations in addition to simple mapping

Allows more accurate delay models

Logic Optimization


• Simulation

• Verification

• Annotation

Simulation and Verification


• Simulation

Tests the functionality of a design’s elaborated model

Needs a test bench and a simulation tool

Advances in discrete time steps

• Test Bench

Includes an instance of the design under test

Applies sequences of test values to inputs

Monitors signal values on outputs using simulator

• Simulation Tools

NCSIM (Cadence)

VSIM (Mentor Graphics)

VCS (Synopsys)

Simulation


• Event-driven simulation is designed for digital circuit characteristics

Small number of signal values

Relatively sparse activity over time

• Event-driven simulators try to update only those signals which change in order to reduce CPU time requirements

An event is a change in a signal value

A time-wheel is a queue of events

• Simulator traces structure of circuit to determine causality of events

Event at input of one gate may cause new event at gate’s output

Event-Driven Simulation
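The event queue and causality tracing described above can be sketched in a few lines of Python. This is a toy simulator under simplifying assumptions (two-valued signals, fixed transport delays, a heap standing in for the time wheel); the function `simulate` and the signal name `int_clk_n` are illustrative, while `en`, `clk`, `int_clk`, and the 5 ns AND delay echo the reg3 example.

```python
import heapq

def simulate(gates: dict, initial: dict, stimuli: list, t_end: int) -> list:
    """Event-driven simulation.

    gates maps an output signal to (function, input names, delay).
    stimuli is a list of (time, signal, value) events on primary inputs.
    Returns the list of events that actually changed a signal value.
    """
    values = dict(initial)
    fanout: dict = {}
    for out, (_, inputs, _) in gates.items():
        for name in inputs:
            fanout.setdefault(name, []).append(out)

    queue = list(stimuli)
    heapq.heapify(queue)                 # ordered by time
    trace = []
    while queue:
        time, signal, value = heapq.heappop(queue)
        if time > t_end:
            break
        if values[signal] == value:
            continue                     # no change, so no event propagates
        values[signal] = value
        trace.append((time, signal, value))
        # Only gates driven by the changed signal are re-evaluated.
        for out in fanout.get(signal, []):
            func, inputs, gate_delay = gates[out]
            new_value = func(*(values[name] for name in inputs))
            heapq.heappush(queue, (time + gate_delay, out, new_value))
    return trace

GATES = {
    "int_clk": (lambda a, b: a & b, ("en", "clk"), 5),    # and2 with 5 ns delay
    "int_clk_n": (lambda a: 1 - a, ("int_clk",), 2),      # hypothetical inverter
}
INITIAL = {"en": 0, "clk": 0, "int_clk": 0, "int_clk_n": 1}
STIMULI = [(20, "en", 1), (40, "clk", 1), (60, "clk", 0)]

trace = simulate(GATES, INITIAL, STIMULI, t_end=100)
for event in trace:
    print(event)
```

Raising `en` at 20 ns produces no downstream event because the AND output does not change, which illustrates how sparse activity keeps the work per time step small.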


• Special type of event-driven simulation optimized for MOS transistors

Treats the transistor as a switch

Takes capacitance into account to model charge sharing

• Can also be enhanced to model the transistor as a resistive switch

[Figure: switch network with nodes A, B, C, D]

Switch Simulation


entity test_bench is
end;

architecture test_reg3 of test_bench is

  signal d0, d1, d2, en, clk, q0, q1, q2 : bit;

begin

  -- Design under test
  dut : entity work.reg3(behav)
    port map ( d0, d1, d2, en, clk, q0, q1, q2 );

  -- Applies a sequence of test values to the inputs
  stimulus : process is
  begin
    d0 <= '1'; d1 <= '1'; d2 <= '1'; wait for 20 ns;
    en <= '0'; clk <= '0'; wait for 20 ns;
    en <= '1'; wait for 20 ns;
    clk <= '1'; wait for 20 ns;
    d0 <= '0'; d1 <= '0'; d2 <= '0'; wait for 20 ns;
    …
    wait;
  end process stimulus;

end;

Test Bench Example


• To test a refinement of a design

Low-level structural model must be functionally the same as a corresponding behavioral model

• To include two instances of a design in the test bench

To stimulate both with same test values on inputs

To compare values of outputs for equality

• To take account of timing differences

Zero delay

Unit delay

Gate delay

RC delay

Verification


architecture regression of test_bench is

  signal d0, d1, d2, d3, en, clk : bit;
  signal q0a, q1a, q2a, q3a, q0b, q1b, q2b, q3b : bit;

begin

  -- Two implementations of the same design receive identical stimuli
  dut_a : entity work.reg4(struct)
    port map ( d0, d1, d2, d3, en, clk, q0a, q1a, q2a, q3a );

  dut_b : entity work.reg4(behav)
    port map ( d0, d1, d2, d3, en, clk, q0b, q1b, q2b, q3b );

  stimulus : process is
  begin
    d0 <= '1'; d1 <= '1'; d2 <= '1'; d3 <= '1'; wait for 20 ns;
    en <= '0'; clk <= '0'; wait for 20 ns;
    en <= '1'; wait for 20 ns;
    clk <= '1'; wait for 20 ns;
    …
    wait;
  end process stimulus;

  -- Compares the outputs 10 ns after each input change to allow for timing differences
  verify : process is
  begin
    wait for 10 ns;
    assert q0a = q0b and q1a = q1b and q2a = q2b and q3a = q3b
      report "implementations have different outputs"
      severity error;
    wait on d0, d1, d2, d3, en, clk;
  end process verify;

end architecture regression;

Verification Example


• Standard Delay Format (SDF) annotation

Design timing is stored in an SDF file

Used to iteratively improve design

• Updates a more-abstract design with information from later design stages

Annotation of logic schematic with extracted parasitic resistances and capacitances

• Back annotation requires tools to know more about each other

Simulation tools

Synthesis tools

Layout tools

Annotation


(DELAYFILE

(SDFVERSION "OVI 1.0")

(DESIGN "tcp_1_chip")

(DATE "Fri Apr 30 09:48:22 2004")

(VENDOR "cdr3synPwcslV225T125")

(PROGRAM "Synopsys Design Compiler cmos")

(VERSION "2003.06")

(DIVIDER /)

(VOLTAGE 2.25:2.25:2.25)

(PROCESS)

(TEMPERATURE 125.00:125.00:125.00)

(TIMESCALE 1ns)

(CELL

(CELLTYPE "tcp_1_chip")

(INSTANCE)

(DELAY

(ABSOLUTE

(INTERCONNECT U5/x U81/a (0.000:0.000:0.000))

(INTERCONNECT U73/x U74/a (0.000:0.000:0.000))

...

)

)

)

(CELL

(CELLTYPE "exnor2_1")

(INSTANCE i_aes_wr/U_ALG/U6533)

(DELAY

(ABSOLUTE

(IOPATH a x (0.662:1.045:1.045) (0.682:1.076:1.076))

(IOPATH b x (1.379:1.416:1.416) (1.454:1.492:1.492))

)

)

)

...

(CELL

(CELLTYPE "mux2_2")

(INSTANCE i_mips/u0/ejt_tap\/pa_addr_reg_next\/bit_00i/U1)

(DELAY

(ABSOLUTE

(IOPATH d0 x (0.395:0.395:0.395) (0.464:0.464:0.464))

(IOPATH d1 x (0.387:0.403:0.403) (0.447:0.477:0.477))

(IOPATH sl x (1.768:1.781:1.781) (1.879:1.892:1.892))

)

)

)

)

Standard Delay Format


• Filters

• Amplifiers

• Phase-Locked Loop

• Voltage-Controlled Oscillator

• Modulator/Demodulator

Analog Integrated Circuits


Fairchild Semiconductor μA741 Op-Amp

• In 1963, a 26-year-old engineer named Robert Widlar designed the first monolithic op-amp IC, the μA702

• Price at the beginning was $300

• Fairchild and competitors have sold it in the hundreds of millions

• Now, for $300 you can get about a thousand of today’s 741 chips


Signetics NE555 Timer

• A simple IC from 1971 that could function as a timer or an oscillator

• It would become a best seller in analog semiconductors

Kitchen appliances

Toys

Spacecraft

A few thousand other things

• Many billions have been sold


Intersil ICL8038 Waveform Generator

• A generator of sine, square, triangular, sawtooth, and pulse waveforms from 1983

• Countless applications

Music synthesizers

“Blue boxes”

• Hundreds of millions sold

• Intersil discontinued the production in 2002


LNA in BiCMOS Technology


PLL for 802.11a WLAN


Oscillator


Modulator


• Adders

• Multipliers

• Shifters

• Carry Units

• Arithmetic-Logic Units

Digital Integrated Circuits


• Computes one-bit sum and carry

$s_i = a_i \oplus b_i \oplus c_{in}$

$c_{out} = a_i b_i + a_i c_{in} + b_i c_{in}$

• Ripple-carry adder: n-bit adder built from full adders

• Delay of ripple-carry adder goes through all carry bits

Full Adder
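The full adder and the ripple-carry structure above can be modeled bit by bit. This is a minimal Python sketch, assuming the sum is the XOR of the three inputs and the carry is their majority, as is standard for a full adder; the function names are illustrative.

```python
def full_adder(a: int, b: int, c_in: int) -> tuple[int, int]:
    """One-bit full adder: returns (sum, carry out)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (a & c_in) | (b & c_in)
    return s, c_out

def ripple_carry_add(x: int, y: int, n_bits: int) -> tuple[int, int]:
    """n-bit adder built from full adders; the carry ripples from bit 0 upward."""
    carry = 0
    total = 0
    for i in range(n_bits):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry

result, carry_out = ripple_carry_add(6, 9, 4)
print(result, carry_out)
```

Each loop iteration needs the carry from the previous one, which is the software picture of why the delay of a ripple-carry adder passes through all carry bits.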


          0 1 1 0    multiplicand
        x 1 0 0 1    multiplier

          0 1 1 0
      + 0 0 0 0
        0 0 1 1 0
    + 0 0 0 0
      0 0 0 1 1 0
  + 0 1 1 0
    0 1 1 0 1 1 0

Rows without a plus sign: partial product

Combinational Multiplier


• Array multiplier is an efficient layout of a combinational multiplier

• Array multipliers may be pipelined to decrease clock period at the expense of latency

[Figure: array multiplier built from product terms such as x0y0, x1y0, x2y0, x0y1, x1y1, x0y2, x1y2 and adder cells (+), producing product bits P0 through P(2n-1)]

Array Multiplier


• Reduces depth of adder chain

• Built from carry-save adders

Three inputs a, b, c

Produces two outputs y, z

y + z = a + b + c (with the carry word z weighted one bit position higher)

• Carry-save equations

yi = parity (ai,bi,ci)

zi = majority (ai,bi,ci)

• At each stage, i numbers are combined to form ⌈2i/3⌉ sums

• Final adder completes the summation

• Wiring is more complex

Wallace Tree
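The carry-save step of the Wallace tree above can be sketched in Python on whole integers, using bitwise parity and majority. The function names are illustrative, and the grouping of operands three at a time in list order is an assumption; a real tree chooses groupings with wiring in mind.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """Carry-save adder: bitwise parity and majority, with no carry propagation."""
    y = a ^ b ^ c                        # yi = parity(ai, bi, ci)
    z = (a & b) | (a & c) | (b & c)      # zi = majority(ai, bi, ci)
    return y, z

def wallace_sum(operands: list) -> int:
    """Reduce operands three at a time until two remain, then use one final adder."""
    numbers = list(operands)
    while len(numbers) > 2:
        reduced = []
        for k in range(0, len(numbers) - 2, 3):
            y, z = carry_save_add(numbers[k], numbers[k + 1], numbers[k + 2])
            reduced += [y, z << 1]       # the carry word has one bit more weight
        reduced += numbers[len(numbers) - len(numbers) % 3:]   # leftovers pass through
        numbers = reduced
    return sum(numbers)                  # final carry-propagate adder

# Partial products of 0110 x 1001 from the combinational multiplier example.
partial_products = [0b0110 << 0, 0, 0, 0b0110 << 3]
product_value = wallace_sum(partial_products)
print(bin(product_value))
```

Every three operands become two at each level, so the number of levels grows logarithmically with the operand count rather than linearly as in a chain of adders.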


• Used in serial-arithmetic operations

• Multiplicand can be held in place by register

• Multiplier is shifted into array

Serial-Parallel Multiplier


• Can perform n-bit shifts in a single cycle

• Accepts 2n data inputs and n control signals, producing n data outputs

• Selects arbitrary contiguous n bits out of 2n input bits

• Examples

Right shift: data into top, 0 into bottom

Left shift: 0 into top, data into bottom

Rotate: data into top and bottom

[Figure: barrel shifter with two n-bit data inputs (data 1, data 2) and one n-bit output]

Barrel Shifter


• Two-dimensional array of 2n vertical X n horizontal cells

• Input data travels diagonally upward

Output wires travel horizontally

• Control signals run vertically

Exactly one control signal is set to 1, turning on all transmission gates in that column

• Large number of cells, but each one is small

• Delay is large, considering long wires and transmission gates

Barrel Shifter


• First computes carry propagate and generate

$P_i = a_i + b_i$

$G_i = a_i b_i$

• Computes sum and carry from P and G

$s_i = c_i \oplus P_i \oplus G_i$

$c_{i+1} = G_i + P_i c_i$

• Can recursively expand carry formula

$c_{i+1} = G_i + P_i (G_{i-1} + P_{i-1} c_{i-1})$

$c_{i+1} = G_i + P_i G_{i-1} + P_i P_{i-1} (G_{i-2} + P_{i-2} c_{i-2})$

• Expanded formula does not depend on intermediate carries

• Allows carry for each bit to be computed independently

Carry-Lookahead Unit
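The expanded carry formula above can be checked against ordinary addition. The Python sketch below computes every carry directly from the propagate and generate signals and the carry input, without using any intermediate carry; it assumes P is the OR and G is the AND of the operand bits, and that the sum bit is the XOR of carry, P, and G. The function names are illustrative.

```python
def to_bits(value: int, n_bits: int) -> list:
    """Little-endian list of bits (index 0 is the least significant bit)."""
    return [(value >> i) & 1 for i in range(n_bits)]

def lookahead_add(a_bits: list, b_bits: list, c0: int = 0) -> tuple[list, list]:
    """Return (sum bits, carries) using fully expanded carry equations."""
    p = [a | b for a, b in zip(a_bits, b_bits)]   # propagate
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate
    carries = [c0]
    for i in range(len(p)):
        # c_{i+1} = G_i + P_i G_{i-1} + ... + P_i ... P_1 G_0 + P_i ... P_0 c_0
        carry = 0
        prefix = 1                                 # running product P_i ... P_{j+1}
        for j in range(i, -1, -1):
            carry |= prefix & g[j]
            prefix &= p[j]
        carry |= prefix & c0
        carries.append(carry)
    sums = [carries[i] ^ p[i] ^ g[i] for i in range(len(p))]
    return sums, carries

sums, carries = lookahead_add(to_bits(6, 4), to_bits(9, 4))
print(sums, carries)
```

The inner loop for bit i touches i + 1 generate terms, which corresponds to the growing fan-in that makes deep lookahead gates large and slow.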


• Deepest carry expansion requires gates with large fan-in

Large and slow

• Carry-lookahead unit requires complex wiring between adders and lookahead unit

Values must be routed back from lookahead unit to adder

Depth-4 Carry-Lookahead Unit


• Looks for cases in which carry out of a set of bits is identical to carry in

• Typically organized into m-bit stages

• If ai ≠ bi for every bit in stage (every bit propagates), then bypass gate sends stage’s carry input directly to carry output

Carry-Skip Adder


• Computes two results in parallel, each for different carry input assumptions

• Uses actual carry in to select correct result

• Reduces delay to multiplexer

Carry-Select Adder


• Precharged carry chain which uses P and G signals

• Propagate signal connects adjacent carry bits

• Generate signal discharges carry bit

• Worst-case discharge path goes through entire carry chain

Manchester Carry Chain


• May be used in signal-processing arithmetic where fast computation is important but latency is unimportant

• LSB control signal clears the carry shift register

Serial Adder


• Computes a variety of logical and arithmetic functions based on opcode

• May offer complete set of functions of two variables or a subset

• Built around adder, since carry chain determines delay

• Function block may be used to compute required intermediate signals for a full-function ALU

Transmission gates may introduce significant delay

Arithmetic-Logic Unit


• P and G compute intermediate values from inputs

May not correspond to carry lookahead P and G for non-addition functions

• Add unit is adder of choice

• Output unit computes from sum, propagate signal

Arithmetic-Logic Unit


• 32-bit RISC microprocessor from 1985

• The simplicity made all the difference

Small, low power, and easy to program

• ARM architecture has become the dominant embedded processor

More than 10 billion ARM cores have been used in all sorts of gadgetry, including the iPhone

Acorn Computers ARM1 Processor


• In 1988, Russell Fish and Chuck Moore found a way to have the processor run its own super fast internal clock while still staying synchronized with the rest of the computer

• In the years since Sh-Boom’s invention, the speed of processors had by far surpassed that of motherboards, and so practically every maker of computers and consumer electronics wound up using the same solution

Since 2006, Patriot Scientific (and Moore) have reaped over US $125 million in licensing fees from Intel, AMD, Sony, Olympus, and others

Computer Cowboys Sh-Boom Processor


• Microchip Technology PIC16C84 8-bit microcontroller in 1993

Incorporates EEPROM

Does not need UV light to be erased as EPROM needs

8-bit Microprocessors

• Radiation-hardened RCA CDP 1802 8-bit microprocessor in 1976

One of the first, if not the first, CMOS processors

Low power consumption, wide range of operating voltages and military operating temperature range


• Read-Only Memory

• Static Random-Access Memory

• Dynamic Random-Access Memory

• Memory Generators

Embedded Memories


• Address is divided into row and column

Row may contain full word or more than one word

• Selected row drives/senses bit lines in columns

• Amplifiers/drivers read/write bit lines

Memory Architecture


• ROM core is organized as an array of NOR gates

Pull-down transistors of NOR determine programming

• Erasable ROMs require special processing that is not typically available

• ROMs on digital ICs are generally mask-programmed

Placement of pull-downs determines ROM contents

Read-Only Memory (ROM)


• Core cell uses six-transistor circuit to store value

• Value is stored symmetrically

Both true and complement are stored on cross-coupled transistors

• SRAM retains value as long as power is applied to circuit

• Read

Precharge bit and bit’ high

Set select line high from row decoder

One bit line will be pulled down

• Write

Set bit/bit’ to desired (complementary) values

Set select line high

Drive on bit lines will flip state if necessary

Static Random-Access Memory (SRAM)


• Differential pair

Takes advantage of complementarity of bit lines

• One bit line goes low

One arm of diff pair reduces its current, causing compensating increase in current of another arm

• Sense amp can be cross-coupled to increase speed

SRAM Sense Amplifier


Dynamic Random-Access Memory (DRAM)

• Cell can easily be made with a CMOS digital technology process

• Dynamic RAM loses value due to charge leakage

Must be refreshed

• Value is stored on gate capacitance of transistor t1

• Read

read = 1, write = 0, read_data’ is precharged

t1 will pull down read_data’ if 1 is stored

• Write

read = 0, write = 1, write_data = value

Guard transistor writes value onto gate capacitance

• Modern commercial DRAMs use one-transistor cell


• In 1980, Fujio Masuoka recruited four engineers to a project aimed at designing a memory chip that could store lots of data and would be affordable

Team came up with a variation of EEPROM that featured a memory cell consisting of a single transistor (at the time, conventional EEPROM needed two transistors per cell)

• Why is it named “flash”?

Because of the chip’s ultrafast erasing capability

• In 1984 Masuoka presented a paper at the IEEE International Electron Devices Meeting

In 1988 Intel developed a type of flash based on NOR logic gates (a 256‑kilobit chip)

Toshiba’s first NAND flash (greater storage densities but trickier to manufacture) hit the market in 1989

Toshiba NAND Flash Memory


Memory Generators

• A software tool which can create memories (ROM or RAM blocks) in a range of sizes as needed

The customer usually wants a particular number of words (depth) and bits (width) for each memory ordered

Each of the final building blocks (physical layout) will be implemented as a stand-alone, densely packed, pitch-matched array

• Complex layout generators and state-of-the-art logic and circuit design techniques offer

Embedded memories of extreme density and performance

• Each memory generator is a set of various, parameterized generators

Layout generator generates an array of custom, pitch-matched leaf cells

Schematic generator and Net-lister extract a net-list used for both layout vs. schematic and functional verification

Function and Timing model generators create models for gate level simulation, dynamic/static timing analysis and synthesis

Symbol generator generates schematic

Critical Path generator is used for both circuit design and timing characterization


Logic Synthesis

• Logic Synthesis Flow

• Optimization

• Technology Mapping

• Low-Power Techniques


[Figure: logic synthesis flow. Boxes: Architectural Description, RTL Description, Computer-Aided Synthesis, Design Constraints, Standard Cell Library, Gate-Level Net-list, Constraints Met (yes/no), Optimized Net-list, SDF File.]

Logic Synthesis Flow

• Goal is to create a logic gate network which performs a given set of functions

Input is Boolean formulae

Output is gates implementing Boolean functions

Several iterations needed for generation of the optimized gate-level description

• Logic synthesis

Maps onto available gates

Restructures for delay, area, testability, power, etc.

• Automated logic synthesis has enabled

Enormous reduction of the time needed for conversion of a design from high-level to gate-level description

Saving of designer resources for architectural and RTL descriptions, and optimization of the standard cell library


• Scheduling determines

Number of clock cycles required

As-soon-as-possible (ASAP) schedule puts every operation as early in time as possible

As-late-as-possible (ALAP) schedule puts every operation as late in schedule as possible

• Binding determines

Area and cycle time

• Area tradeoffs must consider

Shared function units vs. multiplexers and control

• Delay tradeoffs must consider

Cycle time vs. number of cycles

High-Level Synthesis
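The ASAP and ALAP schedules above can be computed from the data dependencies alone. The Python sketch below assumes every operation takes one clock cycle and uses a hypothetical five-operation data-flow graph; the function names and operation names are illustrative.

```python
def asap_schedule(deps: dict) -> dict:
    """Earliest cycle for each operation; deps maps op -> ops it depends on."""
    step: dict = {}

    def visit(op):
        if op not in step:
            step[op] = 1 + max((visit(d) for d in deps[op]), default=0)
        return step[op]

    for op in deps:
        visit(op)
    return step

def alap_schedule(deps: dict, n_cycles: int) -> dict:
    """Latest cycle for each operation that still finishes within n_cycles."""
    users = {op: [] for op in deps}
    for op, sources in deps.items():
        for src in sources:
            users[src].append(op)
    step: dict = {}

    def visit(op):
        if op not in step:
            step[op] = min((visit(u) for u in users[op]), default=n_cycles + 1) - 1
        return step[op]

    for op in deps:
        visit(op)
    return step

# Hypothetical graph: a1 = m1 + m2, a2 = a1 + m3 (m* are multiplications).
DEPS = {"m1": [], "m2": [], "m3": [], "a1": ["m1", "m2"], "a2": ["a1", "m3"]}

asap = asap_schedule(DEPS)
alap = alap_schedule(DEPS, n_cycles=max(asap.values()))
slack = {op: alap[op] - asap[op] for op in DEPS}
print(asap, alap, slack)
```

Operations with zero slack lie on the longest dependency chain. Here m3 has one cycle of slack, so binding could move it to cycle 2 and share a multiplier with m1 or m2, which is the kind of area tradeoff the slide mentions.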


• Technology-independent optimizations

A Boolean network is the main representation of the logic functions

Each node can be represented as sum-of-products (or product-of-sums)

Functions in the network need not correspond to logic gates

• Technology mapping (library binding)

Design transformation from technology-independent to technology-dependent

• Technology-dependent optimizations

Work in the available set of logic gates

Logic Synthesis Phases


• Area is estimated by number of literals

Literal is true or complement form of a variable

• Simplification

Rewrites a node to reduce the number of literals in it

• Network restructuring

Introduces new nodes for common factors

Collapses several nodes into one new node

• Delay restructuring

Changes factorization to reduce path length

out1 = k2 + x2’

out2 = k3 + x1

k2 = x1’ x2 x4 + k1

k3 = k1 x4’

k1 = x2 + x3

Primary inputs: x1, x2, x3, x4

Technology-Independent Optimization


• Function is defined by

On-set: set of inputs for which output is 1

Off-set: set of inputs for which output is 0

Don’t-care-set: set of inputs for which output is don’t-care

• Each way to write a function as a sum-of-products is a cover

It covers the on-set

• A cover is composed of cubes

Cubes are product terms that define a subspace cube in the function space

[Figure: cube in the x1, x2, x3 function space with vertex values 0 and 1 marked]

Covers and Cubes


• Larger cover

x1’ x2’ x3’ + x1 x2’ x3’ + x1’ x2 x3’ + x1 x2 x3

Requires four cubes (12 literals)

• Smaller cover

x2’ x3’ + x1’ x3’ + x1 x2 x3

Requires three cubes (7 literals)

x1’ x2’ x3’ is covered by two cubes

• Don’t-cares

Can be implemented in either on-set or off-set

Provide the greatest opportunities for minimization in many cases

• Espresso

A two-level logic optimizer

Expands, makes irredundant and reduces

Optimization loop refines cover to reduce its size

Covers and Optimizations
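A short check of the two covers listed above: the sketch below enumerates all input combinations to confirm that both covers describe the same on-set, and counts their literals. The cube encoding (a dictionary from variable index to required value) is an assumption made for this illustration.

```python
from itertools import product

# A cube maps variable index -> required value (1 = true literal, 0 = complemented).
# Index 0 is x1, index 1 is x2, index 2 is x3.
larger = [{0: 0, 1: 0, 2: 0}, {0: 1, 1: 0, 2: 0}, {0: 0, 1: 1, 2: 0}, {0: 1, 1: 1, 2: 1}]
smaller = [{1: 0, 2: 0}, {0: 0, 2: 0}, {0: 1, 1: 1, 2: 1}]

def covers(cube, point):
    """True if the input point lies inside the cube."""
    return all(point[var] == val for var, val in cube.items())

def on_set(cover, n_vars=3):
    """All input points for which the sum-of-products evaluates to 1."""
    return {p for p in product((0, 1), repeat=n_vars)
            if any(covers(c, p) for c in cover)}

def literal_count(cover):
    """Area estimate: total number of literals over all cubes."""
    return sum(len(c) for c in cover)

print(on_set(larger) == on_set(smaller))            # same function
print(literal_count(larger), literal_count(smaller))
```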


• Based on division

Formulate candidate divisor

Test how it divides into the function

If g = f/c, we can use c as an intermediate function

• Algebraic division

Doesn’t take into account Boolean simplification

Less expensive than Boolean division

• Three steps

Generate potential common factors and compute literal savings if used

Choose factors to substitute into network

Restructure the network to use the new factors

• Algebraic/Boolean division is used to implement first step

Factorization
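Algebraic division can be sketched as follows. This is a minimal version of the usual weak-division procedure, assuming each function is a set of cubes and each cube is a set of literal names; the example function and divisor are hypothetical.

```python
def cube(text):
    """Parse a product term such as "a c" into a frozenset of literals."""
    return frozenset(text.split())

def algebraic_divide(f, d):
    """Return (quotient, remainder) such that f = quotient * d + remainder."""
    quotient = None
    for d_cube in d:
        # Cubes of f that contain this divisor cube, with the divisor literals removed.
        partial = {c - d_cube for c in f if d_cube <= c}
        quotient = partial if quotient is None else quotient & partial
    quotient = quotient or set()
    covered = {q | d_cube for q in quotient for d_cube in d}
    remainder = set(f) - covered
    return quotient, remainder

# f = ac + ad + bc + bd + e, candidate divisor c + d
f = {cube("a c"), cube("a d"), cube("b c"), cube("b d"), cube("e")}
d = {cube("c"), cube("d")}
q, r = algebraic_divide(f, d)
print(q, r)  # quotient a + b, remainder e
```

Here the factored form (a + b)(c + d) + e needs 5 literals instead of the 9 in the original sum-of-products, which is the kind of literal saving the first step estimates.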


• Rewrites Boolean network

In terms of available logic functions

• Optimizes for

Area

Delay

• Can be viewed as a pattern matching problem

Find pattern match which minimizes area/delay cost

• Procedure

Write Boolean network in canonical NAND form

Write each library gate in canonical NAND form

Assign cost to each library gate

Use dynamic programming to select minimum-cost cover of network by library gates

Technology Mapping
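The procedure above can be illustrated with a small tree-covering sketch. The library, its gate costs, and the subject tree are all hypothetical; the code assumes the network has already been broken into a tree of two-input NANDs and inverters and that every library pattern is symmetric in its inputs.

```python
from functools import lru_cache

# Hypothetical library: gate name -> (pattern in NAND/INV form, cost).
LIBRARY = {
    "INV":   (("inv", "*"), 1.0),
    "NAND2": (("nand", "*", "*"), 2.0),
    "AND2":  (("inv", ("nand", "*", "*")), 2.5),
    "OR2":   (("nand", ("inv", "*"), ("inv", "*")), 3.0),
}

def match(pattern, node):
    """Return the subject subtrees bound to pattern leaves, or None if no match."""
    if pattern == "*":
        return [node]
    if node[0] != pattern[0] or len(node) != len(pattern):
        return None
    leaves = []
    for p, n in zip(pattern[1:], node[1:]):
        sub = match(p, n)
        if sub is None:
            return None
        leaves += sub
    return leaves

@lru_cache(maxsize=None)
def best_cover(node):
    """Dynamic programming: minimum cost and gate list for the tree rooted at node."""
    if node[0] == "in":
        return 0.0, ()
    best = None
    for gate, (pattern, cost) in LIBRARY.items():
        leaves = match(pattern, node)
        if leaves is None:
            continue
        total, gates = cost, (gate,)
        for leaf in leaves:
            leaf_cost, leaf_gates = best_cover(leaf)
            total, gates = total + leaf_cost, gates + leaf_gates
        if best is None or total < best[0]:
            best = (total, gates)
    return best

# Subject tree for (a + b) c in canonical NAND form.
a, b, c = ("in", "a"), ("in", "b"), ("in", "c")
subject = ("inv", ("nand", ("nand", ("inv", a), ("inv", b)), c))
cost, gates = best_cover(subject)
print(cost, gates)
```

Each node is solved once and its best cost is reused by every pattern that ends at that node, which is what keeps the search linear in the size of the tree for a fixed library.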


Breaking into Trees

not optimal, but reasonable cuts usually work well


after three levels of matching

Mapping Example


after four levels of matching

Mapping Example


• Architecture-driven supply voltage scaling

Add extra logic to increase parallelism so that system can run at lower frequency

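Power improvement for n parallel units relative to the reference supply voltage Vref

$P_n(n) = \left[1 + \frac{C_i(n)}{n\,C_{ref}} + \frac{C_x(n)}{C_{ref}}\right]\left(\frac{V}{V_{ref}}\right)^2$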

• Dynamic voltage and frequency scaling

Supply voltage and frequency are decreased for parts of the circuit where this does not adversely affect performance

Dynamic scaling is regulated by software based on system load

• Reducing capacitances

Parasitic capacitances of the transistors

Parasitic capacitances of the wires

Low Power Techniques


• Reducing switching activity

Deactivate the clock to unused registers (clock gating)

Deactivate signals if not used (signal gating)

Deactivate VDD for unused hardware blocks (power gating)

Low Power Techniques

• Distributed clocks: Globally Asynchronous Locally Synchronous

Eliminating centrally synchronous clocks and utilizing local clocks

Distinct local clocks, possibly running at different frequencies


Design for Testability

• DFT Methods

• Scan Design

• Test Pattern Generation

• Built-In Self-Test


• Make the system as testable as possible

Keep minimum cost in hardware and testing time

Use knowledge of architecture to help in selection of testability points

Modify architecture to improve testability

• DFT for digital circuits

Ad-hoc methods

Avoid asynchronous feedback

Make flip-flops initializable

Avoid redundant gates, large fan-in gates and gated clocks

Provide test control for difficult-to-control signals

Consider ATE requirements (tri-states, etc.)

Structured methods

Scan Design

Built-in self-test (BIST)

Boundary scan

Design for Testability Methods


• Circuit is designed using pre-specified design rules

• Test structure (hardware) is added to the verified design

Add a test control (TC) primary input

Replace flip-flops by scan flip-flops (SFF) and connect to form one or more shift registers (scan-chains) in the test mode

Make input/output of each scan-chain controllable/observable from primary input/primary output

• Use combinational ATPG to obtain tests for all testable faults in the combinational logic

• Add shift register tests and convert ATPG tests into scan sequences for use in manufacturing test

• Full scan is expensive

Must roll out and roll in state many times during a set of tests

• Partial scan selects some registers (not all) for scanability to reduce the chain length

Analysis is required to choose which registers are best for scan

Scan Design


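[Figure: scan flip-flop. A multiplexer controlled by TC selects D (normal mode) or SD (scan mode) as the input of a master-slave D flip-flop clocked by CK, with outputs Q and Q’. The multiplexer is the logic overhead. Timing diagram: CK and TC over time t, master latch open, then slave latch open.]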

Scanable Flip-Flop


Level-Sensitive Scanable Flip-Flop

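[Figure: level-sensitive scan flip-flop. Inputs D and SD, clocks MCK, SCK and TCK, outputs Q and Q’, master and slave latches of the D flip-flop, with the logic overhead marked. Timing diagram: SCK, MCK and TCK waveforms in normal mode and in scan mode.]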


Scan Structure

[Figure: scan structure. The SFFs are chained from SCANIN to SCANOUT and controlled by TC or TCK; the combinational logic sits between PI and PO. Not shown: CK or MCK/SCK feed all SFFs.]


Combinational Test Vectors

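[Figure: combinational test vectors. Primary inputs PI (I1, I2) and present states (S1, S2) shifted in through SCANIN; primary outputs PO (O1, O2) and next states (N1, N2) shifted out through SCANOUT; TC waveform over the scan sequence.]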

Sequence length = (ncomb + 1) nsff + ncomb clock periods

ncomb = number of combinational vectors

nsff = number of scan flip-flops


• Scan-chain must be tested prior to application of scan test sequences

• A shift sequence 00110011 . . . of length nsff+4 in scan mode (TC=0)

Produces 00, 01, 11 and 10 transitions in all flip-flops

Observes the result at SCANOUT output

• Total scan test length

(ncomb + 2) nsff + ncomb + 4 clock periods

• Example

2,000 scan flip-flops, 500 comb. vectors, total scan test length ~ 10^6 clocks

• Multiple scan-chains reduce test length

Testing Scan Chain
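The total scan test length formula above is easy to evaluate directly. The sketch below reproduces the example and also estimates the effect of multiple scan-chains, assuming (this is not stated on the slide) that the chains are balanced and shifted in parallel, so that the longest chain replaces nsff in the formula.

```python
def scan_test_length(n_comb, n_sff, n_chains=1):
    """Clock periods for scan test: (n_comb + 2) * chain_length + n_comb + 4."""
    chain_length = -(-n_sff // n_chains)  # ceiling division: longest chain
    return (n_comb + 2) * chain_length + n_comb + 4

single = scan_test_length(n_comb=500, n_sff=2000)
four_chains = scan_test_length(n_comb=500, n_sff=2000, n_chains=4)
print(single, four_chains)
```

For the example on the slide this gives 1,004,504 clock periods, which matches the quoted order of magnitude.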


• Errors are introduced during manufacturing

Testing weeds out infant mortality

• Varieties of testing

Functional testing

Performance testing

• Fault model

Possible locations of faults

I/O behavior produced by the fault

With a fault model, we can test the network for every possible instantiation of that type of fault

It is difficult to enumerate all types of manufacturing faults

• Testing procedure

Set inputs

Observe output

Compare fault-free and observed output

Testing and Faults


• Logic gate output is always stuck at 0 or 1, independently of the input values

• Correspondence to manufacturing defects depends on logic family

• Experiments show that 100% stuck-at-0/1 fault coverage corresponds to high overall fault coverage

• Testing NAND

Three ways to test it for stuck-at-0

Only one way to test it for stuck-at-1

• Testing NOR

Three ways to test it for stuck-at-1

Only one way to test it for stuck-at-0

NAND
a  b  | OK  SA0  SA1
0  0  |  1   0    1
0  1  |  1   0    1
1  0  |  1   0    1
1  1  |  0   0    1

NOR
a  b  | OK  SA0  SA1
0  0  |  1   0    1
0  1  |  0   0    1
1  0  |  0   0    1
1  1  |  0   0    1

Stuck-At-0/1 Faults


• Can test both NANDs for stuck-at-0 simultaneously

abc = 000

• Cannot test both NANDs for stuck-at-1 simultaneously due to inverter

Must use two vectors

Must also test inverter

Multiple Test Example


• Transistors always on/off

• t1 is stuck open (switch cannot be closed)

No path from VDD to output capacitance

• Testing requires two cycles

Must discharge capacitor

Try to operate t1 to charge capacitor

Stuck-At-Open/Closed Model


• Two parts of testing

Controlling the inputs of (possibly interior) gates

Observing the outputs of (possibly interior) gates

• Delay faults

Gate delay model assumes that all delays are lumped into one gate

Path delay model takes into account the delay of a path through network

Performance problems

Functional problems in some types of circuits

Combinational Testing Example


• Goal

Test gate D for stuck-at-0 fault

• First step

Justify 0 values on gate inputs

• Work backward from gate to primary inputs

w1 = 0 (A output = 0)

i1 = i2 = 1

• Observe the fault at a primary output

o1 gives different values depending on whether D is fault-free or faulty

• Work forward and backward

F’s other input must be 0 so that the fault-free and faulty values can be distinguished

Justify 0 at E’s output

• In general, may have to propagate fault through multiple levels of logic to primary outputs

Testing Procedure


• Redundant logic can mask faults

• Testing NOR for SA0 requires setting both inputs to 0

• Network topology ensures that one NOR input (for instance b) will always be 1

• Function reduces to 0

f = ((ab)’ + b)’ = (ab) b’ = 0

• Redundant logic can introduce delay faults and other problems

Redundancy and Testing


• Much harder than combinational testing

Can’t set memory element values directly

• Must apply sequences

To put machine in proper state for test

To observe value of test

• Testing of NAND for stuck-at-1

Set both NAND inputs to 1

Primary input i1 can be controlled directly

Lower input is 1 if ps0/ps1 = 1

Sequential Testing


• A model for sequential test

Unroll machine in time

• A single stuck-at fault in the sequential machine appears as a multiple stuck-at fault in the unrolled machine

Time-Frame Expansion


• Automatic test pattern generator (ATPG) generates a set of test vectors

Boolean network (combinational ATPG)

Sequential machine (sequential ATPG)

• The D notation (from Discrepancy) is a compact way to write down a fault

D value on a node means that good and faulty circuits have different values at that point

• If a test for a particular fault exists, D-algorithm will find it by an exhaustive search of all sensitized paths

Start at the faulty gate

Suppose initially a stuck-at fault on gate output

“Primitive D-cube of failure” (PDCF) of gate summarizes minimal assignment of input values to highlight fault

• Propagation D-cube (PDC) has D or D’ on output and on at least one input

Summarizes “non-controlling” values for other inputs to allow propagation of D signal

Test Pattern Generation
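The D notation can be modeled by carrying the good-circuit and faulty-circuit values side by side. The sketch below is an illustration only, not the D-algorithm itself: it omits the unknown value X and just shows how D propagates through gates when the other input holds a non-controlling value.

```python
# Each signal is a pair (value in good circuit, value in faulty circuit).
ZERO, ONE, D, D_BAR = (0, 0), (1, 1), (1, 0), (0, 1)
NAMES = {ZERO: "0", ONE: "1", D: "D", D_BAR: "D'"}

def and_(x, y):
    return (x[0] & y[0], x[1] & y[1])

def or_(x, y):
    return (x[0] | y[0], x[1] | y[1])

def not_(x):
    return (1 - x[0], 1 - x[1])

def nand(x, y):
    return not_(and_(x, y))

# A non-controlling value on the other input lets the discrepancy through.
print(NAMES[and_(D, ONE)])    # D propagates through AND when the other input is 1
print(NAMES[and_(D, ZERO)])   # a controlling 0 blocks it
print(NAMES[nand(D, ONE)])    # NAND inverts the discrepancy
print(NAMES[or_(D, ZERO)])    # 0 is non-controlling for OR
```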


• PODEM stands for Path-Oriented DEcision Making

Circuit-based, fault-oriented ATPG algorithm

• Goal

Propagate D value to primary outputs

• Signal values are explicitly assigned at primary inputs only

Other values are computed by implication

• Backtracking means reassigning primary inputs when a contradiction occurs

Uses implicit enumeration

• Uses five values: 0, 1, D, D’, and X

Start all values at X

• In worst case, must examine all possible inputs

Can be implemented to run quickly

PODEM Algorithm


Fault Propagation Example


• Includes on-chip machine responsible for

Generating tests

Evaluating correctness of tests

• Allows many tests to be applied

• Can’t afford large memory for test results

Rely on compression and statistical analysis

• Uses a linear-feedback shift register (LFSR) to generate a pseudo-random sequence of bit vectors

Built-In Self-Test (BIST)
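A pseudo-random pattern generator of this kind can be sketched as a small Fibonacci LFSR. The width, tap positions and seed below are assumptions chosen for illustration; with taps at bits 4 and 3 the 4-bit register cycles through all 15 non-zero states.

```python
def lfsr_states(seed, taps, width):
    """Yield successive states of a Fibonacci LFSR until the seed recurs.

    taps are 1-based bit positions; the highest tap must equal `width`
    so that the state sequence is a cycle that returns to the seed.
    """
    mask = (1 << width) - 1
    state = seed
    while True:
        yield state
        feedback = 0
        for t in taps:
            feedback ^= (state >> (t - 1)) & 1
        state = ((state << 1) | feedback) & mask
        if state == seed:
            return

states = list(lfsr_states(seed=0b0001, taps=(4, 3), width=4))
print(len(states), [format(s, "04b") for s in states])
```

The all-zero state is never reached from a non-zero seed, which is why a maximal-length sequence has 2^n - 1 rather than 2^n vectors.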


• One LFSR generates test sequence

• Another LFSR captures/compresses results

• Can store a small number of signatures which contain expected compressed results for valid system

• Usually used for testing memory blocks

BIST Architecture


Layout Generation

• Layout Generation Flow

• Design Rules

• Layout Tools

• Standard Cells

• Floorplanning

• Placement

• Routing

• Clock Tree

• Pads


[Figure: layout generation flow. Inputs: Net-list, Cell Library (LEF, TLF), Design Constraints. Steps: Floorplanning, Plan Power Routing, Placement, Adjust Size, Generate Clock Trees, Optimize Placement, Route, Verify. Outputs: Post-layout Net-list, SDF, DEF, GDSII.]

• Library Exchange Format (LEF) files

To create a library database (standard cells, I/O cells, and macro blocks)

• Timing Library Format (TLF) file

Timing constraints

• General Constraints Format (GCF) file

Design constraints

• Verilog net-list

To create a design database

Layout Generation Flow


• Floorplanning

To create a core area with rows (or columns) and I/O rows around the core area

• Power planning and routing

To plan, modify and route power paths, power rings and power stripes

• Placement

An I/O constraints file may be used to place the I/O pads

Block placement

Cell placement

• Size adjustment

To estimate the die size

To resize the design to make it routable



• Generating clock trees

The clock buffer space and clock net must be defined

Generating clock trees is an iterative process

At this point, the physical net-list differs from the logical (original) net-list

• Placement optimization

To resize gates and insert buffers to correct timing and electrical violations

• Routing

To perform both global and final route on a placed design

• Verification

To check for shorts and design rule violations



• Masks are tools for manufacturing

• Manufacturing processes have inherent limitations in accuracy

• Design rules specify geometry of masks which will provide reasonable yields

• Design rules are determined by experience

• MOSIS SCMOS

Designed to scale across a wide range of technologies

Designed to support multiple vendors

Designed for educational use

Fairly conservative

• Lambda (λ) design rules

Size of a minimum feature defines λ

Specifying λ particularizes the scalable rules

Parasitics are generally not specified in λ units

Design Rules


Minimum wire widths (in λ)

metal 3: 6

metal 2: 3

metal 1: 3

pdiff/ndiff: 3

poly: 2

Wires


[Figure: transistor layout with design-rule dimensions 2, 3, 1, 3, 2 and 5]

Transistors


• Types of via

Metal1/diff

Metal1/poly

Metal2/metal1

Metal3/metal2

...

[Figure: via layout with design-rule dimensions 4, 1, 4 and 2]

• Highest via

Cut: 3 x 3

Overlap by metal2: 1

Minimum spacing: 3

Minimum spacing to via1: 2

Vias


• Diffusion/diffusion

3

• Poly/poly

2

• Poly/diffusion

1

• Via/via

2

• Metal1/metal1

3

• Metal2/metal2

4

• Metal3/metal3

4

Spacings


• Cut in passivation layer

Connection for bonding wire

• Minimum bonding pad

100

• Pad overlap of glass opening

6

• Minimum pad spacing to unrelated metal2/3

30

• Minimum pad spacing to unrelated metal1, poly, active

15

Overglass


• Layout editors are interactive tools

• Design rule checkers identify errors on the layout

• Circuit extractors extract the net-list from the layout

• Connectivity verification systems (CVS) compare extracted and original net-lists

CADENCE Virtuoso’s Layout-versus-Schematic (LVS) tool

• Standard cell layouts are created from pre-designed cells using the custom routing

Silicon Ensemble (CADENCE)

Encounter (CADENCE)

Physical Compiler (SYNOPSYS)

Layout Tools


[Figure: standard cell layout with rows of cells surrounded by routing areas]

• Layout made of small cells

Gates, flip-flops, etc.

Cells are hand-designed

• Assembly of cells is automatic

Cells arranged in rows

Wires routed between and through cells

• Pitch is the height of a cell

All cells have same pitch, may have different widths

• VDD/VSS connections are designed to run through cells

• A feedthrough area allows wires to be routed over the cell

Standard Cell Layout

[Figure: standard cell structure. VDD and VSS rails, n tub with pullups, p tub with pulldowns, intra-cell wiring, pins, and feedthrough area.]


• Floorplanning must take into account

Blocks of varying function, size, and shape

Space allocation

Signal routing

Power supply routing

Clock distribution

Floorplanning Strategy

[Figure: example floorplans with blocks MIPS Logic, Processor Logic, Test Logic, Analog and RAM]


• Develop a wiring plan

Think about how layers will be used to distribute important wires

• Draw separate wiring plans for power and clocking

These are important design tasks which should be tackled early

• Sweep small components into larger blocks

A floorplan with a single NAND gate in the middle will be hard to work with

• Design wiring that looks simple

If it looks complicated, it is complicated

• Design planar wiring

Planarity is the essence of simplicity

Do it where feasible (and where it doesn’t introduce unacceptable delay)

Floorplanning Tips


• Placement of components interacts with routing of wires

• Quality metrics for layout

Area and delay

• Area and delay determined in part by

Wiring

• How do we judge a placement without wiring?

Estimate wire length without actually performing routing

[Figure: a bad placement compared with a good placement]

Placement Metrics


• To construct an initial solution

• To improve an existing solution

• Pairwise interchange is a simple improvement technique

Interchange a pair, keep the swap if it helps wire length

Heuristic determines which two components to swap

• Placement by partitioning

Works well for components of fairly uniform size

Partition net-list to minimize total wire length using min-cut criterion

• Kernighan-Lin Algorithm

Computes the min-cut criterion by counting the total net-cut change

Exchanges sets of nodes to perform hill-climbing finding improvements where no single swap will improve the cut

Recursively subdivide to determine placement detail

Placement Techniques
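Pairwise interchange can be sketched with a wire length estimate that needs no routing. The sketch assumes the half-perimeter of each net's bounding box as the estimate and tries every pair in turn rather than using a smarter heuristic; the cell names, slot coordinates and nets are hypothetical.

```python
from itertools import combinations

def wire_length(placement, nets):
    """Half-perimeter bounding-box estimate summed over all nets."""
    total = 0
    for net in nets:
        xs = [placement[cell][0] for cell in net]
        ys = [placement[cell][1] for cell in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def pairwise_interchange(placement, nets):
    """Swap pairs of cells, keeping a swap only if it shortens the estimate."""
    placement = dict(placement)
    best = wire_length(placement, nets)
    improved = True
    while improved:
        improved = False
        for a, b in combinations(sorted(placement), 2):
            placement[a], placement[b] = placement[b], placement[a]
            length = wire_length(placement, nets)
            if length < best:
                best, improved = length, True
            else:
                placement[a], placement[b] = placement[b], placement[a]  # undo
    return placement, best

initial = {"A": (0, 0), "B": (1, 0), "C": (2, 0), "D": (3, 0)}
nets = [("A", "C"), ("B", "D")]
final, best = pairwise_interchange(initial, nets)
print(wire_length(initial, nets), best, final)
```

Because only improving swaps are kept, the loop can stop in a local minimum; that limitation is what motivates the partitioning and Kernighan-Lin approaches listed above.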


• Major phases in routing

Global routing assigns nets to routing areas

Detailed routing designs the routing areas

• Net ordering determines quality of result

Net ordering is a heuristic

• Blocks and wiring

Blocks divide wiring area into routing channels

Large wiring areas may force rearrangement of block placement

• Channel routing

Channel grows in one dimension to accommodate wires

Pins generally on only two sides

• Switchbox routing

Box cannot grow in any dimension

Pins are on all four sides

Routing

[Figure: routing channels and a switchbox between blocks]


• Tracks form a grid for routing

Spacing between tracks is center-to-center distance between wires

Track spacing depends on wire layer used

• Density (vertical and horizontal)

Gives the number of wire segments crossing a vertical/horizontal grid segment

• Different layers are used for horizontal and vertical wires

Horizontal and vertical wires can be routed relatively independently

• Placement of cells determines placement of pins

• Pin placement determines difficulty of routing problem

Routing Channels


• Assumes one horizontal segment per net

• Sweep pins from left to right

Assign horizontal segment to lowest available track

• Limitations

Some combinations of nets require more than one horizontal segment per net (a dog-leg wire)

• Aligned pins form vertical constraints

Wire to lower pin must be on lower track

Wire to upper pin must be above lower pin’s wire

[Figure: left-edge routing examples with nets A, B and C in a channel; aligned pins of A and B create a vertical constraint]

Left-Edge Algorithm
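The left-to-right sweep can be sketched as follows. This minimal version assumes one horizontal segment per net and ignores vertical constraints from aligned pins, so it covers only the basic case; the net spans are hypothetical.

```python
def left_edge(nets):
    """Assign each net (name -> (left, right) column span) to a track.

    Nets are swept by left edge; each goes to the lowest track whose last
    segment ends before the net starts. Track 0 is the lowest track.
    """
    track_ends = []   # rightmost occupied column of each track
    assignment = {}
    for name, (left, right) in sorted(nets.items(), key=lambda item: item[1][0]):
        for track, end in enumerate(track_ends):
            if left > end:
                track_ends[track] = right
                assignment[name] = track
                break
        else:
            track_ends.append(right)
            assignment[name] = len(track_ends) - 1
    return assignment

nets = {"A": (0, 3), "B": (1, 5), "C": (4, 7), "D": (6, 8)}
tracks = left_edge(nets)
print(tracks, "tracks used:", max(tracks.values()) + 1)
```

In this example the number of tracks equals the channel density (at most two nets overlap in any column), which is the best any channel router can do.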


• Global routing

Assign wires to paths through channels

Don’t worry about exact routing of wires within channel

Can estimate channel height using congestion

• Detailed routing

Dog-leg router breaks net into multiple segments as needed

Minimize number of dog-leg segments per net to minimize congestion for future nets

Use left-edge criterion on each dog-leg segment to fill up the channel

Global and Detailed Routing


• Multipoint nets are harder to design than two-point nets

Rectilinear Steiner tree problem:

Find the minimum-length tree interconnecting all net’s points

Find additional points (so-called Steiner points) in the plane, if they contribute to a shorter tree length

• Steiner tree algorithm

Compute the spanning tree

Optimize the tree length by flipping L-shaped branches

Steiner points have degree three (or four, where two wires cross, in the rectilinear case)

Multipoint Nets
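The first step, computing the spanning tree, can be sketched with Prim's algorithm under the rectilinear (Manhattan) distance. The L-shape flipping step is not implemented here; instead, for the three-pin example, the sketch uses the fact that the point formed by the median x and median y coordinates is an optimal Steiner point for three terminals. The pin coordinates are hypothetical.

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def spanning_tree(points):
    """Prim's algorithm with rectilinear distance; returns (length, edges)."""
    in_tree = [points[0]]
    remaining = list(points[1:])
    edges, total = [], 0
    while remaining:
        p, q = min(((p, q) for p in in_tree for q in remaining),
                   key=lambda edge: manhattan(*edge))
        edges.append((p, q))
        total += manhattan(p, q)
        in_tree.append(q)
        remaining.remove(q)
    return total, edges

pins = [(0, 0), (4, 0), (2, 3)]
mst_length, mst_edges = spanning_tree(pins)

# Three-terminal case: the median point serves as the Steiner point.
steiner = (sorted(x for x, _ in pins)[1], sorted(y for _, y in pins)[1])
steiner_length = sum(manhattan(steiner, p) for p in pins)
print(mst_length, steiner, steiner_length)
```

The added point shortens the tree from 9 to 7 units in this example, which is the kind of saving the Steiner formulation is after.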


• Place and route to minimize

Capacitance of nodes with high glitching activity

• Feed back wiring capacitance values

To better estimate power consumption

• Size wires to be able to handle current

Requires designing topology of VDD/VSS networks

• Keep power network in metal

Requires designing planar wiring

Layout for Low Power


• Clock delay varies with position

Deliver clock to memory elements with acceptable skew

Deliver clock edges with acceptable sharpness

• Clocking network design

The greatest challenge in the design of a large chip

Clock Delay


• Clocks are generally distributed via wiring trees

• Use low-resistance interconnect to minimize delay

• Use multiple drivers to distribute driver requirements

Use optimal sizing principles to design buffers

• Clock lines can create significant crosstalk

Clock Distribution Tree


• Pads are placed on top-layer metal

Provide a place to bond to the package

• Pads are typically placed around periphery of chip

Some advanced packaging systems bond directly to package without bonding wire

Some allow pads across entire chip surface

• Supply power/ground to

Each pad

Chip core

• Positions of pads

May be determined by pin requirements

• Distribute power/ground pins as evenly as possible

To minimize power distribution problems

Pad Architecture


• Main purpose is to provide electrostatic discharge (ESD) protection

• Gate of transistor is very sensitive

Can be permanently damaged by high voltage

Static electricity in room is sufficient to cause damage

• Resistor is used in series with pad to limit current caused by voltage spike

• May use parasitic bipolar transistors to drain away high voltages

One for positive pulses

Another for negative pulses

• Must design layout to avoid latch-up

Input Pad


• Don’t need ESD protection

Transistor gates not connected to pad

• Must be able to drive capacitive pad load and outside load

• May need voltage level shifting

To be compatible with other logic families

Output Pad


• Combination of input and output

Controlled by a mode-input on chip

• Pad includes logic to disconnect output driver when pad is used as an input

• Must be protected against ESD

Three-State Pad


Design for Manufacturability

• Defects and Faults

• Critical Area Modeling

• Critical Area Extraction

• Yield Modeling

• Yield Control


• Design faults

• Random faults

Shorts between lines

Breaks of lines

Leakage of insulation layers

• Permanent faults

• Dynamic faults

• Transient faults

• Parametric test

Power consumption

Input resistance

Input/output current

• Functional test

Test signals applied

Output observed

Defects, Faults, and Tests


• Defect density distribution

g(D)

• Defect size distribution

h(X)

Defects Statistics


[Figure: defect density distribution g(D) plotted against D]

Defect Density Distribution


Defect Size Distribution


• Sample size

Number of measurements

• Structure size

High and low defect densities

• Critical dimensions

As defect sizes

• Self-isolation

Sensitive to one defect type

• Measurability

Electrical

Simple

Reliable

Test Structures


[Figure: critical areas for long parallel conductors and for real patterns]

Critical Area Models


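$$A = \int_0^\infty A(x)\,h(x)\,dx$$

Shorts:

$$A_s(x) = \begin{cases} 0 & \text{for } 0 \le x < s \\ (x - s)\,N L & \text{for } s \le x \le 2s + w \\ (s + w)\,N L & \text{for } x > 2s + w \end{cases}$$

Opens:

$$A_o(x) = \begin{cases} 0 & \text{for } 0 \le x < w \\ (x - w)\,N L & \text{for } w \le x \le 2w + s \\ (w + s)\,N L & \text{for } x > 2w + s \end{cases}$$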

Critical Area Models

$$h(x) = \frac{x^{\alpha - 1} \exp(-x/\beta)}{\Gamma(\alpha)\,\beta^{\alpha}}$$


Point defects and lithographic defects

$$A_l = (x_2 - x_1)(y_2 - y_1)$$

$$A_v = 2z\,[(x_2 - x_1) + (y_2 - y_1)]$$

Critical Area Extraction


[Figure: extracted critical areas for shorts, breaks, and vertical shorts]

Critical Area Extraction


$$Y = \int_0^\infty \exp(-A D)\, g(D)\, dD$$

Yield Models
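A numerical sketch of a yield model of this kind, assuming the yield is the average of exp(-A D) over the defect density distribution g(D). The triangular shape of g(D), the chip area and the mean defect density below are assumptions chosen for illustration. A fixed defect density gives the Poisson model; the triangular distribution is compared with its known closed form.

```python
import math

def yield_integral(area, g, d_max, steps=20000):
    """Integrate exp(-A*D) * g(D) over [0, d_max] with the midpoint rule."""
    h = d_max / steps
    return h * sum(math.exp(-area * (i + 0.5) * h) * g((i + 0.5) * h)
                   for i in range(steps))

def triangular(d0):
    """Triangular density on [0, 2*d0] with its peak at d0 (assumed shape)."""
    def g(d):
        if 0 <= d <= d0:
            return d / d0 ** 2
        if d0 < d <= 2 * d0:
            return (2 * d0 - d) / d0 ** 2
        return 0.0
    return g

area, d0 = 0.27, 2.0   # hypothetical: critical area in cm^2, defects per cm^2

y_poisson = math.exp(-area * d0)                          # g(D) concentrated at d0
y_triangular = yield_integral(area, triangular(d0), 2 * d0)
y_closed = ((1 - math.exp(-area * d0)) / (area * d0)) ** 2
print(round(y_poisson, 4), round(y_triangular, 4), round(y_closed, 4))
```

Spreading the defect density around its mean raises the predicted yield relative to the Poisson model, because exp(-A D) is convex in D.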


Yield Control


• Design Methodology

• SOC Design Flow

• An Example

SOC Example


• Every company has its own design methodology

• Methodology depends on

Size of chip

Design time constraints

Cost/performance

Available tools

• Driven by contradictory impulses

Customer concerns about cost and performance

Forecasts of feasibility of cost and performance

• Design features, performance, power, etc.

May be negotiated at early stages

Negotiation at later stages creates problems

Design Methodology


• Design styles

Full custom logic design (very tedious)

Semi custom or standard cell design

Field Programmable Gate Array (FPGA) design

• Full custom design

Most likely for data-paths

Least likely for random logic off critical path

• Standard cell design

Application Specific Integrated Circuits (ASIC)

• FPGA design

Prototyping oriented

Design Styles


• Functionality, speed, power consumption, area, etc.

• Estimation techniques of circuit performance vary with module

Memories may be generated once size is known

Data-paths may be estimated from previous design

Controllers are hard to estimate without details

• Clock distribution

• Layout design rule check

• Testing

Generation of simulation test vectors

Generation of scan test vectors (ATPG)

Manufacturing test vectors comprise both

Design Validation


[Figure: SOC design flow. Applications and the System Specification lead to the HDL Top Module Definition, which draws on New RTL Designs and on the MPL Library: Configurable Modules (synthesizable RTL code) and Pre-defined Modules (synthesized net-lists, layouts, standard cells). Simulation with Test Benches follows the HDL definition, Logic Synthesis, and Layout Synthesis. Each "OK?" check proceeds on yes; on no, the decisions "New logic run sufficient?" and "New layout run sufficient?" select how far back to iterate. The flow ends with the Final Chip Layout.]

SOC Design Flow


• Wireless Broadband Networks

A highly integrated broadband wireless modem

• Wireless Internet

A new terminal oriented TCP/IP for wireless systems

[Figure: application blocks. RF, Baseband, DLC, Protocol Engine, Application Engine, Mobile Bus. Engine, Power Management, Test Engine and Test Project, grouped under Wireless Internet and Wireless Broadband Network.]

• Wireless Sensor Networks

A flexible sensor network node architecture for medical applications

• Mobile Business Engine

A specific application processor for highly efficient encryption operations

Applications


• SOC designs are a mix of

Intellectual Property (IP) blocks

Standard functions (in-house developed)

Application specific blocks (in-house developed)

• RTL descriptions, net-lists and layouts

Soft-core MIPS32 4KEp

Soft-core LEON-2

Hard-core LEON-3

Soft-core IPMS430

UART, GPIO, PCMCIA

AMBA, I2C bus

Controllers

Hardware accelerators

SRAM memory generator

Hard-core flash

Reusable Modules


Area (mm²): 27

Transistors (×10^6): 1.5

Memory including BIST (kB): 20

Power/Frequency (mW/MHz): 8.1

Max Frequency (MHz): 70

Scan Chain (#FF): 11203

[Figure: configurable processor block diagram. CPU with I-Cache and D-Cache, AHB Controller, DCL and DSU on the AHB; Memory Controller for SRAM and Flash; AHB/APB Bridge to the APB with IO Port, Irq Ctrl, Timers and UARTs.]

Configurable Processor


• Customisation addresses three architectural levels

Instruction extension: designer specifies its functionality

Inclusion/exclusion of predefined blocks: special registers, BIST, etc.

Parameterisation: cache size, number of registers, etc.

• To customise the extensible processor to a specific application

It starts with profiling the application using instruction-set simulator

• To evaluate customisations using retargetable tool generation

Retargetable techniques automatically generate compilers and simulators aware of the new or extended instructions

• Major players in the field

Tensilica, Improv Systems, ARC, CoWare, and Target Compiler Technologies

NEC’s TCP/IP offload engine integrates 10 extensible processors

Extensible Processor


• “The crisis of complexity” in SOC design

SOC designs do not exploit the possible 100 M transistors per chip

250 M gates per SOC are feasible by 2005, but most SOCs would use only 50 M gates

• More design starts for Application-Specific Standard Products (ASSP) than for ASIC

Increase of the number of programmable processing units

Decrease of the gate count for custom logic blocks

• The trend is expected to continue, yielding a “sea of processors”

Many heterogeneous processors connected by a network-on-chip

• System-level design tools combined with use of off-the-shelf components are needed

Application-Specific Instruction-set Processors (ASIP) from ASSP

Extensible processor platform is state-of-the-art in ASIP technology

SOC Design Trends


• Optimisation and search through a large design space

Right set of extensible instructions and its constraints

• Communication of many extensible SOC processors on chip

A customised Network-on-Chip (NOC)

• Shift in SoC design distinction

To the process of customising extensible processors

To the software design expertise needed to program them

• “Structured” ASICs introduce pre-built blocks (logic, configurable memories, and test structures)

Most of the metal layers predefined

Customisation of the upper two or three metal layers

All customers share the same prefabricated die

Open Issues and Alternatives
