Post on 04-Jan-2016
The Limits of Semiconductor Technology & Coming Challenges in Microarchitecture and Architecture
Mile Stojčev, Teufik Tokić, Ivan Milentijević
Faculty of Electronic Engineering, Niš
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline – Technology Trends
• Moore’s Law 1
• Moore’s Law 2
• Performance and New Technology Generation
• Technology Trends – Example
• Trends in Future
• Processor Technology
• Memory Technology
Moore’s Law 1
In 1965 Gordon Moore, director of research and development at Fairchild Semiconductor and later a founder of Intel Corp., wrote a paper for Electronics entitled “Cramming more components onto integrated circuits”. In the paper Moore observed that “The complexity for minimum component cost has increased at a rate of roughly a factor of two per year”.
This observation became known as Moore’s law.
In fact, by 1975 the leading chips had maybe one-tenth as many components as Moore had predicted. The doubling period had stretched out to an average of 17 months in the decade ending in 1975, then slowed to 22 months through 1985 and 32 months through 1995. It has revived to a now relatively peppy 22 to 24 months in recent years.
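To compare the doubling periods quoted above, it can help to convert each one into a growth factor per year. The following is a minimal sketch; the function name and the period labels are illustrative choices, and the 24-month entry simply takes the upper end of the "22 to 24 months" range.

```python
def annual_growth_factor(doubling_period_months: float) -> float:
    """Growth factor per year for a quantity that doubles every given number of months."""
    return 2.0 ** (12.0 / doubling_period_months)


# Doubling periods (in months) as quoted in the slide text.
periods = {
    "decade ending 1975": 17,
    "through 1985": 22,
    "through 1995": 32,
    "recent years": 24,
}

for label, months in periods.items():
    print(f"{label}: x{annual_growth_factor(months):.2f} per year")
```

A 12-month doubling period gives exactly 2x per year, which is Moore's original 1965 observation.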
Moore’s Law 1 (continued)
Similar exponential growth rates have occurred for other aspects of computer technology – disk capacities, memory chip capacities and processor performance. These remarkable growth rates have been the major driving forces of the computer revolution.
         Capacity          Speed (latency)
Logic    2x in 3 years     2x in 3 years
DRAM     4x in 3 years     2x in 10 years
Disk     4x in 3 years     2x in 10 years
Moore’s Law 1 – number of transistors
Moore’s Law 1 – Linewidths
One of the key drivers behind the industry’s ability to double transistor counts every 18 to 24 months is the continuous reduction in linewidths. Shrinking linewidths not only enable more components to fit onto an IC (typically 2x per linewidth generation) but also lower costs (typically 30% per linewidth generation).
Moore’s Law 1 – Die size
Shrinking linewidths have slowed the rate of growth in die size to 1.14x per year, versus 1.38x to 1.58x per year for transistor counts, and since the mid-nineties accelerating linewidth shrinks have halted and even reversed the growth in die sizes.
Moore’s Law in Action
The number of transistors on a chip doubles annually
Moore’s Law 1 – Microprocessor
Moore’s Law 1 – Capacity of Single-Chip DRAM
Improving frequency via pipelining
Process technology and microarchitecture innovations enable the frequency to double every process generation.
The figure presents the contribution of both: as the process improves, the frequency increases and the average amount of work done in each pipeline stage decreases.
Process Complexity
Shrinking linewidths isn’t free. Linewidth shrinks require process modifications to deal with a variety of issues that come up from shrinking the devices, leading to increasing complexity in the processes being used.
Moore’s Law 2 (Rock’s Law)
In 1996 Intel augmented Moore’s law (the number of transistors on a processor doubles approximately every 18 months) with Moore’s law 2.
Law 2 says that as the sophistication of chips increases, the cost of fabrication rises exponentially.
The cost of semiconductor tools doubles every four years. By this logic, chip fabrication plants, or fabs, were supposed to cost $5 billion each by the late 1990s and $10 billion by now.
Moore’s Law 2 (Rock’s Law) – continued
For example, in 1986 Intel manufactured the 386, which counted 250,000 transistors, in fabs costing $200 million. In 1996 a $2 billion facility was needed to produce the Pentium processor, which counted 6 million transistors.
Moore’s Law 2 (Rock’s Law): The Cost of Semiconductor Tools Doubles Every Four Years
Machrone’s Law
The PC you want to buy will always be $5000
Metcalfe’s Law
A network’s value grows proportionately to the number of its users squared
Wirth’s Law
Software is slowing faster than hardware is accelerating
Performance and new technology generation
According to Moore’s law, each new generation has approximately doubled logic circuit density and increased performance by about 40%, while quadrupling memory capacity.
The increase in components per chip comes from the following key factors:
• The factor of two in component density comes from a 2^0.5 shrink in each lithography dimension (2^0.5 per x and 2^0.5 per y).
• An additional factor of 2^0.5 comes from an increase in chip area.
• A final factor of 2^0.5 comes from device and circuit cleverness.
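As a quick check, taking each of the factors above as the square root of two (2^0.5), they multiply out to the quadrupling of memory capacity mentioned earlier. The grouping labels below are only descriptive:

```latex
\[
\underbrace{\sqrt{2}\cdot\sqrt{2}}_{\text{lithography shrink in } x \text{ and } y}
\times
\underbrace{\sqrt{2}}_{\text{chip area}}
\times
\underbrace{\sqrt{2}}_{\text{device and circuit cleverness}}
= 2 \times 2 = 4
\]
```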
Development in ICs
Semiconductor Industry Association Roadmap Summary for high-end Processors
Specification / year           1997     1999     2001     2003     2006     2009     2012
Feature size (micron)          0.25     0.18     0.15     0.13     0.1      0.07     0.05
Supply voltage (V)             1.8-2.5  1.5-1.8  1.2-1.5  1.2-1.5  0.9-1.2  0.6-0.9  0.5-0.6
Transistors/chip (millions)    11       21       40       76       200      520      1400
DRAM bits/chip (mega)          167      1070     1700     4290     17200    68700    275000
Die size (mm^2)                300      340      385      430      520      620      750
Global clock freq. (MHz)       750      1200     1400     1600     2000     2500     3000
Local clock freq. (MHz)        750      1250     1500     2100     3500     6000     10000
Maximum power/chip (W)         70       90       110      130      160      170      175
[Figure: total transistors/chip. No. of transistors (millions), 0 to 1600, versus technology (micron)/year, from 0.25/1997 to 0.05/2012.]
Clock Frequency Versus Year for Various Representative Machines
Limits in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range.
For higher-frequency clocking (>10 GHz), new ideas and new ways of designing digital systems are needed.
[Figure: global and local clock frequency (MHz), 0 to 12000, versus technology (micron)/year, from 0.25/1997 to 0.05/2012.]
Intel’s Microprocessors Clock Frequency
Technology Trends – Example
As an illustration of just how quickly computer technology is improving, let’s consider what would have happened if automobiles had improved equally quickly.
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987 and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l.
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in about 39 min and to make the one-way trip on less than 10 liters of gasoline.
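The compound-growth arithmetic in this example can be reproduced with a few lines of code. This is a minimal sketch: the helper name `improve` is illustrative, and the earth-moon distance of 384,400 km is an assumption (the mean distance), since the slide does not state which value it uses. Because the code does not round the intermediate factor to 20.1, its results differ slightly from the slide's figures.

```python
def improve(value: float, rate_per_year: float, years: int) -> float:
    """Compound a value at a fixed yearly improvement rate."""
    return value * (1.0 + rate_per_year) ** years


# Starting point in 1977, as given in the example.
speed_1977_kmh = 150.0
economy_1977_kml = 10.0

# 35% per year for the 10 years to 1987.
speed_1987 = improve(speed_1977_kmh, 0.35, 10)
economy_1987 = improve(economy_1977_kml, 0.35, 10)

# 50% per year for the 13 years to 2000.
speed_2000 = improve(speed_1987, 0.50, 13)
economy_2000 = improve(economy_1987, 0.50, 13)

# Assumption: mean earth-moon distance, not stated in the slides.
EARTH_MOON_KM = 384_400
minutes_to_moon = EARTH_MOON_KM / speed_2000 * 60.0
liters_one_way = EARTH_MOON_KM / economy_2000

print(f"1987: {speed_1987:.0f} km/h, {economy_1987:.0f} km/l")
print(f"2000: {speed_2000:.0f} km/h, {economy_2000:.0f} km/l")
print(f"Moon: {minutes_to_moon:.1f} min, {liters_one_way:.1f} l one way")
```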
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a “roadmap” based on the idea of Moore’s law.
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available.
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm.
Ideally, processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7.
With such scaling, typical improvement figures are the following:
• 1.4 – 1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry.
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and integration (from ~10K to ~10M devices) in 25 years.
Microarchitecture attempts to increase both IPC and frequency.
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction and out-of-order execution can increase instructions per cycle (IPC).
Pipelining, as a microarchitecture idea, helps to increase frequency.
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program.
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages.
With frequencies higher than 1 GHz, more than 20 pipeline stages are used.
[Figure: relative improvement in frequency, CPI, performance and power versus pipeline depth.]
Performance of memory and CPU
Memory in a computer system is hierarchically organized.
In 1980 microprocessors were often designed without caches.
Nowadays microprocessors often come with two levels of caches.
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987.
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed.
Relative processor/memory speed
Types of Memories
MOS memories
• RAMs (DRAM, SRAM): VOLATILE – power off, contents lost
• ROMs (ROM, EPROM, EEPROM, FLASH): NON-VOLATILE – power off, contents kept
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 – 4.5 CPU cycles per instruction retired.
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed.
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse.
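The step from the measured cycles per instruction (CPI) to "three out of every four cycles retired zero instructions" can be made explicit. Over N cycles, N/CPI instructions retire, and every cycle that retires anything accounts for at least one of them, so at most N/CPI cycles are productive. The sketch below computes that lower bound on idle cycles; the function name is illustrative.

```python
def min_idle_fraction(cpi: float) -> float:
    """Lower bound on the fraction of cycles that retire zero instructions.

    Each productive cycle retires at least one instruction, so at most
    1/CPI of all cycles can be productive.
    """
    if cpi <= 0:
        raise ValueError("CPI must be positive")
    return max(0.0, 1.0 - 1.0 / cpi)


for cpi in (4.2, 4.5):
    print(f"CPI {cpi}: at least {min_idle_fraction(cpi):.0%} of cycles retire nothing")
```

On processors that can retire several instructions per cycle, the real idle fraction is higher than this bound.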
An anecdote – continued
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width and two instruction streams.
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy.
It is not clear whether pin bandwidth can keep up – 32 bytes every 2 ns.
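The pin-bandwidth figures above follow from simple arithmetic, sketched here. The function name is illustrative; the only convention assumed is that 1 GByte/s means 10^9 bytes per second, so bytes per nanosecond equal GBytes/s.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer: float, period_ns: float) -> float:
    """Pin bandwidth in GBytes/s (bytes per nanosecond equals 1e9 bytes per second)."""
    return bytes_per_transfer / period_ns


today = bandwidth_gbytes_per_s(16, 8)   # 16 bytes every 8 ns

# Twice the clock rate, twice the issue width, two instruction streams.
future = today * 2 * 2 * 2

print(f"today:  {today:.0f} GBytes/s")
print(f"future: {future:.0f} GBytes/s")
print(f"32 bytes every 2 ns: {bandwidth_gbytes_per_s(32, 2):.0f} GBytes/s")
```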
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance.
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components.
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data.
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory.
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip.
The first level is ~32 – 128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses.
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.
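To see what these hit rates and latencies mean together, one can estimate the average memory access time (AMAT) of the two-level hierarchy. This is a minimal sketch using the standard AMAT formula, which the slides do not spell out. The chosen operating point (3-cycle L1, 10-cycle L2, 100-cycle main memory, 95% and 50% hit rates) picks values from the ranges quoted above and is an assumption, not a measured configuration.

```python
def average_access_cycles(l1_cycles: float, l1_hit_rate: float,
                          l2_cycles: float, l2_hit_rate: float,
                          memory_cycles: float) -> float:
    """Average memory access time for a two-level cache hierarchy.

    Every access pays the L1 latency; L1 misses also pay the L2 latency;
    L2 misses additionally pay the main-memory latency.
    """
    l2_miss_penalty = memory_cycles
    l1_miss_penalty = l2_cycles + (1.0 - l2_hit_rate) * l2_miss_penalty
    return l1_cycles + (1.0 - l1_hit_rate) * l1_miss_penalty


# Assumed operating point taken from the ranges in the text.
with_caches = average_access_cycles(3, 0.95, 10, 0.50, 100)
print(f"average access time with caches: {with_caches:.1f} cycles")
print("without caches every access would take about 100 cycles")
```

With these assumptions the average access costs about 6 cycles instead of about 100, which is why the hierarchy matters so much.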
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles.
A cache miss that eventually has to go to main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
Conclusion concerning memory – problems
Today’s chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems – caches, buses, bandwidth and latency.
Conclusion concerning memory – problems (continued)
If the memory research community would follow the microprocessor community’s lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change.
b) Burst width, which refers to data access granularity, can effect a 15% performance change.
c) Magnetic RAM (MRAM) – a new type of memory.
Magnetic RAM (MRAM) or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology.
A nanotechnology RAM device consists of tiny carbon nanotubes.
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM).
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm – just a few thousand atoms – in diameter on a silicon wafer.
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH – MRAM
[Figure: design rule (nm), from 100 nm down to 55 nm, and density (Gb), 0.1 to 100 on a logarithmic scale, of DRAM and FLASH (SLC, and MLC with 2 or 3 bits/cell) for 2003, 2005 and 2010.]
Surpassing the Prediction from Moore’s Law – DRAM vs MRAM
The famous Moore’s Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
[Figure: density (Gb), 0.1 to 100 on a logarithmic scale, of DRAM and FLASH (MLC with 2 or 3 bits/cell) versus year, 2000 to 2010, compared with Moore’s Law; FLASH shows a 2-fold density increase per year.]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore’s Law.
Overall memory prediction roadmap – continued
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline – Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
Power consumption
• During 1995 the energy consumption of all PC machines installed in the USA was 60·10^6 MWh.
• During 2000 the energy consumption of all PC machines installed in the USA was 10% of the total energy production.
• During 2015 one expects that the energy consumption of all PC machines will be 15% greater than in 1995, or 69·10^6 MWh.
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
“CMOS circuits dissipate little power by nature. So believed circuit designers.” (Kuroda-Sakurai ’95)
“By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced.”
Power dissipation in time
[Figure: power (W), 0.01 to 100 on a logarithmic scale, versus year, 1980 to 1995; the trend is x4 every 3 years.]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage (V), power per chip (W) and VDD current (A) versus year, 1998 to 2014.]
International Technology Roadmap for Semiconductors 1999 update sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA) Electronic Industries Association of Japan (EIAJ) Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai’s ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
\[ P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4 \]
where
P1 – capacitive switching power (dynamic, dominant)
P2 – short-circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
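The contribution of each term can be illustrated numerically. The sketch below evaluates the three main components and their sum; all parameter values in the usage example (activity factor, load capacitance, supply voltage, clock frequency and the two currents) are hypothetical numbers chosen only for illustration, not data from the slides.

```python
def power_components(p_t: float, c_load_f: float, v_dd: float, f_clk_hz: float,
                     i_sc_a: float, i_leak_a: float, p_static_w: float = 0.0):
    """Return (P1, P2, P3, total) in watts for the average-power expression.

    P1 = p_t * C_L * Vdd^2 * f_clk   capacitive switching power
    P2 = I_sc * Vdd                  short-circuit power
    P3 = I_leakage * Vdd             leakage power
    """
    p1 = p_t * c_load_f * v_dd ** 2 * f_clk_hz
    p2 = i_sc_a * v_dd
    p3 = i_leak_a * v_dd
    return p1, p2, p3, p1 + p2 + p3 + p_static_w


# Hypothetical operating point, for illustration only.
p1, p2, p3, total = power_components(
    p_t=0.2, c_load_f=1e-9, v_dd=1.8, f_clk_hz=500e6,
    i_sc_a=0.01, i_leak_a=0.001)

print(f"switching {p1:.3f} W, short-circuit {p2:.3f} W, leakage {p3:.4f} W")
print(f"total {total:.3f} W, switching share {p1 / total:.0%}")
```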
Research Efforts in Low-Power Design
P_sw = p_t · C_L · V_dd^2 · f_CLK
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement.
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed.
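The different leverage of the three knobs can be seen by scaling the switching-power expression p_t · C_L · V_dd^2 · f_CLK. This is a minimal sketch; the function name and the 30% reduction used for the comparison are illustrative assumptions.

```python
def relative_switching_power(v_scale: float = 1.0, c_scale: float = 1.0,
                             activity_scale: float = 1.0,
                             f_scale: float = 1.0) -> float:
    """Switching power relative to the baseline after scaling each factor.

    Supply voltage enters squared; the other factors enter linearly.
    """
    return activity_scale * c_scale * v_scale ** 2 * f_scale


# Apply the same 30% reduction to each knob in turn.
print(f"voltage -30%:     {relative_switching_power(v_scale=0.7):.2f}")
print(f"capacitance -30%: {relative_switching_power(c_scale=0.7):.2f}")
print(f"activity -30%:    {relative_switching_power(activity_scale=0.7):.2f}")
```

The same relative reduction roughly halves the power when applied to the supply voltage, but only removes 30% when applied to capacitance or activity.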
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V), 0.6 to 5.0.]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is:
Decreasing the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
General Approaches to Reduce Power
a) Reduction in f_CLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in V_dd is the most effective way to reduce power, since the power is proportional to the square of V_dd. The problem with reducing V_dd is that it leads to an increase in circuit delay.
c) The product p_t·C_L is called the average switched capacitance per cycle, and this capacitance can be reduced at the system, architectural, RTL, circuit or technology level.
Low Power and Low Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology (higher levels offer higher impact and more options):
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach, and it is attracting more attention.
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL generates f_CLK, which feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4) that produce the local frequencies f11 to f41.]
Energy Minimization Using Multiple Frequency
[Figure, PLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO and divider by N, driving the regulated voltage, clock distribution & frequency multiplier and logic of the digital system.]
[Figure, DLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter and control of a voltage-controlled delay line (VCDL) with stages DC1, DC2, ..., DCn, producing f1, 2f1, ..., nf1 for the digital system.]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure, clock gating: a latch samples the enable signal and produces a gated clock for the target flip-flops (DFF with inputs D and C, output Q).]
[Figure, clock distribution: a PLL (clock generator) distributes Clk to modules A, B, C and D; Enable_A, Enable_B and Enable_C select which modules are activated or deactivated.]
– The use of a gated clock is the most common approach to reduce energy. Unused modules are turned off by suppressing the clock to the module.
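The effect of suppressing the clock can be illustrated with a small behavioural model. This is a sketch of one common latch-based gating scheme, assumed here for illustration: the enable is latched while the clock is low and ANDed with the clock, so the gated clock never glitches when the enable changes during the high phase. The signal lists and function names are hypothetical.

```python
def gated_clock(clock, enable):
    """Latch-based clock gating on sampled 0/1 signals.

    The enable is captured while the clock is low; the gated clock is the
    clock ANDed with the captured enable.
    """
    latched = 0
    gated = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en          # latch is transparent while the clock is low
        gated.append(clk & latched)
    return gated


def rising_edges(signal):
    """Count 0 -> 1 transitions, i.e. the edges that trigger flip-flops."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)


clock = [0, 1] * 8                       # eight clock cycles
enable = [1] * 4 + [0] * 8 + [1] * 4     # module idle in the middle four cycles
gated = gated_clock(clock, enable)

print("edges seen without gating:", rising_edges(clock))
print("edges seen with gating:   ", rising_edges(gated))
```

In this example the idle module sees half as many clock edges, so its flip-flops and clock tree switch half as often.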
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
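[Figure, power manager: an observer and a controller receive workload information and observations from the system and issue commands to it.]
[Figure, power state machine: states RUN (P = 400 mW), IDLE (P = 50 mW) and SLEEP (P = 0.16 mW); IDLE is entered by “wait for interrupt” and SLEEP by “wait for wake-up event”; transition times are ~10 µs, ~90 µs and 160 ms.]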
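The benefit of moving between power states can be estimated from the state powers shown in the power state machine. The sketch below computes the time-weighted average power; the time shares in the usage example are hypothetical, and transition times and transition energy are ignored for simplicity.

```python
# State powers in mW, as read from the power state machine figure.
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(time_share: dict) -> float:
    """Time-weighted average power for the given share of time per state.

    Shares are normalized, so they may be fractions or absolute durations.
    Transition overheads are not modelled.
    """
    total_time = sum(time_share.values())
    if total_time <= 0:
        raise ValueError("time shares must sum to a positive value")
    weighted = sum(STATE_POWER_MW[state] * share for state, share in time_share.items())
    return weighted / total_time


always_running = average_power_mw({"RUN": 1.0})
# Hypothetical workload: busy 20% of the time, idle 30%, asleep 50%.
managed = average_power_mw({"RUN": 0.2, "IDLE": 0.3, "SLEEP": 0.5})

print(f"always running: {always_running:.1f} mW")
print(f"power managed:  {managed:.2f} mW")
```

In practice the 160 ms wake-up from SLEEP limits how often that state can be used, which is the trade-off a power manager has to weigh.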
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: power breakdown by clock, memory, control, I/O and datapath; dynamic instruction statistics by arithmetic, logical, compare, data move, control flow and other operations.]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline – Microprocessors’ Generations
• First Generation: 1971-78 – Behind the power curve
• Second Generation: 1979-85 – Becoming “real” computers
• Third Generation: 1985-89 – Challenging the “establishment”
• Fourth Generation: 1990- – Architectural and performance leadership
The microprocessor today
When we say “microprocessor” today, we generally mean the shaded area of the figure.
The First Generation: 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  – Narrow datapaths (= slow performance)
  – Awkward architectures
  – Assembly language + some BASIC
• Processors
  – Intel 4004, 8008, 8080, 8086
  – Zilog Z-80
  – Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  – 3500 transistors
  – First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• Intel’s first 16-bit architecture
  – Delivered in 1974
  – 4800 transistors
  – Performance < 0.2 MIPS
• Used in Altair 8800 system
  – Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, first microcomputer bus
  – Gates & Allen write BASIC
  – Wozniak builds one
    Homebrew Computer Club
Intel 8086
• Introduced in 1978
  – Performance < 0.5 MIPS
• New 16-bit architecture
  – “Assembly language” compatible with 8080
  – 29,000 transistors
  – Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces PC
  – Based on 8088, the 8-bit bus version of 8086
Second Generation: 1979-85
• Becoming “real” computers
  – First 32-bit architecture (68000)
  – First virtual memory support
  – Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  – Motorola 68000, 68020
  – Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  – First 32-bit architecture
    initial 16-bit implementation
  – First flat 32-bit address
    Support for paging
  – General-purpose register architecture
    Loosely based on PDP-11
• First implementation in 1979
  – 68,000 transistors
  – < 1 MIPS
• Used in
  – Apple Mac
  – Sun, Silicon Graphics & Apollo workstations
Third Generation: 1985-89
• Challenging the “establishment”
  – Microprocessors surpass minicomputers in performance, rival mainframes
  – Implementation technology of choice: all new architectures are microprocessors
  – RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  – MIPS R2000, R3000
  – Sun SPARC
  – HP PA-RISC
MIPS R2000
• Several firsts
  – First RISC microprocessor
  – First microprocessor to provide integrated support for instruction & data cache
  – First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  – 125,000 transistors
  – 5-8 MIPS
Fourth Generation: 1990-
• Architectural and performance leadership
  – First 64-bit architecture
  – First multiple-issue machine
  – First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  – Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5 – same basic approach but faster clock rates & wider issue
  – Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  – True from 1985 to the present
• Combination of technology and architectural enhancements
  – Technology provides faster transistors and more of them
  – Faster transistors lead to high clock rates
  – More transistors: architectural ideas turn transistors into performance
    Responsible for about half the yearly performance growth
• Two key architectural directions
  – Sophisticated memory hierarchies
  – Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
  – CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  – On-chip: from 128 bytes (1984) to 100K+ bytes
  – Multilevel caches add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  – Reduce or hide cache miss latencies
    early restart after cache miss (1992)
    nonblocking caches: continue during a cache miss (1994)
  – Cache-aware combos: computers, compilers, code writers
    prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  – Overlapping execution in a pipeline
  – Issuing multiple instructions per clock
    superscalar: uses dynamic issue decision (HW driven)
    VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  – determine parallelism dynamically
  – execute instructions out-of-order
  – speculative execution depending on branch prediction
• “Off-the-shelf” ILP techniques yielded 20 year path
MIPS R4000
First 64-bit architecture
Integrated caches
– On-chip
– Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
Implemented in 1991
– 1.3M transistors
– 50 MIPS
Used primarily as attached processor (e.g. graphics)
MIPS R10000
First speculative processor
– Instructions scheduled and executed out of order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in flight)
– Maintains precise state by completing instructions in order
Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software Techniques
– Static scheduling
– Static issue (i.e. VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e. superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
Wide variety of approaches, both hardware and compiler intensive
Software techniques: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware techniques: more stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (0.1 to 100, log scale) versus year, 1965-1995, for supercomputers, mainframes, minicomputers and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (1,000 to 100,000,000, log scale) versus year, 1970-2005, with eras of bit-level, instruction-level and thread-level (?) parallelism; data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000.]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
Support for 2-4 threads, but expect to get only 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique to obtain more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute
– independent programs
– explicitly parallel programs (when possible)
– speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view concerning the theme "How shall we satisfy the long-term educational needs of engineers?"
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
But, as we said earlier, engineering is changing
What kinds of fundamentals we need now: some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT
Discrete mathematics, not continuous math, is the underpinning of IT
It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast
Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields
More knowledge of the full spectrum will be a fundamental
Engineering is global and is performed in a holistic business context
The engineer must design under constraints that include global, cultural and business contexts, and so must understand them
These too are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Technology Trends
• Moore's Law 1
• Moore's Law 2
• Performance and New Technology Generation
• Technology Trends – Example
• Trends in Future
• Processor Technology
• Memory Technology
Moore's Law 1
In 1965 Gordon Moore, director of research and development at Fairchild Semiconductor and later a founder of Intel Corp., wrote a paper for Electronics entitled "Cramming more components onto integrated circuits". In the paper Moore observed that "The complexity for minimum component cost has increased at a rate of roughly a factor of two per year"
This observation became known as Moore's law
In fact, by 1975 the leading chips had maybe one-tenth as many components as Moore had predicted. The doubling period had stretched out to an average of 17 months in the decade ending in 1975, then slowed to 22 months through 1985 and 32 months through 1995. It has revived to a now relatively peppy 22 to 24 months in recent years
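The doubling periods quoted above are easier to compare when converted into yearly growth factors. The following is a minimal sketch of that conversion; the function name is illustrative and not part of the original slides.

```python
def annual_growth_factor(doubling_months: float) -> float:
    """Yearly growth factor implied by a given doubling period in months."""
    return 2.0 ** (12.0 / doubling_months)


if __name__ == "__main__":
    # Doubling periods mentioned in the text: 12, 17, 22, 24 and 32 months.
    for months in (12, 17, 22, 24, 32):
        factor = annual_growth_factor(months)
        print(f"doubling every {months} months -> {factor:.2f}x per year")
```

A 12-month doubling period is a factor of two per year, while a 24-month period corresponds to about 1.41x per year.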
Moore's Law 1 (continued)
Similar exponential growth rates have occurred for other aspects of computer technology: disk capacities, memory chip capacities and processor performance. These remarkable growth rates have been the major driving forces of the computer revolution
Technology | Capacity | Speed (latency)
Logic | 2x in 3 years | 2x in 3 years
DRAM | 4x in 3 years | 2x in 10 years
Disk | 4x in 3 years | 2x in 10 years
Moore's Law 1: number of transistors
Moore's Law 1: Linewidths
One of the key drivers behind the industry's ability to double transistor counts every 18 to 24 months is the continuous reduction in linewidths. Shrinking linewidths not only enable more components to fit onto an IC (typically 2x per linewidth generation) but also lower costs (typically 30% per linewidth generation)
Moore's Law 1: Die size
Shrinking linewidths have slowed the rate of growth in die size to 1.14x per year, versus 1.38x to 1.58x per year for transistor counts, and since the mid-nineties accelerating linewidth shrinks have halted and even reversed the growth in die sizes
Moore's Law in Action
The number of transistors on chip doubles annually
Moore's Law 1: Microprocessor
Moore's Law 1: Capacity of Single-Chip DRAM
Improving frequency via pipelining
Process technology and microarchitecture innovations enable the frequency to double every process generation
The figure presents the contribution of both: as the process improves, the frequency increases and the average amount of work done in each pipeline stage decreases
Process Complexity
Shrinking linewidths isn't free. Linewidth shrinks require process modifications to deal with a variety of issues that come up from shrinking the devices, leading to increasing complexity in the processes being used
Moore's Law 2 (Rock's Law)
In 1996 Intel augmented Moore's law (the number of transistors on a processor doubles approximately every 18 months) with Moore's law 2
Law 2 says that as the sophistication of chips increases, the cost of fabrication rises exponentially
The cost of semiconductor tools doubles every four years. By this logic, chip fabrication plants, or fabs, were supposed to cost $5 billion each by the late 1990s and $10 billion by now
Moore's Law 2 (Rock's Law) (continued)
For example, in 1986 Intel manufactured the 386, which counted 250,000 transistors, in fabs costing $200 million. In 1996, a $2 billion facility was needed to produce the Pentium processor, which counted 6 million transistors
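The fab cost example can be checked against the "doubles every four years" rule with a few lines of arithmetic. This is a rough sketch using only the two cost points quoted above; the helper names are illustrative.

```python
import math


def projected_cost(cost_start: float, years: float, doubling_years: float = 4.0) -> float:
    """Cost after `years` if it doubles every `doubling_years`."""
    return cost_start * 2.0 ** (years / doubling_years)


def implied_doubling_years(cost_start: float, cost_end: float, years: float) -> float:
    """Doubling period implied by two cost points `years` apart."""
    return years * math.log(2.0) / math.log(cost_end / cost_start)


if __name__ == "__main__":
    # Cost points from the text: $200 million (1986) and $2 billion (1996).
    rule = projected_cost(200e6, 10)
    implied = implied_doubling_years(200e6, 2e9, 10)
    print(f"4-year doubling from 1986 predicts ${rule / 1e9:.2f} billion in 1996")
    print(f"the quoted costs imply doubling every {implied:.1f} years")
```

Under these two data points the cost grew somewhat faster than the four-year rule alone would predict.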
Moore's Law 2 (Rock's Law)
The cost of semiconductor tools doubles every four years
Machrone's Law
The PC you want to buy will always be $5000
Metcalfe's Law
A network's value grows proportionately to the number of its users squared
Wirth's Law
Software is slowing faster than hardware is accelerating
Performance and new technology generation
According to Moore's law, each new generation has approximately doubled logic circuit density and increased performance by about 40%, while quadrupling memory capacity
The increase in components per chip comes from the following key factors
• A factor of two in component density comes from a 2^0.5 shrink in each lithography dimension (2^0.5 per x and 2^0.5 per y)
• An additional factor of 2^0.5 comes from an increase in chip area
• A final factor of 2^0.5 comes from device and circuit cleverness
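The factors listed above multiply to the quadrupling of memory capacity per generation. A minimal sketch of that product, assuming each garbled factor on the slide is the square root of two:

```python
import math

SQRT2 = math.sqrt(2.0)  # assumed reading of each factor on the slide


def components_per_generation() -> float:
    """Multiply the per-generation factors for components per chip."""
    lithography = SQRT2 * SQRT2   # shrink in x and in y: density doubles
    chip_area = SQRT2             # larger die
    cleverness = SQRT2            # device and circuit improvements
    return lithography * chip_area * cleverness


if __name__ == "__main__":
    print(f"components per chip grow about {components_per_generation():.1f}x per generation")
```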
Development in ICs
Semiconductor Industry Association Roadmap Summary for high-end Processors
Specification / year | 1997 | 1999 | 2001 | 2003 | 2006 | 2009 | 2012
Feature size (micron) | 0.25 | 0.18 | 0.15 | 0.13 | 0.1 | 0.07 | 0.05
Supply voltage (V) | 1.8-2.5 | 1.5-1.8 | 1.2-1.5 | 1.2-1.5 | 0.9-1.2 | 0.6-0.9 | 0.5-0.6
Transistors/chip (millions) | 11 | 21 | 40 | 76 | 200 | 520 | 1400
DRAM bits/chip (mega) | 167 | 1070 | 1700 | 4290 | 17200 | 68700 | 275000
Die size (mm2) | 300 | 340 | 385 | 430 | 520 | 620 | 750
Global clock freq (MHz) | 750 | 1200 | 1400 | 1600 | 2000 | 2500 | 3000
Local clock freq (MHz) | 750 | 1250 | 1500 | 2100 | 3500 | 6000 | 10000
Maximum power/chip (W) | 70 | 90 | 110 | 130 | 160 | 170 | 175
Total transistors/chip
[Figure: number of transistors per chip (millions, 0 to 1,600) versus technology (micron) / year, from 0.25 micron (1997) to 0.05 micron (2012).]
Clock Frequency Versus Year for Various Representative Machines
Limits in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range
For higher-frequency clocking (> 10 GHz), new ideas and new ways of designing digital systems are needed
[Figure: global and local clock frequency (MHz, 0 to 12,000) versus technology (micron) / year, from 0.25 micron (1997) to 0.05 micron (2012).]
Intel's Microprocessors Clock Frequency
Technology Trends - Example
As an illustration of just how computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987, and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3,015 km/h and a fuel economy of 201 km/l
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in about 39 min, and to make the round trip on less than 20 liters of gasoline
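The arithmetic of this example can be reproduced with a short script. The sketch below assumes a mean Earth-Moon distance of 384,400 km, a value that does not appear in the slides, and works with unrounded growth factors, so its results differ slightly from the rounded figures in the solution.

```python
MOON_DISTANCE_KM = 384_400  # assumed mean Earth-Moon distance, not from the slides


def improve(value: float, rate: float, years: int) -> float:
    """Compound a yearly fractional improvement over a number of years."""
    return value * (1.0 + rate) ** years


# 1977 baseline: 150 km/h top speed and 10 km/l fuel economy.
speed_1987 = improve(150.0, 0.35, 10)
economy_1987 = improve(10.0, 0.35, 10)
speed_2000 = improve(speed_1987, 0.50, 13)
economy_2000 = improve(economy_1987, 0.50, 13)

minutes_to_moon = MOON_DISTANCE_KM / speed_2000 * 60.0
round_trip_liters = 2 * MOON_DISTANCE_KM / economy_2000

if __name__ == "__main__":
    print(f"1987: {speed_1987:,.0f} km/h, {economy_1987:,.0f} km/l")
    print(f"2000: {speed_2000:,.0f} km/h, {economy_2000:,.0f} km/l")
    print(f"earth to moon: {minutes_to_moon:.1f} min")
    print(f"round trip: {round_trip_liters:.1f} liters")
```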
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm
Ideally, processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7
With such scaling, typical improvement figures are the following
• 1.4-1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
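Some of these figures follow directly from the ~0.7 linear scaling factor. The sketch below is a first-order estimate only: it assumes that capacitance scales linearly with dimension and that switching energy behaves as C·V², and it reads the garbled slide values as 0.7 and 1.35. None of these modelling assumptions is stated in the slides.

```python
def scale_generation(linear_scale: float = 0.7, voltage_reduction: float = 1.35) -> dict:
    """First-order effect of one process generation (illustrative model only)."""
    area = linear_scale ** 2                  # transistor area shrinks in x and y
    capacitance = linear_scale                # assumption: C scales with dimension
    voltage = 1.0 / voltage_reduction         # lower operating voltage
    energy = capacitance * voltage ** 2       # assumption: switching energy ~ C * V^2
    return {"area": area, "capacitance": capacitance,
            "voltage": voltage, "switching_energy": energy}


if __name__ == "__main__":
    result = scale_generation()
    print(f"area factor: {result['area']:.2f} (about two times smaller)")
    print(f"switching energy factor: {result['switching_energy']:.2f}")
```

Under these assumptions the area halves and the energy per switching event drops by a factor of roughly 2.6, the same order as the "three times" quoted for switching power.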
Process Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and in integration (from ~10K to ~10M devices) in 25 years
Microarchitecture attempts to increase both IPC and frequency
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction and out-of-order execution can increase instructions per cycle (IPC)
Pipelining, as a microarchitecture idea, helps to increase frequency
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages
With frequencies higher than 1 GHz, more than 20 pipeline stages are used
[Figure: relative improvement in frequency, CPI, performance and power versus pipeline depth.]
Performance of memory and CPU
Memory in a computer system is hierarchically organized
In 1980 microprocessors were often designed without caches
Nowadays microprocessors often come with two levels of caches
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed
Relative processor/memory speed
Types of Memories
MOS memories
– RAMs (volatile: contents lost at power-off): DRAM, SRAM
– ROMs (non-volatile: contents kept at power-off): ROM, EPROM, EEPROM, FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 to 4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
An anecdote (continued)
If a CPU chip today needs to move 2 GBytes/s (say 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns
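The pin-bandwidth figures in this anecdote are simple products, sketched below. The function name is illustrative, and 1 GByte is taken as 10^9 bytes.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer: float, period_ns: float) -> float:
    """Bandwidth in GBytes/s; bytes per nanosecond equals 10^9 bytes per second."""
    return bytes_per_transfer / period_ns


if __name__ == "__main__":
    today = bandwidth_gbytes_per_s(16, 8)
    # Twice the clock rate, twice the issue width, two instruction streams.
    future = today * 2 * 2 * 2
    print(f"today: {today:.0f} GBytes/s, future chip: {future:.0f} GBytes/s")
    print(f"32 bytes every 2 ns: {bandwidth_gbytes_per_s(32, 2):.0f} GBytes/s")
```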
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
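The effect of the two cache levels can be estimated with the usual average memory access time formula. This is a sketch under stated assumptions: the formula itself is not given in the slides, and the example values (2-cycle L1, 8-cycle L2, 100-cycle main memory) are picked from the ranges quoted in the text.

```python
def average_access_cycles(l1_cycles: float, l1_miss_rate: float,
                          l2_cycles: float, l2_miss_rate: float,
                          memory_cycles: float) -> float:
    """Average access time: L1 time plus the weighted cost of going further out."""
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * memory_cycles)


if __name__ == "__main__":
    # L1 catches 95% of accesses; L2 catches 50% of the L1 misses.
    with_l2 = average_access_cycles(2, 0.05, 8, 0.50, 100)
    # Hypothetical comparison: the same L1 with no second-level cache.
    without_l2 = 2 + 0.05 * 100
    print(f"with L2: {with_l2:.1f} cycles, without L2: {without_l2:.1f} cycles")
```

Even a second level that catches only half of the first-level misses noticeably reduces the average cost per access in this model.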
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
As a conclusion concerning memory: problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems: caches, buses, bandwidth and latency
As a conclusion concerning memory: problems (continued)
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture-level and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System-level parameters that most affect performance
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM), a new type of memory
Magnetic RAM – MRAM, or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm – just a few thousand atoms – in diameter on a silicon wafer
MRAM is nonvolatile; it has the fast read-and-write performance of static RAM (SRAM)
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM)
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule (nm, 0 to 120) and density (Gb, log scale 0.1 to 100) of FLASH and DRAM versus year, 2003-2010; design rules of 100, 90, 70 and 55 nm; densities from 512 Mb to 16 Gb; FLASH shown as SLC, MLC (2 bits/cell) and MLC (3 bits/cell).]
Surpassing the Prediction from Moore's Law – DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates the doubling of NAND Flash memory density every year
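The difference between the two doubling periods compounds quickly. The sketch below reads the Moore's Law period as 1.5 years (an assumption about the slide's decimal point) and starts both curves from a hypothetical 1 Gb device.

```python
def density_after(start_gb: float, years: float, doubling_years: float) -> float:
    """Density after `years` when it doubles every `doubling_years`."""
    return start_gb * 2.0 ** (years / doubling_years)


if __name__ == "__main__":
    # Hypothetical 1 Gb starting point, followed for ten years.
    moore = density_after(1.0, 10, 1.5)
    flash = density_after(1.0, 10, 1.0)
    print(f"doubling every 1.5 years: {moore:.0f} Gb after 10 years")
    print(f"doubling every year: {flash:.0f} Gb after 10 years")
```

Over a decade, yearly doubling ends up about ten times ahead of doubling every 1.5 years.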
[Figure: density (Gb, log scale 0.1 to 100) versus year, 2000-2010, for FLASH (2-fold density per year; MLC with 2 bits/cell and 3 bits/cell) and DRAM, compared with Moore's Law; densities from 512 Mb to 12 Gb.]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap (continued)
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: mainframe, supercomputer]
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point?
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60 x 10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 x 10^6 MWh
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers" (Kuroda-Sakurai 95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced"
Power dissipation in time
[Figure: power (W, log scale 0.01 to 100) versus year, 1980-1995; trend of x4 every 3 years.]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage (V, 0 to 2.5), power per chip (W, 0 to 200) and VDD current (A, 0 to 500) versus year, 1998-2014.]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA) and the Taiwan Semiconductor Industry Association (TSIA) (taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = P_1 + P_2 + P_3 + P_4 = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4$$
where
P1 – capacitive switching power (dynamic, dominant)
P2 – short-circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
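The terms of the average power expression can be evaluated separately to see which one dominates. The sketch below is illustrative only: the class name and every numeric value are hypothetical and were chosen to give round results, not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class CmosPower:
    """Terms of the average CMOS power expression (all values hypothetical)."""
    p_t: float             # switching activity (probability of a transition)
    c_load: float          # load capacitance C_L in farads
    v_dd: float            # supply voltage in volts
    f_clk: float           # clock frequency in hertz
    i_sc: float            # short-circuit current in amperes
    i_leakage: float       # leakage current in amperes
    p_static: float = 0.0  # P4, other static dissipation in watts

    def switching(self) -> float:      # P1
        return self.p_t * self.c_load * self.v_dd ** 2 * self.f_clk

    def short_circuit(self) -> float:  # P2
        return self.i_sc * self.v_dd

    def leakage(self) -> float:        # P3
        return self.i_leakage * self.v_dd

    def total(self) -> float:
        return self.switching() + self.short_circuit() + self.leakage() + self.p_static


if __name__ == "__main__":
    chip = CmosPower(p_t=0.1, c_load=10e-9, v_dd=1.8, f_clk=1e9,
                     i_sc=0.1, i_leakage=0.05)
    print(f"P1={chip.switching():.2f} W, P2={chip.short_circuit():.2f} W, "
          f"P3={chip.leakage():.2f} W, total={chip.total():.2f} W")
```

With these made-up values the switching term is by far the largest, in line with the description of P1 as the dominant component.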
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
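As a small worked illustration of the quadratic dependence, consider lowering the supply from 5.0 V to 3.0 V. The two voltages are picked here purely as an example, and the other factors of the switching-power expression are assumed to stay fixed:

$$\frac{P_{sw}(3.0\,\mathrm{V})}{P_{sw}(5.0\,\mathrm{V})} = \left(\frac{3.0}{5.0}\right)^2 = 0.36$$

That is, switching power drops by roughly 64% at unchanged $p_t$, $C_L$ and $f_{CLK}$.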
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V), from 0.6 V to 5.0 V]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is to decrease the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product pt·CL is called the average switched capacitance per cycle. It can be reduced at the system, architectural, RTL, circuit or technology level.
Low Power and Low Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology (higher levels offer higher impact and more options):
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL generating fCLK feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), each producing its own local clock frequencies (f11, f12, f21, f31, f32, f41)]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based scheme (phase detector with up/down outputs, current pump, loop filter, VCO, divider by N, CLKREF and CLKFB inputs, regulated voltage, clock distribution & frequency multiplier, logic of the digital system) and DLL-based scheme (phase detector, current pump, loop filter, control, voltage-controlled delay line (VCDL) built from delay cells DC1 to DCn, producing f1, 2f1, ..., nf1 for the digital system)]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating (an enable signal, held in a latch, is combined with the clock to form the gated clock that drives the target flip-flops, which are thereby activated or deactivated) and clock distribution (a PLL clock generator drives blocks A, B, C and D, with Enable_A, Enable_B and Enable_C gating the clock to individual blocks)]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
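The effect of gating can be sketched with a small behavioural model. The following Python snippet is only an illustration under stated assumptions: it models a latch-based gate (enable captured while the clock is low, gated clock = clock AND latched enable), and the function names and sample waveforms are hypothetical.

```python
def gated_clock(clk_samples, enable_samples):
    """Latch-based clock gating on sampled waveforms (lists of 0/1).

    The enable is captured while clk is low (transparent latch), and the
    gated clock is clk AND the latched enable, so the gated clock never
    glitches in the middle of a high phase.
    """
    latched = 0
    out = []
    for clk, en in zip(clk_samples, enable_samples):
        if clk == 0:              # latch is transparent during the low phase
            latched = en
        out.append(clk & latched)
    return out


def rising_edges(samples):
    """Count 0 -> 1 transitions, i.e. clock edges seen by the flip-flops."""
    return sum(1 for prev, cur in zip(samples, samples[1:])
               if prev == 0 and cur == 1)


# Eight clock cycles; the module is needed only in the first two and last two.
clk = [0, 1] * 8
enable = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
gclk = gated_clock(clk, enable)

print("edges on free-running clock:", rising_edges(clk))
print("edges on gated clock:       ", rising_edges(gclk))
```

In this example the gated module sees half of the clock edges, so the flip-flops and clock wiring inside it switch correspondingly less often.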
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip, as a less aggressive approach, is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
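[Figure, left: Power Manager. An observer and a controller use workload information; the power manager receives observations from the system and issues commands to it.]
[Figure, right: Power State Machine with states RUN (P = 400 mW), IDLE (P = 50 mW, waits for interrupt) and SLEEP (P = 0.16 mW, waits for wake-up event); transition times of about 10 µs, about 90 µs and 160 ms are marked on the arcs]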
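A minimal sketch of why such a state machine saves energy, using the three state powers shown on the slide. The schedules below are hypothetical, and transition latencies and transition energy are deliberately ignored, which is a simplification.

```python
# State powers in milliwatts, as given for the power state machine.
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def energy_mj(schedule):
    """Energy in millijoules for a list of (state, seconds) intervals.

    Transition costs between states are not modelled here.
    """
    return sum(POWER_MW[state] * seconds for state, seconds in schedule)


# Hypothetical 10-second window: always running versus power-managed.
always_on = [("RUN", 10.0)]
managed = [("RUN", 2.0), ("IDLE", 3.0), ("SLEEP", 5.0)]

print(f"always on: {energy_mj(always_on):.1f} mJ")
print(f"managed:   {energy_mj(managed):.1f} mJ")
```

A real power manager must also weigh the wake-up delay of each state against the expected idle time, which this sketch leaves out.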
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown by block: clock, memory, control, I/O, datapath (extracted slice values: 12, 16, 25, 43). Dynamic instruction statistics by class: compare op, logical op, others, data move, control flow, arithmetic op (extracted slice values: 1513, 51, 23, 43). The pairing of values with labels is not recoverable.]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline - Microprocessors' Generations
• First generation, 1971-78: behind the power curve
• Second generation, 1979-85: becoming "real" computers
• Third generation, 1985-89: challenging the "establishment"
• Fourth generation, 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation, 1971-78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose, single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3500 transistors
  - First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
  - Delivered in 1974
  - 4800 transistors
  - Performance < 0.2 MIPS
• Used in Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one (Homebrew Computer Club)
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
  - Based on the 8088, the 8-bit bus version of the 8086
Second Generation, 1979-85
Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture, initial 16-bit implementation
  - First flat 32-bit address; support for paging
  - General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation, 1985-89
Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation, 1990-
Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985 to present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors
• Architectural ideas turn transistors into performance
  - Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
  - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
    early restart after cache miss (1992)
    nonblocking caches: continue during a cache miss (1994)
  - Cache-aware combos: computers, compilers, code writers
    prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock
    superscalar: uses dynamic issue decision (HW driven)
    VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  - determine parallelism dynamically
  - execute instructions out-of-order
  - speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  - On-chip
  - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out-of-order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in flight)
  - Maintains precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Uses compiler-centric approach while avoiding disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques
  - Static scheduling
  - Static issue (i.e. VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
  Lower hardware complexity; more, longer-range analysis; more machine dependence
• Hardware techniques
  - Dynamic scheduling
  - Dynamic issue (i.e. superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
  More stable performance; higher complexity; potential clock rate impact
Wide variety of approaches, both hardware and compiler intensive
No clear-cut winners at present
Big Picture - ILP and Memory Systems
[Figure: "ILP Mountain" (simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation) and "Cache Mountain" (simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching)]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970 to 2005, spanning the eras of bit-level, instruction-level and thread-level parallelism; processors plotted: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
• A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
• It is hard to achieve this level of parallelism from a single program
• Can we run multiple programs (threads) on a (single) processor without much effort?
• Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
• Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
• Support for 2-4 threads, but expect to get only 1.3X improvement in throughput
Chip Multiprocessor
• Several processor cores on one die
• Shared L2 caches
• Chip communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
• CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner
• The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
• The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model (continued)
• CMP is an attractive option to use when moving to a new process technology such as SoC
• Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
• MPSoCs are usually implemented as heterogeneous systems
• CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
• 4-8 general-purpose processing engines on chip, used to execute independent programs
• Explicitly parallel programs (when possible)
• Speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8 x 4096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now: some examples
• Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
• Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
• Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental.
• Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching,
• distance learning,
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Outline - Technology Trends
• Moore's Law 1
• Moore's Law 2
• Performance and New Technology Generation
• Technology Trends - Example
• Trends in Future
• Processor Technology
• Memory Technology
Moore's Law 1
In 1965 Gordon Moore, director of research and development at Fairchild Semiconductor and later founder of Intel Corp., wrote a paper for Electronics entitled "Cramming more components onto integrated circuits". In the paper Moore observed that "The complexity for minimum component cost has increased at a rate of roughly a factor of two per year".
This observation became known as Moore's law.
In fact, by 1975 the leading chips had maybe one-tenth as many components as Moore had predicted. The doubling period had stretched out to an average of 17 months in the decade ending in 1975, then slowed to 22 months through 1985 and 32 months through 1995. It has revived to a now relatively peppy 22 to 24 months in recent years.
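To compare the doubling periods quoted above on a common scale, they can be converted into a growth factor per year. The helper name `annual_growth_factor` is an illustrative choice; the conversion itself is just 2^(12 / months).

```python
def annual_growth_factor(doubling_months):
    """Growth factor per year implied by a doubling period given in months."""
    return 2 ** (12 / doubling_months)


# Moore's original one-year figure plus the doubling periods from the text.
for months in (12, 17, 22, 24, 32):
    factor = annual_growth_factor(months)
    print(f"doubling every {months:2d} months -> x{factor:.2f} per year")
```

A longer doubling period therefore still means exponential growth, only with a smaller yearly factor.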
Moore's Law 1 (continued)
Similar exponential growth rates have occurred for other aspects of computer technology: disk capacities, memory chip capacities and processor performance. These remarkable growth rates have been the major driving forces of the computer revolution.
Technology | Capacity | Speed (latency)
Logic | 2x in 3 years | 2x in 3 years
DRAM | 4x in 3 years | 2x in 10 years
Disk | 4x in 3 years | 2x in 10 years
Moore's Law 1 - number of transistors
Moore's Law 1 - Linewidths
One of the key drivers behind the industry's ability to double transistor counts every 18 to 24 months is the continuous reduction in linewidths. Shrinking linewidths not only enable more components to fit onto an IC (typically 2x per linewidth generation) but also lower costs (typically 30% per linewidth generation).
Moore's Law 1 - Die size
Shrinking linewidths have slowed the rate of growth in die size to 1.14x per year, versus 1.38 to 1.58x per year for transistor counts, and since the mid-nineties accelerating linewidth shrinks have halted and even reversed the growth in die sizes.
Moore's Law in Action
The number of transistors on chip doubles annually
Moore's Law 1 - Microprocessor
Moore's Law 1 - Capacity of Single-Chip DRAM
Improving frequency via pipelining
Process technology and microarchitecture innovations enable the frequency to double every process generation.
The figure presents the contribution of both: as the process improves, the frequency increases and the average amount of work done in each pipeline stage decreases.
Process Complexity
Shrinking linewidths isn't free. Linewidth shrinks require process modifications to deal with a variety of issues that come up from shrinking the devices, leading to increasing complexity in the processes being used.
Moore's Law 2 (Rock's Law)
In 1996 Intel augmented Moore's law (the number of transistors on a processor doubles approximately every 18 months) with Moore's law 2.
Law 2 says that as the sophistication of chips increases, the cost of fabrication rises exponentially.
The cost of semiconductor tools doubles every four years. By this logic, chip fabrication plants, or fabs, were supposed to cost $5 billion each by the late 1990s and $10 billion by now.
Moore's Law 2 (Rock's Law) (continued)
For example, in 1986 Intel manufactured the 386, which counted 250,000 transistors, in fabs costing $200 million. In 1996 a $2 billion facility was needed to produce the Pentium processor, which counted 6 million transistors.
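The two fab-cost data points above can be turned into an implied doubling period. This is a minimal sketch assuming smooth exponential growth between 1986 and 1996; the function name is hypothetical.

```python
import math


def doubling_period_years(cost_start, cost_end, years):
    """Doubling period implied by exponential growth between two cost points."""
    return years * math.log(2) / math.log(cost_end / cost_start)


# $200 million (1986) to $2 billion (1996), the figures quoted in the example.
period = doubling_period_years(200e6, 2e9, 1996 - 1986)
print(f"implied doubling period: {period:.2f} years")
```

Under that assumption the example corresponds to a doubling roughly every three years, somewhat faster than the four-year rule of thumb.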
Moore's Law 2 (Rock's Law): The Cost of Semiconductor Tools Doubles Every Four Years
Machrone's Law
The PC you want to buy will always be $5000
Metcalfe's Law
A network's value grows proportionately to the number of its users squared
Wirth's Law
Software is slowing faster than hardware is accelerating
Performance and new technology generation
According to Moore's law, each new generation has approximately doubled logic circuit density and increased performance by about 40%, while quadrupling memory capacity.
The increase in components per chip comes from the following key factors:
• The factor of two in component density comes from a 2^0.5 shrink in each lithography dimension (2^0.5 in x and 2^0.5 in y).
• An additional factor of 2^0.5 comes from an increase in chip area.
• A final factor of 2^0.5 comes from device and circuit cleverness.
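Multiplying the factors listed above reproduces the per-generation figures quoted earlier. This is a consistency check that uses only the numbers already given:

$$\underbrace{2^{0.5} \times 2^{0.5}}_{\text{lithography: } 2\times \text{ density}} \times \underbrace{2^{0.5}}_{\text{chip area}} \times \underbrace{2^{0.5}}_{\text{cleverness}} = 2 \times 2 = 4$$

So the lithography shrink alone accounts for the doubling of logic density, and all four factors together account for the quadrupling of memory capacity.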
Development in ICs
Semiconductor Industry Association Roadmap Summary for high-end Processors
Specification / year | 1997 | 1999 | 2001 | 2003 | 2006 | 2009 | 2012
Feature size (micron) | 0.25 | 0.18 | 0.15 | 0.13 | 0.10 | 0.07 | 0.05
Supply voltage (V) | 1.8-2.5 | 1.5-1.8 | 1.2-1.5 | 1.2-1.5 | 0.9-1.2 | 0.6-0.9 | 0.5-0.6
Transistors/chip (millions) | 11 | 21 | 40 | 76 | 200 | 520 | 1400
DRAM bits/chip (mega) | 167 | 1070 | 1700 | 4290 | 17200 | 68700 | 275000
Die size (mm2) | 300 | 340 | 385 | 430 | 520 | 620 | 750
Global clock freq (MHz) | 750 | 1200 | 1400 | 1600 | 2000 | 2500 | 3000
Local clock freq (MHz) | 750 | 1250 | 1500 | 2100 | 3500 | 6000 | 10000
Maximum power/chip (W) | 70 | 90 | 110 | 130 | 160 | 170 | 175
Total transistors/chip
[Figure: number of transistors per chip (millions, 0 to 1600) versus technology generation and year, from 0.25 micron in 1997 to 0.05 micron in 2012]
Clock Frequency Versus Year for Various Representative Machines
Limits in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range.
For higher-frequency clocking (> 10 GHz), new ideas and new ways of designing digital systems are needed.
[Figure: global and local clock frequency (MHz, 0 to 12,000) versus technology generation and year, from 0.25 micron in 1997 to 0.05 micron in 2012]
Intel's Microprocessors Clock Frequency
Technology Trends - Example
As an illustration of just how quickly computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly.
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987 and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l.
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in about 39 min, and to make the one-way trip on less than 10 liters of gasoline.
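The arithmetic of this example can be checked with a few lines of Python. The helper `improve` and the Earth-Moon distance of 384,400 km (the commonly quoted average) are assumptions introduced here. The code keeps full precision, so its results differ slightly from the slide's figures, which round the growth factors first.

```python
def improve(value, yearly_rate, years):
    """Compound a value at a fixed yearly improvement rate."""
    return value * (1 + yearly_rate) ** years


# 1977 baseline from the example: 150 km/h and 10 km/l.
speed_1987 = improve(150.0, 0.35, 1987 - 1977)
economy_1987 = improve(10.0, 0.35, 1987 - 1977)
speed_2000 = improve(speed_1987, 0.50, 2000 - 1987)
economy_2000 = improve(economy_1987, 0.50, 2000 - 1987)

EARTH_MOON_KM = 384_400  # assumed average distance, not from the slides
minutes_to_moon = EARTH_MOON_KM / speed_2000 * 60
litres_one_way = EARTH_MOON_KM / economy_2000

print(f"1987: {speed_1987:.0f} km/h, {economy_1987:.0f} km/l")
print(f"2000: {speed_2000:.0f} km/h, {economy_2000:.0f} km/l")
print(f"Earth to Moon: {minutes_to_moon:.1f} min, {litres_one_way:.1f} l one way")
```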
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law.
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available.
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm.
Ideally, processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7.
With such scaling, typical improvement figures are the following:
• 1.4 - 1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
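The "two times smaller" and "three times lower switching power" figures can be related to the ~0.7 linear scale factor. The following is a rough consistency check. It assumes, which the slides do not state, that area scales with the square of the linear dimension and that switched capacitance scales roughly linearly with it:

$$\text{area: } 0.7^2 \approx 0.49 \approx \tfrac{1}{2}, \qquad \text{switching energy } C V^2\text{: } 0.7 \times \left(\tfrac{1}{1.35}\right)^2 \approx 0.38$$

Under these assumptions the energy per switching event falls by a factor of about 2.6, in the neighbourhood of the quoted threefold reduction.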
Processor Technology and Processor Technology and MicroprocessorsMicroprocessors
Process technology is the most important technology that drives the microprocessor industry
It is characterized by growing 1000 times in frequency (from 1MHz to 1GHz) and integration (from ~10K to 1M devices) in 25 years
Microarchitecture attempts to increase both IPC and frequency
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction and out-of-order execution can increase instructions per cycle (IPC)
Pipelining, as a microarchitecture idea, helps to increase frequency
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program
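The three levers named here (IPC, frequency and dynamic instruction count) combine in the usual execution-time relation, time = instructions / (IPC × frequency). A minimal sketch follows; the numbers are hypothetical and chosen only to show how the factors multiply.

```python
def execution_time(instructions, ipc, frequency_hz):
    """Execution time in seconds: instruction count / (IPC * clock frequency)."""
    return instructions / (ipc * frequency_hz)


# Hypothetical baseline: 1e9 dynamic instructions, IPC = 1.0, 1 GHz clock.
baseline = execution_time(1e9, 1.0, 1e9)
# Hypothetical improved design: 10% fewer instructions, IPC = 1.5, 2 GHz clock.
improved = execution_time(0.9e9, 1.5, 2e9)
speedup = baseline / improved

if __name__ == "__main__":
    print(f"baseline {baseline:.3f} s, improved {improved:.3f} s, speedup {speedup:.2f}x")
```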
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages
With frequencies higher than 1 GHz, more than 20 pipeline stages are used
[Figure: relative improvement versus pipeline depth; curves for frequency, CPI, performance and power]
Performance of memory and CPU
Memory in a computer system is hierarchically organized
In 1980 microprocessors were often designed without caches
Nowadays microprocessors often come with two levels of caches
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987
Memory technology improvements aim primarily at increasing DRAM capacity not DRAM speed
Relative processor/memory speed
Type of Memories
MOS memories
• RAMs (volatile: contents are lost at power-off): DRAM, SRAM
• ROMs (non-volatile: contents are kept at power-off): ROM, EPROM, EEPROM, FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 to 4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
An anecdote (continued)
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns
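The bandwidth figures in this anecdote follow from simple arithmetic, sketched below. The function name is illustrative; the byte counts, periods and the three factors of two are the ones given in the text.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer, period_ns):
    """Bytes per nanosecond is numerically equal to GBytes per second."""
    return bytes_per_transfer / period_ns


today = bandwidth_gbytes_per_s(16, 8)
# Twice the clock rate, twice the issue width, two instruction streams.
future = today * 2 * 2 * 2
# The same requirement expressed as 32 bytes every 2 ns.
future_as_transfers = bandwidth_gbytes_per_s(32, 2)

if __name__ == "__main__":
    print(f"today: {today} GBytes/s, future: {future} GBytes/s")
```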
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger but slower and cheaper cache may be located between the microprocessor and the system bus which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~32 to 128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
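To see what these hit rates and latencies mean in practice, the sketch below computes an average access time for a two-level cache in front of main memory. The formula is the standard average-memory-access-time relation, not something stated on the slides, and the chosen operating point (3-cycle L1, 10-cycle L2, 100-cycle memory) takes the upper ends of the quoted ranges as an assumption.

```python
def average_access_cycles(l1_cycles, l1_hit, l2_cycles, l2_hit, memory_cycles):
    """Average cycles per access for an L1/L2/main-memory hierarchy.

    Every access pays the L1 latency; L1 misses also pay the L2 latency;
    L2 misses additionally pay the main-memory latency.
    """
    l1_miss = 1.0 - l1_hit
    l2_miss = 1.0 - l2_hit
    return l1_cycles + l1_miss * (l2_cycles + l2_miss * memory_cycles)


# Assumed operating point taken from the ranges quoted above.
with_caches = average_access_cycles(3, 0.95, 10, 0.50, 100)

if __name__ == "__main__":
    print(f"average access time: {with_caches:.1f} cycles (vs. 100 without caches)")
```

With these assumed values the average access costs a handful of cycles instead of about 100, which is why the structure of the hierarchy matters so much.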
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
Conclusion concerning memory problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems: caches, buses, bandwidth and latency
Conclusion concerning memory problems (continued)
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System-level parameters affect performance most:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM), a new type of memory
Magnetic RAM (MRAM) or Nanotech RAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
The 10 Gbit devices consist of 10 billion carbon nanotubes, each 1 nm (just a few thousand atoms) in diameter, on a silicon wafer
MRAM is nonvolatile and has the fast read-and-write performance of static RAM (SRAM)
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] and density [Gb] of DRAM and FLASH (SLC, MLC 2 bits/cell, MLC 3 bits/cell) for 2003, 2005 and 2010]
Surpassing the Prediction from Moore's Law: DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year
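The difference between the two doubling periods compounds quickly. The sketch below compares them over a ten-year horizon; the horizon and the function name are assumptions made only for illustration.

```python
def density_growth(years, doubling_period_years):
    """Growth factor after `years` when density doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


moore_10y = density_growth(10, 1.5)   # doubling every 1.5 years
flash_10y = density_growth(10, 1.0)   # doubling every year

if __name__ == "__main__":
    print(f"10 years at 1.5-year doubling: about {moore_10y:.0f}x")
    print(f"10 years at 1-year doubling:   about {flash_10y:.0f}x")
```

Over ten years the yearly doubling gives roughly ten times more density than the 1.5-year doubling.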
[Figure: density [Gb] versus year (2000 to 2010) for FLASH (two-fold density per year; MLC 2 bits/cell and MLC 3 bits/cell) and DRAM, compared with Moore's Law]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap (continued)
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figures: computer food chain diagrams; recoverable labels: Mainframe, Supercomputer]
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point?
Power consumption
• During 1995 the energy consumption of all PCs installed in the USA was $60 \times 10^6$ MWh
• During 2000 the energy consumption of all PCs installed in the USA was 10% of the total energy production
• For 2015 one expects that the energy consumption of all PCs will be 15% greater than in 1995, or $69 \times 10^6$ MWh
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers" (Kuroda-Sakurai 95)
"By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced"
Power dissipation in time
[Figure: power (W, log scale from 0.01 to 100) versus year (1980 to 1995); trend of x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage [V], power per chip [W] and VDD current [A] versus year, 1998 to 2014]
International Technology Roadmap for Semiconductors 1999 update sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA) Electronic Industries Association of Japan (EIAJ) Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: Your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
$P_1$ is the capacitive switching power (dynamic, dominant)
$P_2$ is the short-circuit power (dynamic)
$P_3$ is the leakage current power (static)
$P_4$ is the static power dissipation (minor)
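The power components listed above can be evaluated numerically as in the following sketch. All parameter values are hypothetical and chosen only to show the relative sizes of the terms; the function and argument names are illustrative.

```python
def power_components(p_t, c_load, v_dd, f_clk, i_sc, i_leakage, p_static=0.0):
    """Return (P1, P2, P3, P4) in watts for a digital CMOS circuit.

    P1 = p_t * C_L * Vdd^2 * f_CLK   capacitive switching power
    P2 = I_sc * Vdd                  short-circuit power
    P3 = I_leakage * Vdd             leakage power
    P4 = p_static                    other static dissipation (minor)
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk
    p2 = i_sc * v_dd
    p3 = i_leakage * v_dd
    return p1, p2, p3, p_static


# Hypothetical chip: activity 0.1, 1 nF switched capacitance, 1.8 V, 1 GHz,
# 10 mA short-circuit current, 5 mA leakage current.
p1, p2, p3, p4 = power_components(0.1, 1e-9, 1.8, 1e9, 0.010, 0.005)
p_avg = p1 + p2 + p3 + p4

if __name__ == "__main__":
    print(f"P1={p1:.3f} W, P2={p2:.3f} W, P3={p3:.3f} W, P4={p4:.3f} W")
    print(f"P_avg={p_avg:.3f} W")
```

With these assumed values the switching term dominates, in line with its label as the dominant component.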
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
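The quadratic effect of the supply voltage can be illustrated with a short worked example. The two voltages (3.3 V and 1.8 V) are assumed values chosen for illustration, and the switching activity, load capacitance and clock frequency are held fixed:

```latex
\[
P_{sw} = p_t\, C_L\, V_{dd}^2\, f_{CLK}
\quad\Rightarrow\quad
\frac{P_{sw}(1.8\,\mathrm{V})}{P_{sw}(3.3\,\mathrm{V})}
= \left(\frac{1.8}{3.3}\right)^{2} \approx 0.30
\]
```

Under these assumptions, lowering the supply by a factor of about 1.8 cuts the switching power to roughly 30% of its original value.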
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], 0.6 V to 5.0 V]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is
Decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low-activity conditions for various parts of the circuit and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
General Approaches to Reduce Power
a) Reduction in $f_{CLK}$ is an acceptable option when some components may be idle or low-active during operation
b) Reduction in $V_{dd}$ is the most effective way to reduce power, since the power is proportional to the square of $V_{dd}$. The problem with reducing $V_{dd}$ is that it leads to an increase in circuit delay
c) The product $p_t C_L$ is called the average switched capacitance per cycle; this capacitance can be reduced at the system, architectural, RTL, circuit or technology level
Low Power and Low Energy System Design
Design levels and techniques (higher levels offer higher impact and more options):
• System Level: design partitioning, power down
• Algorithm Level: complexity, concurrency, locality, regularity, data representation
• Architecture Level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit Level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/Device Level: threshold reduction, multi-threshold
The design of low power circuits can be tackled at different levels from system to technology
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a main PLL generates fCLK and drives four local clock generators (PLL 1/DLL 1 to PLL 4/DLL 4), each producing its own local frequencies]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation (phase detector, current pump, loop filter, VCO and divider by N, with CLKREF and CLKFB inputs and up/down signals; regulated voltage; clock distribution & frequency multiplier feeding the digital system) and DLL-based clock generation (phase detector, current pump, loop filter and a VCDL with delay cells DC1 to DCn producing f1, 2f1, ..., nf1)]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating (latch, clock and enable inputs, gated clock driving the target flip-flops) and clock distribution (PLL clock generator distributing Clk to blocks A, B, C and D with Enable_A, Enable_B and Enable_C; blocks shown as activated or deactivated)]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
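A very simplified behavioural model of clock gating is sketched below. It assumes the gated clock is the clock ANDed with an enable signal that is held stable for a whole cycle (the role of the latch in a gating cell), and it counts how many clock pulses actually reach the module. The enable trace and function names are hypothetical.

```python
def delivered_clock_pulses(enable_per_cycle):
    """Count clock pulses that reach a module whose clock is gated.

    Models gated_clock = clock AND enable, with enable sampled once per
    cycle and held stable for that cycle.
    """
    return sum(1 for enable in enable_per_cycle if enable)


def clock_activity_saved(enable_per_cycle):
    """Fraction of clock pulses suppressed by gating (0.0 = none, 1.0 = all)."""
    cycles = len(enable_per_cycle)
    if cycles == 0:
        raise ValueError("need at least one cycle")
    return 1.0 - delivered_clock_pulses(enable_per_cycle) / cycles


# Hypothetical trace: the module is needed in 3 of 8 cycles.
trace = [1, 1, 0, 0, 0, 1, 0, 0]

if __name__ == "__main__":
    print(f"pulses delivered: {delivered_clock_pulses(trace)} of {len(trace)}")
    print(f"clock activity saved: {clock_activity_saved(trace):.1%}")
```

Since clock switching power is proportional to switching activity, the suppressed fraction gives a first-order estimate of the clock energy saved in that module.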
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip is a less aggressive approach that is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
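[Figure: power manager (an observer and a controller that use workload information, receive observations from the system and issue commands to it) and power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt) and SLEEP (P = 0.16 mW, wait for wake-up event); transition times of ~10 µs, ~90 µs and 160 ms]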
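The benefit of such a power state machine can be estimated with a small sketch that averages the RUN, IDLE and SLEEP powers given for the three states over the fraction of time spent in each. The time fractions are hypothetical, and transition times and transition energy are ignored for simplicity.

```python
# State powers in mW, as given for the RUN, IDLE and SLEEP states.
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(time_fraction):
    """Average power for a mapping of state name to fraction of time spent there."""
    if abs(sum(time_fraction.values()) - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(STATE_POWER_MW[state] * share for state, share in time_fraction.items())


# Hypothetical workload: busy 10% of the time, idle 20%, asleep 70%.
managed = average_power_mw({"RUN": 0.1, "IDLE": 0.2, "SLEEP": 0.7})
always_on = average_power_mw({"RUN": 1.0})

if __name__ == "__main__":
    print(f"managed: {managed:.1f} mW, always running: {always_on:.1f} mW")
```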
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figures: power breakdown among clock, memory, control, I/O and datapath; dynamic instruction statistics for compare, logical, data move, control flow, arithmetic and other operations]
Architecture Trade-offs: Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline: Microprocessors' Generations
• First generation, 1971-78: behind the power curve
• Second generation, 1979-85: becoming "real" computers
• Third generation, 1985-89: challenging the "establishment"
• Fourth generation, 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  – Narrow datapaths (= slow performance)
  – Awkward architectures
  – Assembly language + some BASIC
• Processors
  – Intel 4004, 8008, 8080, 8086
  – Zilog Z-80
  – Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  – 3500 transistors
  – First microprocessor-based computer (Micral)
    • Targeted at laboratory instrumentation
    • Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
  – Delivered in 1974
  – 4800 transistors
  – Performance < 0.2 MIPS
• Used in Altair 8800 system
  – Kit form (advertised in Popular Electronics) in 1975
    • $297, or $395 with case
    • 256 bytes of memory, expandable to 64K
    • Keyboard and floppy
    • 100-line bus becomes S-100, first microcomputer bus
  – Gates & Allen write BASIC
  – Wozniak builds one
    • Homebrew Computer Club
Intel 8086
• Introduced in 1978
  – Performance < 0.5 MIPS
• New 16-bit architecture
  – "Assembly language" compatible with 8080
  – 29,000 transistors
  – Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces PC
  – Based on 8088, the 8-bit bus version of 8086
Second Generation 1979-85
• Becoming "real" computers
  – First 32-bit architecture (68000)
  – First virtual memory support
  – Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  – Motorola 68000, 68020
  – Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  – First 32-bit architecture
    • initial 16-bit implementation
  – First flat 32-bit address
    • Support for paging
  – General-purpose register architecture
    • Loosely based on PDP-11
• First implementation in 1979
  – 68,000 transistors
  – < 1 MIPS
• Used in
  – Apple Mac
  – Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the "establishment"
  – Microprocessors surpass minicomputers in performance, rival mainframes
  – Implementation technology of choice: all new architectures are microprocessors
  – RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  – MIPS R2000, R3000
  – Sun SPARC
  – HP PA-RISC
MIPS R2000
• Several firsts
  – First RISC microprocessor
  – First microprocessor to provide integrated support for instruction & data cache
  – First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  – 125,000 transistors
  – 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
  – First 64-bit architecture
  – First multiple-issue machine
  – First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  – Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  – Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  – True from 1985-present
• Combination of technology and architectural enhancements
  – Technology provides faster transistors and more of them
  – Faster transistors lead to high clock rates
  – More transistors
• Architectural ideas turn transistors into performance
  – Responsible for about half the yearly performance growth
• Two key architectural directions
  – Sophisticated memory hierarchies
  – Exploiting instruction level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
  – CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  – On-chip: from 128 bytes (1984) to 100K+ bytes
  – Multilevel caches add another level of caching
    • First multilevel cache: 1986
    • Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  – Reduce or hide cache miss latencies
    • early restart after cache miss (1992)
    • nonblocking caches: continue during a cache miss (1994)
  – Cache-aware combos: computers, compilers, code writers
    • prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  – Overlapping execution in a pipeline
  – Issuing multiple instructions per clock
    • superscalar: uses dynamic issue decision (HW driven)
    • VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple issue microprocessors
• 1995: sophisticated dynamic schemes
  – determine parallelism dynamically
  – execute instructions out-of-order
  – speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  – On-chip
  – Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  – Deep pipeline
  – 1.4M transistors
  – Initially 100 MHz
  – > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  – 2 instructions/clock
  – Dual issue mode
  – Novel push pipeline
  – Novel cache bypass
• Implemented in 1991
  – 1.3M transistors
  – 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
  – Instructions scheduled and executed out-of-order
  – Up to 4 instructions can complete per clock
  – Window of 32 instructions (up to 32 in-flight)
  – Maintain precise state by completing instructions in order
• Implemented in 1996
  – 6.8M transistors
  – 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  – Use compiler-centric approach while avoiding disadvantages
  – Parallelism demarcated by the compiler
  – Many special instructions & features for exploiting ILP in the compiler
• Itanium
  – First implementation (2001)
  – 25M transistors
  – 800 MHz
  – 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software Techniques (lower hardware complexity, more longer-range analysis, more machine dependence)
  – Static scheduling
  – Static issue (i.e. VLIW)
  – Static branch prediction
  – Alias/pointer analysis
  – Static speculation
• Hardware Techniques (more stable performance, higher complexity, potential clock rate impact)
  – Dynamic scheduling
  – Dynamic issue (i.e. superscalar)
  – Dynamic branch prediction
  – Dynamic disambiguation
  – Dynamic speculation
• Wide variety of approaches, both hardware and compiler intensive
No clear cut winners at the present
Big Picture: ILP and Memory Systems
• ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation
• Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistor count (log scale, 1,000 to 100,000,000) versus year (1970 to 2005) for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000, spanning the eras of bit-level, instruction-level and thread-level (?) parallelism]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a (single) processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
Support for 2-4 threads, but expect to get only a 1.3x improvement in throughput
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
4 to 8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we need now: some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These too are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn.
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Moore’s Law 1
In 1965 Gordon Moore, director of research and development at Fairchild Semiconductor and later a founder of Intel Corp., wrote a paper for Electronics entitled “Cramming more components onto integrated circuits”. In the paper Moore observed that “The complexity for minimum component cost has increased at a rate of roughly a factor of two per year”.
This observation became known as Moore’s law.
In fact, by 1975 the leading chips had maybe one-tenth as many components as Moore had predicted. The doubling period had stretched out to an average of 17 months in the decade ending in 1975, then slowed to 22 months through 1985 and 32 months through 1995. It has revived to a now relatively peppy 22 to 24 months in recent years.
Moore’s Law 1 (continued)
Similar exponential growth rates have occurred for other aspects of computer technology – disk capacities, memory chip capacities and processor performance. These remarkable growth rates have been the major driving forces of the computer revolution.
        Capacity          Speed (latency)
Logic   2x in 3 years     2x in 3 years
DRAM    4x in 3 years     2x in 10 years
Disk    4x in 3 years     2x in 10 years
Moore’s Law 1 – number of transistors
Moore’s Law 1 – Linewidths
One of the key drivers behind the industry’s ability to double transistor counts every 18 to 24 months is the continuous reduction in linewidths. Shrinking linewidths not only enables more components to fit onto an IC (typically 2x per linewidth generation) but also lowers costs (typically 30% per linewidth generation).
Moore’s Law 1 – Die size
Shrinking linewidths have slowed the rate of growth in die size to 1.14x per year, versus 1.38 to 1.58x per year for transistor counts, and since the mid-nineties accelerating linewidth shrinks have halted and even reversed the growth in die sizes.
Moore’s Law in Action
The number of transistors on a chip doubles annually.
Moore’s Law 1 – Microprocessor
Moore’s Law 1 – Capacity Single Chip DRAM
Improving frequency via pipelining
Process technology and microarchitecture innovations enable doubling the frequency every process generation.
The figure presents the contribution of both: as the process improves, the frequency increases and the average amount of work done in each pipeline stage decreases.
Process Complexity
Shrinking linewidths isn’t free. Linewidth shrinks require process modifications to deal with a variety of issues that come up from shrinking the devices, leading to increasing complexity in the processes being used.
Moore’s Law 2 (Rock’s Law)
In 1996 Intel augmented Moore’s law (the number of transistors on a processor doubles approximately every 18 months) with Moore’s law 2.
Law 2 says that as the sophistication of chips increases, the cost of fabrication rises exponentially.
The cost of semiconductor tools doubles every four years. By this logic, chip fabrication plants, or fabs, were supposed to cost $5 billion each by the late 1990s and $10 billion by now.
Moore’s Law 2 (Rock’s Law) (continued)
For example, in 1986 Intel manufactured the 386, which had 250,000 transistors, in fabs costing $200 million. In 1996 a $2 billion facility was needed to produce the Pentium processor, which had 6 million transistors.
Moore’s Law 2 (Rock’s Law): The Cost of Semiconductor Tools Doubles Every Four Years
Machrone’s Law
The PC you want to buy will always be $5000.
Metcalfe’s Law
A network’s value grows in proportion to the square of the number of its users.
Wirth’s Law
Software is slowing faster than hardware is accelerating.
Performance and new technology generation
According to Moore’s law, each new generation has approximately doubled logic circuit density and increased performance by about 40% while quadrupling memory capacity.
The increase in components per chip comes from the following key factors:
• The factor of two in component density comes from a 2^0.5 shrink in each lithography dimension (2^0.5 in x and 2^0.5 in y).
• An additional factor of 2^0.5 comes from an increase in chip area.
• A final factor of 2^0.5 comes from device and circuit cleverness.
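The three factors listed here multiply out to the quadrupling of memory capacity per generation. A minimal sketch of that arithmetic follows; the constant and function names are illustrative and do not come from the slides.

```python
import math

# Per-generation factors as listed on the slide (names are illustrative).
SHRINK_PER_DIMENSION = math.sqrt(2)  # linear shrink applied in x and in y
CHIP_AREA_GROWTH = math.sqrt(2)      # larger die
CLEVERNESS = math.sqrt(2)            # device and circuit cleverness


def components_per_chip_factor():
    """Overall growth in components per chip for one technology generation."""
    density_gain = SHRINK_PER_DIMENSION ** 2  # x and y shrink together: 2x
    return density_gain * CHIP_AREA_GROWTH * CLEVERNESS


print(f"components per chip grow by about {components_per_chip_factor():.1f}x")
```

The lithography shrink alone accounts for the doubling of logic density; the area and cleverness terms supply the second factor of two.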
Development in ICs
Semiconductor Industry Association Roadmap Summary for high-end Processors
Specification / year            1997     1999     2001     2003     2006     2009     2012
Feature size (micron)           0.25     0.18     0.15     0.13     0.1      0.07     0.05
Supply voltage (V)              1.8-2.5  1.5-1.8  1.2-1.5  1.2-1.5  0.9-1.2  0.6-0.9  0.5-0.6
Transistors/chip (millions)     11       21       40       76       200      520      1400
DRAM bits/chip (mega)           167      1070     1700     4290     17200    68700    275000
Die size (mm2)                  300      340      385      430      520      620      750
Global clock freq (MHz)         750      1200     1400     1600     2000     2500     3000
Local clock freq (MHz)          750      1250     1500     2100     3500     6000     10000
Maximum power/chip (W)          70       90       110      130      160      170      175
Figure: Total transistors/chip – number of transistors (millions) versus technology (micron)/year, from 0.25/1997 to 0.05/2012.
Clock Frequency Versus Year for Various Representative Machines
Limits in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range.
For higher-frequency clocking (>10 GHz), new ideas and new ways of designing digital systems are needed.
Figure: Global clock freq (MHz) and local clock freq (MHz) versus technology (micron)/year, from 0.25/1997 to 0.05/2012.
Intel’s Microprocessors Clock Frequency
Technology Trends – Example
As an illustration of just how computer technology is improving, let’s consider what would have happened if automobiles had improved equally quickly.
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987 and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l.
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in about 39 minutes and to make the one-way trip on less than 10 liters of gasoline.
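The compound-growth arithmetic of this example can be checked with a few lines of Python. This is a sketch only: the Earth-Moon distance of 384,400 km is an assumed mean value that does not appear in the slides, and all names are illustrative.

```python
def compound(value, yearly_rate, years):
    """Value after improving by yearly_rate (0.35 means 35%) for `years` years."""
    return value * (1.0 + yearly_rate) ** years


EARTH_MOON_KM = 384_400.0  # assumed mean distance, not taken from the slides

speed_1987 = compound(150.0, 0.35, 10)      # km/h
economy_1987 = compound(10.0, 0.35, 10)     # km/l
speed_2000 = compound(speed_1987, 0.50, 13)
economy_2000 = compound(economy_1987, 0.50, 13)

minutes_to_moon = EARTH_MOON_KM / speed_2000 * 60.0
liters_one_way = EARTH_MOON_KM / economy_2000

print(f"1987: {speed_1987:.0f} km/h, {economy_1987:.0f} km/l")
print(f"2000: {speed_2000:.0f} km/h, {economy_2000:.0f} km/l")
print(f"Moon: {minutes_to_moon:.1f} min, {liters_one_way:.1f} l one way")
```

Because the script keeps the unrounded growth factors, its results are slightly larger than the figures obtained from the rounded factors 20.1 and 194.6.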
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a “roadmap” based on the idea of Moore’s law.
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available.
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm.
Ideally, processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7.
With such scaling, typical improvement figures are the following:
• 1.4–1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
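These improvement figures can be roughly reproduced from the ~0.7 scaling factor. The sketch below uses a simple first-order model that is an assumption of this example, not something the slides state: delay and capacitance scale with the linear dimension, area with its square, and switching energy with C·V².

```python
def scaling_figures(s=0.7, voltage_reduction=1.35):
    """First-order effects of scaling all dimensions by s (assumed model)."""
    speedup = 1.0 / s                      # shorter devices switch faster
    area_reduction = 1.0 / (s * s)         # both dimensions shrink
    capacitance = s                        # assumed proportional to dimension
    voltage = 1.0 / voltage_reduction
    energy_reduction = 1.0 / (capacitance * voltage ** 2)  # energy ~ C * V^2
    return speedup, area_reduction, energy_reduction


speedup, area_reduction, energy_reduction = scaling_figures()
print(f"{speedup:.2f}x faster, {area_reduction:.2f}x smaller, "
      f"{energy_reduction:.2f}x lower switching energy")
```

Under these assumptions the speed and area figures match the list closely, and the energy figure lands somewhat below the quoted factor of three.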
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry.
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and in integration (from ~10K to ~10M devices) in 25 years.
Microarchitecture attempts to increase both IPC and frequency.
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction and out-of-order execution can increase instructions per cycle (IPC).
Pipelining, as a microarchitecture idea, helps to increase frequency.
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program.
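The three levers named on this slide (IPC, frequency and dynamic instruction count) combine in the usual execution-time relation: time = instructions / (IPC × frequency). A minimal sketch, with an illustrative function name and made-up workload numbers:

```python
def execution_time_s(dynamic_instructions, ipc, frequency_hz):
    """Execution time from instruction count, instructions per cycle and clock."""
    if ipc <= 0 or frequency_hz <= 0:
        raise ValueError("ipc and frequency_hz must be positive")
    return dynamic_instructions / (ipc * frequency_hz)


# Hypothetical workload: one billion dynamic instructions at 1 GHz.
baseline = execution_time_s(1e9, ipc=1.0, frequency_hz=1e9)
better_ipc = execution_time_s(1e9, ipc=2.0, frequency_hz=1e9)       # microarchitecture
fewer_instr = execution_time_s(0.8e9, ipc=1.0, frequency_hz=1e9)    # ISA and compiler
print(baseline, better_ipc, fewer_instr)
```

Each lever acts on a different term, which is why process technology, microarchitecture, ISA and compiler improvements multiply rather than add.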
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages.
With frequencies higher than 1 GHz, more than 20 pipeline stages are used.
Figure: Relative improvement in frequency, CPI, performance and power versus pipeline depth.
Performance of memory and CPU
Memory in a computer system is hierarchically organized.
In 1980 microprocessors were often designed without caches.
Nowadays microprocessors often come with two levels of caches.
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987.
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed.
Relative processor/memory speed
Types of Memories
MOS memories
• RAMs (volatile: power off, contents lost): DRAM, SRAM
• ROMs (non-volatile: power off, contents kept): ROM, EPROM, EEPROM, FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2–4.5 CPU cycles per instruction retired.
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed.
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse.
An anecdote (continued)
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width and two instruction streams.
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy.
It is not clear whether pin bandwidth can keep up – 32 bytes every 2 ns.
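The pin-bandwidth figures in this anecdote follow from simple arithmetic, sketched below. The function name is illustrative, and GBytes is taken to mean 10^9 bytes, so bytes per nanosecond equal GBytes per second.

```python
def pin_bandwidth_gbytes_per_s(bytes_per_transfer, interval_ns):
    """Bandwidth in GBytes/s; one byte per nanosecond equals 1 GByte/s."""
    return bytes_per_transfer / interval_ns


today = pin_bandwidth_gbytes_per_s(16, 8)        # 16 bytes every 8 ns
future = today * 2 * 2 * 2                       # clock rate x issue width x streams
equivalent = pin_bandwidth_gbytes_per_s(32, 2)   # 32 bytes every 2 ns

print(today, future, equivalent)
```

The three doublings multiply, so the requirement grows eightfold rather than by a factor of six.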
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance.
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components.
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data.
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory.
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip.
The first level is ~32–128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses.
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.
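The effect of these hit rates and latencies can be estimated with the standard average memory access time (AMAT) formula. The sketch below assumes the levels are probed one after another and that a main-memory access costs about 100 cycles, as quoted elsewhere in the slides; the function and parameter names are illustrative.

```python
def average_access_cycles(l1_cycles, l1_hit_rate, l2_cycles, l2_hit_rate, memory_cycles):
    """AMAT for a two-level cache, assuming L1, L2 and memory are probed in turn."""
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate  # measured on accesses that missed in L1
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * memory_cycles)


# Pessimistic end of the quoted ranges: 3-cycle L1, 10-cycle L2, 100-cycle memory.
amat = average_access_cycles(3, 0.95, 10, 0.50, 100)
no_l2 = 3 + 0.05 * 100  # the same L1 backed directly by main memory
print(f"with L2: {amat:.1f} cycles, without L2: {no_l2:.1f} cycles")
```

Even with these rough numbers, the second-level cache removes about a quarter of the average access time.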
Memory Hierarchy Impact on Performance
Off-chip memory access may take about 100 cycles.
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
Conclusions concerning memory – problems
Today’s chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems – caches, buses, bandwidth and latency.
Conclusions concerning memory – problems (continued)
If the memory research community would follow the microprocessor community’s lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change.
b) Burst width, which refers to data access granularity, can effect a 15% performance change.
c) Magnetic RAM (MRAM) – a new type of memory.
Magnetic RAM – MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology.
A nanotechnology RAM device consists of tiny carbon nanotubes.
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM).
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm – just a few thousand atoms – in diameter on a silicon wafer.
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH – MRAM
Figure: Design rule [nm] (100 nm, 90 nm, 70 nm, 55 nm) and density [Gb] (from 512 Mb to 16 Gb) of FLASH (SLC, MLC 2 bits/cell, MLC 3 bits/cell) and DRAM, 2003 to 2010.
Surpassing the Prediction from Moore’s Law – DRAM vs MRAM
The famous Moore’s Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates that NAND Flash memory density doubles every year.
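The gap between the two doubling periods compounds quickly. A small sketch of the comparison follows; the function name and the six-year horizon are illustrative choices, not values from the slides.

```python
def density_growth(years, doubling_period_years):
    """Growth factor after `years` when density doubles every doubling period."""
    return 2.0 ** (years / doubling_period_years)


moore = density_growth(6, 1.5)   # doubling every 1.5 years
flash = density_growth(6, 1.0)   # doubling every year
print(f"after 6 years: Moore's Law {moore:.0f}x, yearly doubling {flash:.0f}x")
```

Over six years the yearly-doubling model ends up four times ahead of the 1.5-year model.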
Figure: Density [Gb] (log scale, 0.1 to 100) versus year (2000 to 2010) for FLASH (2-fold density per year; MLC 2 bits/cell and MLC 3 bits/cell), DRAM and the Moore’s Law trend.
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore’s Law.
Overall memory prediction roadmap (continued)
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
Figure: computer food chain (labels include Mainframe and Supercomputer).
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline – Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60·10^6 MWh.
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production.
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69·10^6 MWh.
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
“CMOS circuits dissipate little power by nature. So believed circuit designers.” (Kuroda-Sakurai ’95)
“By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced.”
Power dissipation in time
Figure: Power (W), log scale from 0.01 to 100, versus year (1980 to 1995); trend of x4 every 3 years.
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
Figure: Voltage [V] (0 to 2.5), power per chip [W] (0 to 200) and VDD current [A] (0 to 500) versus year (1998 to 2014).
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), Electronic Industries Association of Japan (EIAJ), Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA).
(Taken from Sakurai’s ISSCC 2001 presentation)
Power Delivery Problem (not just California)
Figure label: Your car starter
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 – capacitive switching power (dynamic, dominant)
P2 – short-circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
Research Efforts in Low-Power Design
$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement.
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed.
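The quadratic effect of the supply voltage can be seen directly from the switching-power expression P = pt · CL · Vdd² · fCLK. The sketch below evaluates it for made-up parameter values; the numbers and the function name are illustrative assumptions, not data from the slides.

```python
def switching_power_w(p_t, c_load_f, v_dd, f_clk_hz):
    """Capacitive switching power: activity * load capacitance * Vdd^2 * clock."""
    return p_t * c_load_f * v_dd ** 2 * f_clk_hz


# Hypothetical block: activity 0.5, 1 nF switched capacitance, 100 MHz clock.
at_2v5 = switching_power_w(0.5, 1e-9, 2.5, 100e6)
at_1v25 = switching_power_w(0.5, 1e-9, 1.25, 100e6)     # supply voltage halved
half_load = switching_power_w(0.5, 0.5e-9, 2.5, 100e6)  # load capacitance halved

print(f"{at_2v5:.3f} W -> {at_1v25:.3f} W (Vdd/2), {half_load:.3f} W (CL/2)")
```

Halving the supply voltage cuts the power to a quarter, while halving the capacitance or the switching activity only halves it.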
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
Figure: Normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V] (0.6 V to 5.0 V).
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is:
Decreasing the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
General Approaches to Reduce Power
a) Reduction in fCLK is an option acceptable when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way for power reduction, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product ptCL is called the average switched capacitance per cycle, and the main directions for reducing this capacitance are at the system, architectural, RTL, circuit or technology level.
Low Power and Low Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology (higher impact, more options):
System level            Design partitioning, power down
Algorithm level         Complexity, concurrency, locality, regularity, data representation
Architecture level      Voltage scaling, parallelism, instruction set, signal correlations
Circuit level           Transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level    Threshold reduction, multi-threshold
Multiple Frequency on the Chip as a Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention.
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
Figure: A main PLL generating fCLK and four local clock generators (PLL 1/DLL 1 to PLL 4/DLL 4) with local frequencies f11, f12, f21, f31, f32 and f41.
Energy Minimization Using Multiple Frequency
Figure (PLL based): blocks labeled phase detector, current pump (up/down), loop filter, VCO, divider by N, CLKREF, CLKFB, regulated voltage, clock distribution & frequency multiplier, and digital system logic.
Figure (DLL based): blocks labeled phase detector, current pump (up/down), loop filter, control, VCDL (TVCDL; DC1, DC2, ... DCn; in/out), CLKREF, CLKFB, outputs f1, 2f1, ... nf1, and digital system.
Clock Gating & Clock Distribution as Techniques to Reduce Power
Figure (clock gating): latch-based gating of the clock by an enable signal; the gated clock drives the target flip-flops (DFF with D, C and Q), which are activated or deactivated.
Figure (clock distribution): a PLL (Clk generator) supplies Clk to modules A, B, C and D, with enable signals Enable_A, Enable_B and Enable_C.
The use of a gated clock is the most common approach to reduce energy. Unused modules are turned off by suppressing the clock to the module.
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip, as a less aggressive approach, is attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
Figure (Power State Machine): states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt) and SLEEP (P = 0.16 mW, wait for wake-up event); transition times of ~10 µs, ~90 µs and 160 ms.
Figure (Power Manager): an OBSERVER receives observations (workload information) from the SYSTEM and a CONTROLLER sends commands to it.
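The benefit of a power manager can be estimated from the state powers shown for the power state machine (400 mW, 50 mW and 0.16 mW). The sketch below averages power over the fraction of time spent in each state; it deliberately ignores the time and energy cost of state transitions, which is a simplifying assumption of this example, and the function name and duty cycles are illustrative.

```python
# State powers in mW, as shown for the power state machine.
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(time_fraction):
    """Average power for given fractions of time per state (transitions ignored)."""
    if abs(sum(time_fraction.values()) - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(POWER_MW[state] * share for state, share in time_fraction.items())


always_on = average_power_mw({"RUN": 1.0})
idle_when_free = average_power_mw({"RUN": 0.1, "IDLE": 0.9})
sleep_when_free = average_power_mw({"RUN": 0.1, "SLEEP": 0.9})
print(always_on, idle_when_free, sleep_when_free)
```

In practice the long wake-up time from the sleep state limits how often it can be used, which is the trade-off a power manager policy has to weigh.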
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
Figure (power breakdown): shares of Clock, Memory, Control, I/O and Datapath (values shown: 12, 16, 25, 43).
Figure (dynamic instruction statistics): shares of Compare op, Logical op, Others, Data Move, Control Flow and Arithmetic op (values shown: 15, 13, 5, 1, 23, 43).
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline – Microprocessors’ Generations
• First generation, 1971-78 – Behind the power curve
• Second generation, 1979-85 – Becoming “real” computers
• Third generation, 1985-89 – Challenging the “establishment”
• Fourth generation, 1990- – Architectural and performance leadership
The microprocessor today
When we say “microprocessor” today, we generally mean the shaded area of the figure.
The First Generation, 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
• Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
– 3500 transistors
– First microprocessor-based computer (Micral): targeted at laboratory instrumentation, mostly sold in Europe
Intel 8080
• Intel’s first 16-bit architecture
– Delivered in 1974
– 4800 transistors
– Performance < 0.2 MIPS
• Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975: $297, or $395 with case; 256 bytes of memory, expandable to 64K; keyboard and floppy; 100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one (Homebrew Computer Club)
Intel 8086
• Introduced in 1978
– Performance < 0.5 MIPS
• New 16-bit architecture
– “Assembly language” compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
– Based on 8088, the 8-bit bus version of 8086
Second Generation, 1979-85
• Becoming “real” computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
– First 32-bit architecture (initial 16-bit implementation)
– First flat 32-bit address; support for paging
– General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
– 68,000 transistors
– < 1 MIPS
• Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation, 1985-89
• Challenging the “establishment”
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
• Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation, 1990-
• Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5 – same basic approach but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
– True from 1985-present
• Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to high clock rates
– More transistors: architectural ideas turn transistors into performance (responsible for about half the yearly performance growth)
• Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction level parallelism
Memory Hierarchies
Caches hide latency of DRAM and increase bandwidth
– CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches: add another level of caching
First multilevel cache: 1986
Secondary cache sizes today: 128 KB to 4-16 MB
Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
early restart after cache miss (1992)
nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
prefetching: instruction to bring data into cache early
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
superscalar: uses dynamic issue decision (HW driven)
VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out-of-order
– speculative execution depending on branch prediction
“Off-the-shelf” ILP techniques yielded a 20-year path
MIPS R4000
First 64-bit architecture
Integrated caches
– On-chip
– Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
Implemented in 1991
– 1.3M transistors
– 50 MIPS
Used primarily as attached processor (e.g., graphics)
MIPS R10000
First speculative processor
– Instructions scheduled and executed out-of-order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in-flight)
– Maintains precise state by completing instructions in order
Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today’s Uniprocessor ILP Menu
Software Techniques
– Static scheduling
– Static issue (i.e., VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
Lower hardware complexity; more longer-range analysis; more machine dependence
Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e., superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
More stable performance; higher complexity; potential clock rate impact
Wide variety of approaches, both hardware and compiler intensive
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: the “ILP mountain” (simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation) and the “cache mountain” (simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching)]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year (1970 to 2005) for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000, across the eras of bit-level, instruction-level and thread-level (?) parallelism]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel’s processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a (single) processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today’s processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4)
They support 2-4 threads, but expect only about a 1.3x improvement in throughput
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa-2010 Microprocessor
4 – 8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say “microprocessor” tomorrow, we generally mean the shaded area of the figure
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that “where you stand depends on where you sit.”
In this context, starting from our positions and experiences, this is our view on the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education – not just the undergraduate curriculum – organized to support today’s graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change – so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT.
Discrete mathematics, not continuous mathematics, is the underpinning of IT.
It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast.
Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields.
More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context.
The engineer must design under constraints that include global, cultural and business contexts, and so must understand them.
These, too, are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Moore’s Law 1 - continued
Similar exponential growth rates have occurred for other aspects of computer technology – disk capacities, memory chip capacities, and processor performance. These remarkable growth rates have been the major driving forces of the computer revolution.
Technology | Capacity | Speed (latency)
Logic | 2x in 3 years | 2x in 3 years
DRAM | 4x in 3 years | 2x in 10 years
Disk | 4x in 3 years | 2x in 10 years
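To compare the rows of this table on a common scale, each "N times in M years" figure can be converted into an equivalent compound annual rate. The sketch below is only an illustration; the helper name `annual_rate` is an assumption, not something from the slides.

```python
def annual_rate(factor: float, years: float) -> float:
    """Compound annual growth rate equivalent to `factor` improvement over `years`."""
    return factor ** (1.0 / years) - 1.0


# Figures taken from the table above.
trends = {
    "Logic capacity and speed (2x in 3 years)": (2, 3),
    "DRAM and disk capacity (4x in 3 years)": (4, 3),
    "DRAM and disk speed (2x in 10 years)": (2, 10),
}

for name, (factor, years) in trends.items():
    print(f"{name}: about {annual_rate(factor, years):.0%} per year")
```

Seen this way, capacity grows by roughly 59% per year while DRAM and disk latency improve by only about 7% per year, which is the imbalance the later memory slides return to.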
Moore’s Law 1 – number of transistors
Moore’s Law 1 – Linewidths
One of the key drivers behind the industry’s ability to double transistor counts every 18 to 24 months is the continuous reduction in linewidths. Shrinking linewidths not only enable more components to fit onto an IC (typically 2x per linewidth generation) but also lower costs (typically 30% per linewidth generation).
Moore’s Law 1 – Die size
Shrinking linewidths have slowed the rate of growth in die size to 1.14x per year, versus 1.38x to 1.58x per year for transistor counts, and since the mid-nineties accelerating linewidth shrinks have halted and even reversed the growth in die sizes.
Moore’s Law in Action
The number of transistors on a chip doubles annually
Moore’s Law 1 – Microprocessor
Moore’s Law 1 – Capacity of Single-Chip DRAM
Improving frequency via pipelining
Process technology and microarchitecture innovations enable the frequency to double every process generation.
The figure presents the contribution of both: as the process improves, the frequency increases and the average amount of work done in each pipeline stage decreases.
Process Complexity
Shrinking linewidths isn’t free. Linewidth shrinks require process modifications to deal with a variety of issues that come up from shrinking the devices, leading to increasing complexity in the processes being used.
Moore’s Law 2 (Rock’s Law)
In 1996 Intel augmented Moore’s law (the number of transistors on a processor doubles approximately every 18 months) with Moore’s law 2.
Law 2 says that as the sophistication of chips increases, the cost of fabrication rises exponentially.
The cost of semiconductor tools doubles every four years. By this logic, chip fabrication plants, or fabs, were supposed to cost $5 billion each by the late 1990s and $10 billion by now.
Moore’s Law 2 (Rock’s Law) - continued
For example, in 1986 Intel manufactured the 386, which counted 250,000 transistors, in fabs costing $200 million. In 1996, a $2 billion facility was needed to produce the Pentium processor, which counted 6 million transistors.
Moore’s Law 2 (Rock’s Law)
The Cost of Semiconductor Tools Doubles Every Four Years
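The fab-cost figures quoted above can be checked against the stated four-year doubling period. The following is a rough sketch; the function names are illustrative assumptions, and only the 1986 and 1996 cost figures come from the slides.

```python
import math


def projected_cost(base_cost: float, base_year: int, year: int,
                   doubling_years: float = 4.0) -> float:
    """Cost projected from a base year, assuming doubling every `doubling_years`."""
    return base_cost * 2 ** ((year - base_year) / doubling_years)


def implied_doubling_period(cost_start: float, cost_end: float, years: float) -> float:
    """Doubling period implied by two observed cost points."""
    return years / math.log2(cost_end / cost_start)


# Figures from the slide: $200 million in 1986, $2 billion in 1996.
rock = projected_cost(200e6, 1986, 1996)
observed = implied_doubling_period(200e6, 2e9, 10)

print(f"Four-year doubling predicts ${rock / 1e9:.2f} billion for 1996")
print(f"The quoted figures imply doubling every {observed:.1f} years")
```

Under these numbers the quoted tenfold increase over ten years is somewhat faster than a strict four-year doubling would give, so the "every four years" rule reads best as an order-of-magnitude guide.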
Machrone’s Law
The PC you want to buy will always be $5000
Metcalfe’s Law
A network’s value grows proportionately to the number of its users squared
Wirth’s Law
Software is slowing faster than hardware is accelerating
Performance and new technology generation
According to Moore’s law, each new generation has approximately doubled logic circuit density and increased performance by about 40%, while quadrupling memory capacity.
The increase in components per chip comes from the following key factors:
The factor of two in component density comes from a 2^0.5 shrink in each lithography dimension (2^0.5 per x and 2^0.5 per y).
An additional factor of 2^0.5 comes from an increase in chip area.
A final factor of 2^0.5 comes from device and circuit cleverness.
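The way these factors combine can be written out explicitly. This is a minimal sketch that assumes the four contributions simply multiply; the variable names are illustrative.

```python
import math

ROOT2 = math.sqrt(2)  # each individual factor quoted above is 2^0.5

shrink_x = ROOT2    # lithography shrink in x
shrink_y = ROOT2    # lithography shrink in y
chip_area = ROOT2   # growth in chip area
cleverness = ROOT2  # device and circuit cleverness

density_gain = shrink_x * shrink_y                     # components per unit area
components_gain = density_gain * chip_area * cleverness  # components per chip

print(f"Density gain per generation: {density_gain:.2f}x")
print(f"Components per chip per generation: {components_gain:.2f}x")
```

The two lithography factors alone account for the doubling of density, and all four together give the fourfold increase associated with memory capacity.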
Development in ICs
Semiconductor Industry Association Roadmap Summary for high-end Processors
Specification / year | 1997 | 1999 | 2001 | 2003 | 2006 | 2009 | 2012
Feature size (micron) | 0.25 | 0.18 | 0.15 | 0.13 | 0.1 | 0.07 | 0.05
Supply voltage (V) | 1.8-2.5 | 1.5-1.8 | 1.2-1.5 | 1.2-1.5 | 0.9-1.2 | 0.6-0.9 | 0.5-0.6
Transistors/chip (millions) | 11 | 21 | 40 | 76 | 200 | 520 | 1400
DRAM bits/chip (mega) | 167 | 1070 | 1700 | 4290 | 17200 | 68700 | 275000
Die size (mm²) | 300 | 340 | 385 | 430 | 520 | 620 | 750
Global clock freq (MHz) | 750 | 1200 | 1400 | 1600 | 2000 | 2500 | 3000
Local clock freq (MHz) | 750 | 1250 | 1500 | 2100 | 3500 | 6000 | 10000
Maximum power/chip (W) | 70 | 90 | 110 | 130 | 160 | 170 | 175
Total transistors/chip
[Figure: number of transistors per chip (millions, 0 to 1600) versus technology (0.25 to 0.05 micron) and year (1997 to 2012)]
Clock Frequency Versus Year for Various Representative Machines
Limits in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range.
For higher-frequency clocking (> 10 GHz), new ideas and new ways of designing digital systems are needed.
[Figure: global and local clock frequency (MHz, 0 to 12,000) versus technology (0.25 to 0.05 micron) and year (1997 to 2012)]
Intel’s Microprocessors Clock Frequency
Technology Trends - Example
As an illustration of just how quickly computer technology is improving, let’s consider what would have happened if automobiles had improved equally quickly.
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987, and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l.
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in about 39 min, and to make the round trip on less than 20 liters of gasoline.
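The arithmetic of this example is easy to reproduce. In the sketch below the Earth-Moon distance of 384,400 km is an assumption (a commonly quoted mean value, not given in the slides), and the results differ slightly from the slide because the slide rounds the growth factors to 20.1 and 194.6 before multiplying.

```python
MOON_DISTANCE_KM = 384_400  # assumed mean Earth-Moon distance, not from the slides


def improve(value: float, rate: float, years: int) -> float:
    """Value after compounding `rate` improvement per year for `years` years."""
    return value * (1.0 + rate) ** years


# 1977 baseline from the example: 150 km/h and 10 km/l.
speed_1987 = improve(150.0, 0.35, 10)
economy_1987 = improve(10.0, 0.35, 10)
speed_2000 = improve(speed_1987, 0.50, 13)
economy_2000 = improve(economy_1987, 0.50, 13)

minutes_to_moon = MOON_DISTANCE_KM / speed_2000 * 60
round_trip_litres = 2 * MOON_DISTANCE_KM / economy_2000

print(f"1987: {speed_1987:,.0f} km/h, {economy_1987:,.0f} km/l")
print(f"2000: {speed_2000:,.0f} km/h, {economy_2000:,.0f} km/l")
print(f"Earth to Moon: {minutes_to_moon:.1f} min; round trip: {round_trip_litres:.1f} l")
```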
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a “roadmap” based on the idea of Moore’s law.
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available.
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm.
Ideally, process technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7.
With such scaling, typical improvement figures are the following:
• 1.4 – 1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
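Two of these figures can be related to the ~0.7 linear scaling factor with a back-of-the-envelope calculation. The sketch below rests on simplifying assumptions that are not in the slides: transistor area scales with the square of the linear dimension, capacitance scales linearly with dimension, and the energy per switching event goes as C·V².

```python
LINEAR_SCALE = 0.7        # from the text: all dimensions shrink by ~0.7
VOLTAGE_REDUCTION = 1.35  # from the text: 1.35 times lower operating voltage


def area_scale(linear: float) -> float:
    """Relative transistor area, assuming area goes as dimension squared."""
    return linear * linear


def switching_energy_scale(linear: float, voltage_reduction: float) -> float:
    """Relative energy per switch, assuming C scales linearly and E ~ C * V^2."""
    return linear / voltage_reduction ** 2


area = area_scale(LINEAR_SCALE)
energy = switching_energy_scale(LINEAR_SCALE, VOLTAGE_REDUCTION)

print(f"Area: {area:.2f} of the original ({1 / area:.1f} times smaller)")
print(f"Switching energy: {energy:.2f} of the original ({1 / energy:.1f} times lower)")
```

Under these assumptions the area halves, matching "two times smaller transistors", and the switching energy drops by roughly 2.6 times, which is in the same range as the quoted factor of three.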
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry.
It is characterized by 1000-fold growth in frequency (from 1 MHz to 1 GHz) and in integration (from ~10K to ~10M devices) in 25 years.
Microarchitecture attempts to increase both IPC and frequency.
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction, and out-of-order execution can increase instructions per cycle (IPC).
Pipelining, as a microarchitecture idea, helps to increase frequency.
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program.
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages.
With frequencies higher than 1 GHz, more than 20 pipeline stages are used.
[Figure: relative improvement in frequency, CPI, performance and power as a function of pipeline depth]
Performance of memory and CPU
Memory in a computer system is hierarchically organized.
In 1980, microprocessors were often designed without caches.
Nowadays, microprocessors often come with two levels of caches.
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987.
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed.
Relative processor/memory speed
Types of Memories
MOS memories
– RAMs (volatile: contents are lost at power-off): DRAM, SRAM
– ROMs (non-volatile: contents are kept at power-off): ROM, EPROM, EEPROM, FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 – 4.5 CPU cycles per instruction retired.
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed.
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse.
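The "three out of every four cycles" figure follows directly from the measured cycles per instruction. The sketch below assumes, as a simplification, that a cycle retires at most one instruction; on a processor that can retire several per cycle, the share of empty cycles can only be higher. The function name is illustrative.

```python
def idle_cycle_fraction(cpi: float) -> float:
    """Fraction of cycles retiring no instruction, assuming at most one retires per cycle."""
    return 1.0 - 1.0 / cpi


for cpi in (4.2, 4.5):
    print(f"CPI {cpi}: {idle_cycle_fraction(cpi):.0%} of cycles retire nothing")
```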
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams.
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy.
It is not clear whether pin bandwidth can keep up – 32 bytes every 2 ns.
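The bandwidth figures in this anecdote can be verified in a few lines. This is a small sketch with an illustrative function name; it relies on the fact that one byte per nanosecond equals one GByte/s (taking 1 GByte as 10^9 bytes).

```python
def bandwidth_gbytes_per_s(bytes_per_transfer: float, interval_ns: float) -> float:
    """Bandwidth in GBytes/s; bytes per nanosecond is numerically the same value."""
    return bytes_per_transfer / interval_ns


today = bandwidth_gbytes_per_s(16, 8)  # 16 bytes every 8 ns

# Twice the clock rate, twice the issue width, two instruction streams.
future = today * 2 * 2 * 2

print(f"Today: {today:.0f} GBytes/s, future chip: {future:.0f} GBytes/s")
print(f"32 bytes every 2 ns = {bandwidth_gbytes_per_s(32, 2):.0f} GBytes/s")
```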
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance.
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components.
Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data.
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory.
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip.
The first level is ~32 – 128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses.
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles.
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
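The cache figures quoted on these slides can be combined into an average memory access time. The sketch below uses the standard textbook model, which is an assumption here rather than something the slides state: each level's miss penalty is added on top of its own access time. The function name `amat` is illustrative.

```python
def amat(l1_time: float, l1_miss: float,
         l2_time: float, l2_miss: float, mem_time: float) -> float:
    """Average memory access time in cycles for a two-level cache hierarchy."""
    return l1_time + l1_miss * (l2_time + l2_miss * mem_time)


# Figures from the slides: L1 hits ~95% in 2-3 cycles, L2 takes 6-10 cycles and
# catches at least 50% of L1 misses, main memory takes about 100 cycles.
best_case = amat(2, 0.05, 6, 0.5, 100)
worst_case = amat(3, 0.05, 10, 0.5, 100)

print(f"Average access time: {best_case:.1f} to {worst_case:.1f} cycles")
print("Without caches every access would take about 100 cycles")
```

With these figures the hierarchy brings the average access down to roughly 5 to 6 cycles, which shows why even a small change in miss rate has a large effect on performance.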
As conclusion concerning memory - problems
Today’s chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems – caches, buses, bandwidth and latency.
As conclusion concerning memory – problems, continued
If the memory research community would follow the microprocessor community’s lead by leaning more heavily on architecture-level and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM) – a new type of memory
Magnetic RAM – MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
The 10 Gbit devices consist of 10 billion carbon nanotubes that are 1 nm – just a few thousand atoms – in diameter, on a silicon wafer
MRAM is nonvolatile; it has the fast read-and-write performance of static RAM (SRAM)
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule (nm, 100 nm down to 55 nm) and density (Gb, log scale from 0.1 to 100) of DRAM and FLASH for 2003, 2005 and 2010, with density labels up to 16 Gb and SLC and MLC (2 bits/cell, 3 bits/cell) flash variants]
Surpassing the Prediction from Moore’s Law – DRAM vs MRAM
The famous Moore’s Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
[Figure: density (Gb, log scale from 0.1 to 100) versus year (2000 to 2010) for DRAM and FLASH, comparing the Moore’s Law trend with the two-fold density increase per year of MLC (2 bits/cell, 3 bits/cell) flash]
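The difference between the two doubling periods compounds quickly. As a rough illustration, assuming the 1.5-year and 1-year doubling periods discussed above hold steady for a decade (the ten-year horizon and the function name are assumptions made for the example):

```python
def density_growth(years: float, doubling_years: float) -> float:
    """Growth factor after `years` when density doubles every `doubling_years`."""
    return 2 ** (years / doubling_years)


moore = density_growth(10, 1.5)  # doubling every 1.5 years
flash = density_growth(10, 1.0)  # doubling every year

print(f"Ten years at 1.5-year doubling: about {moore:.0f}x")
print(f"Ten years at yearly doubling: {flash:.0f}x")
print(f"Ratio: about {flash / moore:.0f}x")
```

Over ten years the yearly doubling yields roughly ten times more density than the 1.5-year doubling.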
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore’s Law.
Overall memory prediction roadmap - continued
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure: computer food chain, including mainframe and supercomputer]
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production
• By 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
“CMOS circuits dissipate little power by nature. So believed circuit designers.” (Kuroda-Sakurai ’95)
“By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced.”
Power dissipation in time
[Figure: power (W, log scale from 0.01 to 100) versus year (1980 to 1995), growing x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage (V, 0 to 2.5), power per chip (W, 0 to 200) and VDD current (A, 0 to 500) versus year (1998 to 2014)]
International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), Electronic Industries Association of Japan (EIAJ), Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai’s ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure: “Your car starter”]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 – capacitive switching power (dynamic, dominant)
P2 – short-circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
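The relative weight of these terms, and the quadratic dependence of the switching term on the supply voltage, can be explored numerically. In the sketch below every parameter value is hypothetical and chosen only for illustration; the function names are likewise assumptions, while the formula follows the terms P1 to P4 listed above.

```python
def switching_power(p_t: float, c_load: float, v_dd: float, f_clk: float) -> float:
    """P1: capacitive switching power, p_t * C_L * Vdd^2 * f_clk."""
    return p_t * c_load * v_dd ** 2 * f_clk


def average_power(p_t: float, c_load: float, v_dd: float, f_clk: float,
                  i_sc: float = 0.0, i_leak: float = 0.0,
                  p_static: float = 0.0) -> float:
    """Sum of P1 (switching), P2 (short circuit), P3 (leakage) and P4 (static)."""
    p1 = switching_power(p_t, c_load, v_dd, f_clk)
    p2 = i_sc * v_dd
    p3 = i_leak * v_dd
    return p1 + p2 + p3 + p_static


# Hypothetical operating point: activity 0.2, 1 nF switched capacitance,
# 2.5 V supply, 1 GHz clock, 50 mA short-circuit and 10 mA leakage current.
total = average_power(0.2, 1e-9, 2.5, 1e9, i_sc=0.05, i_leak=0.01)
ratio = switching_power(0.2, 1e-9, 1.25, 1e9) / switching_power(0.2, 1e-9, 2.5, 1e9)

print(f"Total average power: {total:.2f} W")
print(f"Halving Vdd scales the switching term by {ratio:.2f}")
```

With these made-up values the switching term accounts for 1.25 W of the 1.40 W total, and halving the supply voltage cuts that term to a quarter.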
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance improves both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V)]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is:
Decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
General Approaches to Reduce Power
a) Reducing fCLK is an acceptable option when some components are idle or have low activity during operation
b) Reducing Vdd is the most effective way to reduce power, since power is proportional to the square of Vdd. The problem with reducing Vdd is that it increases circuit delay
c) The product ptCL is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit or technology level
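The trade-off in point b) can be sketched as follows. The quadratic power relation comes from the switching-power formula; the delay model is an assumption that is not in the slides (an alpha-power-law form, with an illustrative threshold voltage and exponent), included only to show the direction of the effect.

```python
def relative_power(v_dd, v_ref):
    """Switching power relative to v_ref, with p_t, C_L and f_CLK held fixed."""
    return (v_dd / v_ref) ** 2


def relative_delay(v_dd, v_ref, v_t=0.5, alpha=2.0):
    """Gate delay relative to v_ref under an ASSUMED alpha-power-law model:
    delay ~ V_dd / (V_dd - V_t)**alpha. v_t and alpha are illustrative values."""
    def delay(v):
        return v / (v - v_t) ** alpha
    return delay(v_dd) / delay(v_ref)


for v in (3.0, 2.0, 1.5, 1.0):
    print(f"Vdd={v:.1f} V  power x{relative_power(v, 3.0):.2f}  "
          f"delay x{relative_delay(v, 3.0):.2f}")
```

Halving Vdd cuts switching power to a quarter, while the assumed delay model shows the gate getting several times slower, which is why voltage reduction is usually paired with parallelism or pipelining.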
Low Power and Low Energy System Design
The design of low power circuits can be tackled at different levels, from system to technology. Higher levels offer higher impact and more options:
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
Multiple frequencies on the chip is a less aggressive approach that attracts more attention
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating …
[Figure: a central PLL driven by fCLK feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), each producing its own local frequencies (f11, f12, f21, f31, f32, f41)]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based scheme: phase detector (CLKREF, CLKFB inputs; up/down outputs), current pump, loop filter, VCO and divider by N, with regulated voltage and clock distribution & frequency multiplier feeding the logic of the digital system]
[Figure: DLL-based scheme: phase detector (CLKREF, CLKFB; up/down), current pump, loop filter and control driving a VCDL of delay cells DC1, DC2, ..., DCn (total delay TVCDL) that produces f1, 2f1, ..., nf1 for the digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating: an enable signal, held by a latch, is combined with the clock to produce the gated clock that drives the target D flip-flops, which are thereby activated or deactivated]
[Figure: clock distribution: a PLL (clock generator) distributes Clk to modules A, B, C and D, with Enable_A, Enable_B and Enable_C gating the clock]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
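The effect of suppressing the clock to unused modules can be illustrated with a toy model. This is a sketch under simple assumptions: each delivered clock cycle costs one arbitrary unit of energy, and the module names and enable traces are hypothetical.

```python
def clock_energy(enable_trace, energy_per_clock=1.0, gated=True):
    """Energy spent clocking one module over a trace of per-cycle enable bits.

    With gating, the module only receives (and pays for) the clock in cycles
    where enable is 1; without gating it is clocked in every cycle.
    energy_per_clock is an arbitrary unit chosen for illustration.
    """
    cycles = sum(enable_trace) if gated else len(enable_trace)
    return cycles * energy_per_clock


# Hypothetical activity over 10 cycles: A busy in 8, B in 3, C in 1
traces = {
    "A": [1, 1, 1, 1, 0, 1, 1, 1, 0, 1],
    "B": [0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
    "C": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
}
ungated = sum(clock_energy(t, gated=False) for t in traces.values())
gated = sum(clock_energy(t, gated=True) for t in traces.values())
print(f"ungated={ungated:.0f}  gated={gated:.0f}  saving={1 - gated / ungated:.0%}")
```

The saving depends entirely on how often modules are idle; the model ignores the overhead of the gating logic itself.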
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip is a less aggressive approach that is attracting attention
• It allows modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while modules on noncritical paths use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
[Figure: power manager: an observer collects workload information (observations) from the system and a controller issues commands to it]
[Figure: power state machine: RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt) and SLEEP (P = 0.16 mW, wait for wake-up event), with transition times of ~10 µs, ~90 µs and 160 ms]
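A rough sense of what the power state machine buys can be had from a time-weighted average. The sketch below uses the three state powers from the figure (reading the extracted "016" as 0.16 mW, which is an assumption); the workload fractions are hypothetical, and transition energy and latency are ignored.

```python
# State powers in mW, taken from the power state machine figure
# (the SLEEP value is assumed to be 0.16 mW).
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(time_fraction):
    """Average power for a given fraction of time spent in each state.

    Transition energy and latency are ignored in this simplified sketch.
    """
    if abs(sum(time_fraction.values()) - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(STATE_POWER_MW[state] * frac for state, frac in time_fraction.items())


# Hypothetical workload that is busy 10% of the time
always_on = average_power_mw({"RUN": 1.0})
idle_only = average_power_mw({"RUN": 0.1, "IDLE": 0.9})
with_sleep = average_power_mw({"RUN": 0.1, "IDLE": 0.1, "SLEEP": 0.8})
print(f"always on: {always_on:.1f} mW, idle only: {idle_only:.1f} mW, "
      f"with sleep: {with_sleep:.3f} mW")
```

A real power manager must also weigh the long wake-up time from SLEEP against the energy saved, which this average does not capture.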
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: power breakdown pie chart (clock, memory, control, I/O, datapath) and dynamic instruction statistics pie chart (arithmetic op, control flow, data move, compare op, logical op, others)]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First generation, 1971-78: behind the power curve
• Second generation, 1979-85: becoming "real" computers
• Third generation, 1985-89: challenging the "establishment"
• Fourth generation, 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation, 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
- Narrow datapaths (= slow performance)
- Awkward architectures
- Assembly language + some BASIC
• Processors
- Intel 4004, 8008, 8080, 8086
- Zilog Z-80
- Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
- 3,500 transistors
- First microprocessor-based computer (Micral): targeted at laboratory instrumentation, mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
- Delivered in 1974
- 4,800 transistors
- Performance < 0.2 MIPS
• Used in Altair 8800 system
- Kit form (advertised in Popular Electronics) in 1975: $297, or $395 with case; 256 bytes of memory, expandable to 64K; keyboard and floppy; 100-line bus becomes S-100, the first microcomputer bus
- Gates & Allen write BASIC
- Wozniak builds one
• Homebrew Computer Club
Intel 8086
• Introduced in 1978
- Performance < 0.5 MIPS
• New 16-bit architecture
- "Assembly language" compatible with 8080
- 29,000 transistors
- Includes memory protection, support for FP coprocessor
• In 1981, IBM introduces PC
- Based on 8088, the 8-bit bus version of 8086
Second Generation, 1979-85
• Becoming "real" computers
- First 32-bit architecture (68000)
- First virtual memory support
- Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
- Motorola 68000, 68020
- Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
- First 32-bit architecture, initial 16-bit implementation
- First flat 32-bit address; support for paging
- General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
- 68,000 transistors
- < 1 MIPS
• Used in
- Apple Mac
- Sun, Silicon Graphics & Apollo workstations
Third Generation, 1985-89
• Challenging the "establishment"
- Microprocessors surpass minicomputers in performance, rival mainframes
- Implementation technology of choice: all new architectures are microprocessors
- RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
- MIPS R2000, R3000
- Sun SPARC
- HP PA-RISC
MIPS R2000
• Several firsts
- First RISC microprocessor
- First microprocessor to provide integrated support for instruction & data cache
- First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
- 125,000 transistors
- 5-8 MIPS
Fourth Generation, 1990-
• Architectural and performance leadership
- First 64-bit architecture
- First multiple-issue machine
- First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
- Intel i860, Pentium; MIPS R4000, MIPS R10000; DEC Alpha; Sun UltraSPARC; HP PA-RISC; PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
- Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
- True from 1985 to present
• Combination of technology and architectural enhancements
- Technology provides faster transistors and more of them
- Faster transistors lead to high clock rates
- More transistors
• Architectural ideas turn transistors into performance
- Responsible for about half the yearly performance growth
• Two key architectural directions
- Sophisticated memory hierarchies
- Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
- CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: increasingly large caches
- On-chip: from 128 bytes (1984) to 100K+ bytes
- Multilevel caches add another level of caching: first multilevel cache in 1986; secondary cache sizes today 128 KB to 4-16 MB
• Trend 2: advances in caching techniques
- Reduce or hide cache miss latencies: early restart after cache miss (1992); nonblocking caches continue during a cache miss (1994)
- Cache-aware combos (computers, compilers, code writers): prefetching instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
- Overlapping execution in a pipeline
- Issuing multiple instructions per clock: superscalar uses dynamic issue decision (HW driven); VLIW uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
- determine parallelism dynamically
- execute instructions out-of-order
- speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded a 20-year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
- On-chip
- Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
- Deep pipeline
- 1.4M transistors
- Initially 100 MHz
- > 50 MIPS
Intel i860
• First multiple-issue microprocessor
- 2 instructions/clock
- Dual issue mode
- Novel push pipeline
- Novel cache bypass
• Implemented in 1991
- 1.3M transistors
- 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
- Instructions scheduled and executed out-of-order
- Up to 4 instructions can complete per clock
- Window of 32 instructions (up to 32 in flight)
- Maintains precise state by completing instructions in order
• Implemented in 1996
- 6.8M transistors
- 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
- Use compiler-centric approach while avoiding disadvantages
- Parallelism demarcated by the compiler
- Many special instructions & features for exploiting ILP in the compiler
• Itanium
- First implementation (2001)
- 25M transistors
- 800 MHz
- 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques (lower hardware complexity, more longer-range analysis, more machine dependence)
- Static scheduling
- Static issue (i.e. VLIW)
- Static branch prediction
- Alias/pointer analysis
- Static speculation
• Hardware techniques (more stable performance, higher complexity, potential clock rate impact)
- Dynamic scheduling
- Dynamic issue (i.e. superscalar)
- Dynamic branch prediction
- Dynamic disambiguation
- Dynamic speculation
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture - ILP and Memory Systems
[Figure: "ILP mountain" (simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation) and "cache mountain" (simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching)]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (logarithmic scale, 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (logarithmic scale, 1,000 to 100,000,000) versus year, 1970 to 2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000, spanning the eras of bit-level, instruction-level and thread-level parallelism]
Microprocessors: where they go
Intel: More Transistors
Intel: Faster Devices
Number of Transistors in Intel's Processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
Support for 2-4 threads, but expect to get only 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple and very powerful technique to obtain more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
4 - 8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view on the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global cultural and business contexts, and so must understand them. These two are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Moore's Law 1 - number of transistors
Moore's Law 1 - Linewidths
One of the key drivers behind the industry's ability to double transistor counts every 18 to 24 months is the continuous reduction in linewidths. Shrinking linewidths not only enable more components to fit onto an IC (typically 2x per linewidth generation) but also lower costs (typically 30% per linewidth generation)
Moore's Law 1 - Die size
Shrinking linewidths have slowed the rate of growth in die size to 1.14x per year, versus 1.38 to 1.58x per year for transistor counts, and since the mid-nineties accelerating linewidth shrinks have halted and even reversed the growth in die sizes
Moore's Law in Action
The number of transistors on chip doubles annually
Moore's Law 1 - Microprocessor
Moore's Law 1 - Capacity of Single-Chip DRAM
Improving frequency via pipelining
Process technology and microarchitecture innovations enable the frequency to double every process generation
The figure presents the contribution of both: as the process improves, the frequency increases and the average amount of work done in each pipeline stage decreases
Process Complexity
Shrinking linewidths isn't free. Linewidth shrinks require process modifications to deal with a variety of issues that come up from shrinking the devices, leading to increasing complexity in the processes being used
Moore's Law 2 (Rock's Law)
In 1996 Intel augmented Moore's law (the number of transistors on a processor doubles approximately every 18 months) with Moore's law 2
Law 2 says that as the sophistication of chips increases, the cost of fabrication rises exponentially
The cost of semiconductor tools doubles every four years. By this logic, chip fabrication plants, or fabs, were supposed to cost $5 billion each by the late 1990s and $10 billion by now
Moore's Law 2 (Rock's Law) - continued
For example, in 1986 Intel manufactured the 386, which counted 250,000 transistors, in fabs costing $200 million. In 1996 a $2 billion facility was needed to produce the Pentium processor, which counted 6 million transistors
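The doubling rule and the Intel example can be compared with a short calculation. This is only a sketch: the function name is hypothetical, and the two fab costs are the ones quoted in the example ($200 million in 1986, $2 billion in 1996).

```python
import math


def projected_fab_cost(base_cost_musd, base_year, year, doubling_years=4.0):
    """Fab cost in millions of dollars if cost doubles every doubling_years."""
    return base_cost_musd * 2 ** ((year - base_year) / doubling_years)


# Projection from the 1986 fab using the four-year doubling rule
cost_1996 = projected_fab_cost(200, 1986, 1996)

# Doubling period implied by the two quoted data points (200 -> 2000 in 10 years)
implied_doubling = (1996 - 1986) / math.log2(2000 / 200)

print(f"projected 1996 cost: ${cost_1996:.0f} million")
print(f"implied doubling period: {implied_doubling:.2f} years")
```

On these two data points alone, the quoted costs grew somewhat faster than a strict four-year doubling would predict.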
Moore's Law 2 (Rock's Law)
The Cost of Semiconductor Tools Doubles Every Four Years
Machrone's Law
The PC you want to buy will always be $5000
Metcalfe's Law
A network's value grows proportionately to the number of its users squared
Wirth's Law
Software is slowing faster than hardware is accelerating
Performance and new technology generation
According to Moore's law, each new generation has approximately doubled logic circuit density and increased performance by about 40%, while quadrupling memory capacity
The increase in components per chip comes from the following key factors:
• A factor of two in component density comes from a $2^{0.5}$ shrink in each lithography dimension ($2^{0.5}$ in x and $2^{0.5}$ in y)
• An additional factor of $2^{0.5}$ comes from an increase in chip area
• A final factor of $2^{0.5}$ comes from device and circuit cleverness
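Multiplying the factors listed above shows where the quadrupling per generation comes from. The sketch reads the extracted "205" as 2^0.5 (the square root of two), which is an interpretation of the garbled text; the variable names are illustrative.

```python
import math

shrink = math.sqrt(2)          # 2**0.5 shrink in each lithography dimension
density = shrink * shrink      # x and y together: factor of 2 in density
area = math.sqrt(2)            # growth in chip area
cleverness = math.sqrt(2)      # device and circuit cleverness

components = density * area * cleverness
print(f"density x{density:.2f}, components per chip x{components:.2f}")
```

The product is four, consistent with the statement that memory capacity quadruples with each generation.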
Development in ICs
Semiconductor Industry Association Roadmap Summary for High-End Processors
Specification / year | 1997 | 1999 | 2001 | 2003 | 2006 | 2009 | 2012
Feature size (micron) | 0.25 | 0.18 | 0.15 | 0.13 | 0.1 | 0.07 | 0.05
Supply voltage (V) | 1.8-2.5 | 1.5-1.8 | 1.2-1.5 | 1.2-1.5 | 0.9-1.2 | 0.6-0.9 | 0.5-0.6
Transistors/chip (millions) | 11 | 21 | 40 | 76 | 200 | 520 | 1400
DRAM bits/chip (mega) | 167 | 1070 | 1700 | 4290 | 17200 | 68700 | 275000
Die size (mm2) | 300 | 340 | 385 | 430 | 520 | 620 | 750
Global clock freq (MHz) | 750 | 1200 | 1400 | 1600 | 2000 | 2500 | 3000
Local clock freq (MHz) | 750 | 1250 | 1500 | 2100 | 3500 | 6000 | 10000
Maximum power/chip (W) | 70 | 90 | 110 | 130 | 160 | 170 | 175
Total transistors/chip
[Figure: number of transistors per chip (millions, 0 to 1600) versus technology (micron) and year, from 0.25 micron in 1997 to 0.05 micron in 2012]
Clock Frequency Versus Year for Various Representative Machines
Limits in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range
For higher-frequency clocking (> 10 GHz), new ideas and new ways of designing digital systems are needed
[Figure: global and local clock frequency (MHz, 0 to 12,000) versus technology (micron) and year, from 0.25 micron in 1997 to 0.05 micron in 2012]
Intel's Microprocessors Clock Frequency
Technology Trends - Example
As an illustration of just how fast computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987, and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: the span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l
In 2000: thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in about 39 min, and to make the one-way trip on less than 10 liters of gasoline
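The arithmetic of the solution can be reproduced in a few lines. The variable names are illustrative; because the sketch carries unrounded factors through, its final figures differ slightly from the slide, which rounds the factors to 20.1 and 194.6 before multiplying.

```python
speed_1977, economy_1977 = 150.0, 10.0    # km/h and km/l

factor_1987 = 1.35 ** 10                  # ten years at 35% per year
factor_2000 = 1.5 ** 13                   # thirteen years at 50% per year

speed_1987 = speed_1977 * factor_1987
economy_1987 = economy_1977 * factor_1987
speed_2000 = speed_1987 * factor_2000
economy_2000 = economy_1987 * factor_2000

print(f"1987: x{factor_1987:.1f} -> {speed_1987:.0f} km/h, {economy_1987:.0f} km/l")
print(f"2000: x{factor_2000:.1f} -> {speed_2000:.0f} km/h, {economy_2000:.0f} km/l")
```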
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm
Ideally, processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7
With such scaling, typical improvement figures are the following:
• 1.4 - 1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
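A back-of-the-envelope check of these scaling figures is sketched below. It takes the linear scaling factor as 0.7 and the voltage reduction as 1.35 times; the assumptions that gate delay tracks linear dimension and that switched capacitance shrinks with linear dimension are simplifications introduced here, not statements from the slides.

```python
s = 0.7                                   # linear scaling factor per generation

speedup = 1 / s                           # ASSUMPTION: gate delay tracks linear dimension
area_ratio = s * s                        # transistor area relative to previous generation
v_ratio = 1 / 1.35                        # operating voltage lowered 1.35 times
cap_ratio = s                             # ASSUMPTION: capacitance shrinks with dimension
energy_ratio = cap_ratio * v_ratio ** 2   # switching energy ~ C * V^2

print(f"transistor speed: x{speedup:.2f}")
print(f"transistor area: x{area_ratio:.2f} (about two times smaller)")
print(f"switching energy: {1 / energy_ratio:.1f} times lower")
```

Under these assumptions the switching energy drops by roughly 2.6 times, in the neighborhood of the quoted factor of three.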
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and integration (from ~10K to 1M devices) in 25 years
Microarchitecture attempts to increase both IPC and frequency
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction and out-of-order execution can increase instructions per cycle (IPC)
Pipelining, as a microarchitecture idea, helps to increase frequency
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages
With frequencies higher than 1 GHz, more than 20 pipeline stages are used
[Figure: relative improvement (0 to 20) versus pipeline depth for frequency, CPI, performance and power]
Performance of memory and CPU
Memory in a computer system is hierarchically organized
In 1980, microprocessors were often designed without caches
Nowadays, microprocessors often come with two levels of caches
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed
Relative processor/memory speed
Types of Memories
MOS memories
• RAMs (volatile: contents lost at power off): DRAM, SRAM
• ROMs (non-volatile: contents kept at power off): ROM, EPROM, EEPROM, FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdoteAn anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 to 4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only make the memory bottleneck worse
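The "three out of four cycles" figure can be checked with a short calculation, reading the measured values as 4.2 to 4.5 cycles per instruction. The sketch assumes, for simplicity, that every cycle that retires anything retires exactly one instruction; the function name is illustrative.

```python
def stalled_cycle_fraction(cpi: float) -> float:
    """Fraction of cycles retiring nothing, assuming one instruction per retiring cycle."""
    return 1.0 - 1.0 / cpi


for cpi in (4.2, 4.5):
    print(f"CPI {cpi}: about {stalled_cycle_fraction(cpi):.0%} of cycles retire nothing")
```

On processors that can retire several instructions in one cycle, the real fraction of empty cycles would be at least this large.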
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns
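The bandwidth arithmetic in this anecdote can be reproduced in a few lines. This is only a sketch of the numbers quoted above (16 bytes every 8 ns today, three factors of two, 32 bytes every 2 ns in the future); the function name is illustrative.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer: float, period_ns: float) -> float:
    """Bytes per nanosecond is numerically equal to GBytes/s (1e9 bytes per second)."""
    return bytes_per_transfer / period_ns


today = bandwidth_gbytes_per_s(16, 8)        # 16 bytes every 8 ns

# Future chip from the text: twice the clock rate, twice the issue width,
# and two instruction streams, all multiplying the demand.
future_demand = today * 2 * 2 * 2

required_pins = bandwidth_gbytes_per_s(32, 2)  # 32 bytes every 2 ns

print(f"today: {today} GBytes/s, future demand: {future_demand} GBytes/s")
print(f"32 bytes every 2 ns delivers {required_pins} GBytes/s")
```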
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~32 to 128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
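The effect of the two cache levels can be estimated with the standard average memory access time (AMAT) formula. The sketch below picks one point inside each range quoted above (2-cycle first level catching 95% of accesses, 8-cycle second level catching 50% of first-level misses, about 100 cycles to main memory); those specific picks and the function name are assumptions for illustration.

```python
def amat(l1_hit: float, l1_miss_rate: float,
         l2_hit: float, l2_local_miss_rate: float,
         memory_latency: float) -> float:
    """Average memory access time in cycles for a two-level cache hierarchy."""
    return l1_hit + l1_miss_rate * (l2_hit + l2_local_miss_rate * memory_latency)


# First level: 2 cycles, misses 5% of accesses.
# Second level: 8 cycles, misses 50% of the accesses that reach it.
# Main memory: about 100 cycles.
with_l2 = amat(2, 0.05, 8, 0.50, 100)

# Same first-level cache, but every miss goes straight to main memory.
without_l2 = 2 + 0.05 * 100

print(f"AMAT with L2: {with_l2:.2f} cycles, without L2: {without_l2:.2f} cycles")
```

Even a second level that catches only half of the first-level misses noticeably lowers the average access time in this model.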
Memory Hierarchy Impact on Performance
Off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
Conclusions concerning memory problems
Today’s chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems: caches, busses, bandwidth, and latency
Conclusions concerning memory problems - continued
If the memory research community would follow the microprocessor community’s lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System-level parameters most affect performance
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM): a new type of memory
Magnetic RAM (MRAM) or Nanotech RAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter on a silicon wafer
MRAM is nonvolatile and has the fast read-and-write performance of static RAM (SRAM)
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM)
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] and memory density [Gb] for DRAM and FLASH (SLC, MLC with 2 bits/cell, MLC with 3 bits/cell) over 2003, 2005, and 2010; design rules shrink from 100 nm to 55 nm and densities range from 512 Mb to 16 Gb]
Surpassing the Prediction from Moore’s Law – DRAM vs MRAM
The famous Moore’s Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year
[Figure: density [Gb], log scale from 0.1 to 100, versus year (2000 to 2010) for FLASH (SLC, MLC with 2 bits/cell, MLC with 3 bits/cell; two-fold density increase per year) and DRAM, compared with the Moore’s Law trend]
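The difference between the two growth models is easy to underestimate, so here is a minimal sketch comparing them. It assumes the 18-month (1.5-year) doubling period usually quoted for Moore's Law and the one-year doubling claimed for NAND Flash; the six-year horizon and the function name are illustrative choices.

```python
def density_after(years: float, doubling_period_years: float, start: float = 1.0) -> float:
    """Density after `years`, doubling once every `doubling_period_years`."""
    return start * 2 ** (years / doubling_period_years)


YEARS = 6  # illustrative horizon

moore = density_after(YEARS, 1.5)  # doubling every 1.5 years
flash = density_after(YEARS, 1.0)  # doubling every year

print(f"after {YEARS} years: Moore's Law x{moore:.0f}, yearly doubling x{flash:.0f}")
```

Over the same period the yearly-doubling curve ends up several times above the 1.5-year curve, and the ratio itself keeps growing.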
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore’s Law
Overall memory prediction roadmap - continued
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production
• During 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
“CMOS circuits dissipate little power by nature. So believed circuit designers” (Kuroda-Sakurai, 95)
“By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced”
Power dissipation in time
[Figure: power (W), log scale from 0.01 to 100, versus year (80 to 95); trend of x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage [V], power per chip [W], and VDD current [A] versus year, 1998 to 2014]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA), and the Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai’s ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: Your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are:
$$P_{avg} = P_1 + P_2 + P_3 + P_4 = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4$$
where
P1 – capacitive switching power (dynamic - dominant)
P2 – short circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
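To get a feeling for the relative size of the components listed above, the first three terms can be evaluated directly. This is a minimal sketch; the function name and every numeric value (activity factor, capacitance, currents) are hypothetical and chosen only to illustrate the formula.

```python
def cmos_power(p_t: float, c_load: float, v_dd: float, f_clk: float,
               i_sc: float, i_leak: float) -> tuple[float, float, float]:
    """Return (P1, P2, P3) in watts for the three major CMOS power components."""
    p1 = p_t * c_load * v_dd ** 2 * f_clk  # capacitive switching power
    p2 = i_sc * v_dd                       # short-circuit power
    p3 = i_leak * v_dd                     # leakage current power
    return p1, p2, p3


# Hypothetical values: 10% switching activity, 1 nF total load capacitance,
# 2.5 V supply, 100 MHz clock, 2 mA short-circuit and 0.1 mA leakage current.
p1, p2, p3 = cmos_power(p_t=0.1, c_load=1e-9, v_dd=2.5, f_clk=100e6,
                        i_sc=2e-3, i_leak=1e-4)
total = p1 + p2 + p3

print(f"P1 = {p1 * 1e3:.2f} mW, P2 = {p2 * 1e3:.2f} mW, P3 = {p3 * 1e3:.2f} mW")
print(f"switching share of total: {p1 / total:.0%}")
```

With these assumed values the switching term dominates, which matches the "dynamic - dominant" label on P1.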
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], marked at 0.6, 3.0, and 5.0 V]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is:
Decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product ptCL is called the average switched capacitance per cycle, and this capacitance can be reduced at the system, architectural, RTL, circuit, or technology level
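The trade-off in point b) can be sketched numerically. The quadratic power dependence comes from the text; the delay model is an assumption brought in for illustration (the alpha-power-law form, delay proportional to Vdd / (Vdd - Vt)^alpha), and the threshold voltage, alpha, and supply values below are hypothetical.

```python
def relative_switching_power(v_dd: float, v_ref: float) -> float:
    """Switching power relative to the reference supply (power ~ Vdd^2)."""
    return (v_dd / v_ref) ** 2


def relative_gate_delay(v_dd: float, v_ref: float,
                        v_t: float = 0.5, alpha: float = 1.3) -> float:
    """Gate delay relative to the reference supply.

    Assumed model (not from the slides): delay ~ Vdd / (Vdd - Vt)^alpha.
    Valid only for supplies above the threshold voltage v_t.
    """
    def delay(v: float) -> float:
        return v / (v - v_t) ** alpha

    return delay(v_dd) / delay(v_ref)


V_REF = 3.3  # hypothetical reference supply in volts
for v in (3.3, 2.5, 1.65):
    power = relative_switching_power(v, V_REF)
    delay = relative_gate_delay(v, V_REF)
    print(f"Vdd = {v} V: power x{power:.2f}, delay x{delay:.2f}")
```

Under these assumptions, halving the supply cuts switching power to a quarter while the gate delay grows, which is the tension that techniques such as parallelism and multiple supply voltages try to resolve.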
Low Power and Low Energy System Design
Design levels (figure annotations: higher impact, more options):
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
The design of low power circuits can be tackled at different levels, from system to technology
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: clock generation with a main PLL (fCLK) and four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4) producing the local frequencies f11, f12, f21, f31, f32, and f41]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based scheme (phase detector with CLKREF and CLKFB inputs, up/down signals, current pump, loop filter, VCO, divider by N, regulated voltage, clock distribution and frequency multiplier logic, digital system) and DLL-based scheme (phase detector, current pump, loop filter, control, VCDL with delay cells DC1 to DCn, outputs f1, 2f1, ..., nf1, digital system)]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating (a latch on the enable signal produces the gated clock that drives the target D flip-flops; activated and deactivated clock intervals) and clock distribution (a PLL clock generator feeding blocks A, B, C, and D, with Enable_A, Enable_B, and Enable_C)]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
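The saving from suppressing the clock of an unused module can be illustrated with a toy model. This is only a sketch: it assumes a fixed energy cost per clocked cycle and ignores the overhead of the gating logic itself; the enable trace and the function name are hypothetical.

```python
def clock_energy(enable_trace: list[int], energy_per_clocked_cycle: float,
                 gated: bool) -> float:
    """Energy spent clocking a module over a trace of per-cycle enable bits.

    Without gating the module is clocked every cycle; with gating it is
    clocked only in cycles where its enable bit is set.
    """
    if gated:
        clocked_cycles = sum(1 for enable in enable_trace if enable)
    else:
        clocked_cycles = len(enable_trace)
    return clocked_cycles * energy_per_clocked_cycle


# Hypothetical trace: the module is needed in 3 of 10 cycles.
trace = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]

ungated = clock_energy(trace, 1.0, gated=False)
gated = clock_energy(trace, 1.0, gated=True)
saving = 1 - gated / ungated

print(f"ungated: {ungated}, gated: {gated}, saving: {saving:.0%}")
```

In this model the saving equals the fraction of idle cycles, so gating pays off most for modules that are rarely used.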
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
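[Figure: power manager (an observer and a controller that use workload information, receive observations from the system, and issue commands to it) and power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt), and SLEEP (P = 0.16 mW, wait for wake-up event); transition times ~10 µs, ~90 µs, and 160 ms]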
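A rough sketch of what a power state machine buys: average power as a weighted sum over the time spent in each state. The state powers are read from the figure labels (taking the SLEEP label as 0.16 mW, which is an interpretation of the extracted text); the time fractions are hypothetical, and transition time and energy are deliberately ignored.

```python
# State powers in mW, read from the figure labels (SLEEP value is an interpretation).
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(time_fractions: dict[str, float]) -> float:
    """Average power for given fractions of time per state (must sum to 1)."""
    if abs(sum(time_fractions.values()) - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(STATE_POWER_MW[state] * fraction
               for state, fraction in time_fractions.items())


always_run = average_power_mw({"RUN": 1.0})

# Hypothetical workload: busy 20% of the time, idle 30%, asleep 50%.
managed = average_power_mw({"RUN": 0.2, "IDLE": 0.3, "SLEEP": 0.5})

print(f"always running: {always_run:.2f} mW, power managed: {managed:.2f} mW")
```

A real power manager also has to weigh the wake-up latency of each state against the expected idle time, which this sketch leaves out.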
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown by clock, memory, control, I/O, and datapath (labelled shares include 12, 16, 25, and 43). Dynamic instruction statistics by arithmetic op, data move, control flow, compare op, logical op, and others]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline – Microprocessors’ Generations
• First Generation 1971-78
– Behind the power curve
• Second Generation 1979-85
– Becoming “real” computers
• Third Generation 1985-89
– Challenging the “establishment”
• Fourth Generation 1990-
– Architectural and performance leadership
The microprocessor today
When we say “microprocessor” today, we generally mean the shaded area of the figure
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
• Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
– 3500 transistors
– First microprocessor-based computer (Micral)
  Targeted at laboratory instrumentation
  Mostly sold in Europe
Intel 8080
• Intel’s first 16-bit architecture
– Delivered in 1974
– 4800 transistors
– Performance < 0.2 MIPS
• Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
  $297, or $395 with case
  256 bytes of memory, expandable to 64K
  Keyboard and floppy
  100-line bus becomes S-100, first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
  Homebrew Computer Club
Intel 8086
• Introduced in 1978
– Performance < 0.5 MIPS
• New 16-bit architecture
– “Assembly language” compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces PC
– Based on 8088 (8-bit bus version of 8086)
Second Generation 1979-85
• Becoming “real” computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
– First 32-bit architecture
  initial 16-bit implementation
– First flat 32-bit address
  Support for paging
– General-purpose register architecture
  Loosely based on PDP-11
• First implementation in 1979
– 68,000 transistors
– < 1 MIPS
• Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the “establishment”
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
• Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
– True from 1985-present
• Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to high clock rates
– More transistors: architectural ideas turn transistors into performance
  Responsible for about half the yearly performance growth
• Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction level parallelism
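To see what a sustained yearly factor means, the growth can simply be compounded. The sketch reads the slide's growth figure as 1.6x per year (an interpretation of the extracted text); the function names and the chosen horizons are illustrative.

```python
import math


def cumulative_speedup(rate_per_year: float, years: float) -> float:
    """Total performance factor after compounding a yearly growth factor."""
    return rate_per_year ** years


def years_to_reach(factor: float, rate_per_year: float) -> float:
    """Years needed to reach a total performance factor at a yearly growth factor."""
    return math.log(factor) / math.log(rate_per_year)


RATE = 1.6  # yearly growth factor, as read from the slide

print(f"after 10 years: x{cumulative_speedup(RATE, 10):.0f}")
print(f"years to reach x1000: {years_to_reach(1000, RATE):.1f}")
```

At this rate a hundredfold improvement takes roughly a decade, which is the scale of change the generation slides describe.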
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
– CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching
  First multilevel cache: 1986
  Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
  early restart after cache miss (1992)
  nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
  prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
  superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple issue microprocessors
• 1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out-of-order
– speculative execution depending on branch prediction
• “Off-the-shelf” ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
– On-chip
– Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
• First multiple issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
• Implemented in 1991
– 1.3M transistors
– 50 MIPS
• Used primarily as attached processor (e.g., graphics)
MIPS R10000
• First speculative processor
– Instructions scheduled and executed out-of-order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in-flight)
– Maintain precise state by completing instructions in order
• Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
• Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today’s Uniprocessor ILP Menu
• Software Techniques
– Static scheduling
– Static issue (i.e., VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
  Lower hardware complexity
  More longer range analysis
  More machine dependence
• Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e., superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
  More stable performance
  Higher complexity
  Potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture--ILP and Memory Systems
[Figure: “ILP Mountain” (simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation) and “Cache Mountain” (simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching)]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance, log scale from 0.1 to 100, versus year (1965 to 1995) for supercomputers, mainframes, minicomputers, and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip, log scale from 1,000 to 100,000,000, versus year (1970 to 2005), with eras of bit-level, instruction-level, and thread-level parallelism; processors marked: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel’s processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The more pronounced are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a (single) processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today’s processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4)
Support for 2-4 threads, but expect to get only 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
4 – 8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say “microprocessor” tomorrow, we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that “Where you stand depends on where you sit”
In this context, starting from our positions and experiences, this is our view concerning the theme “How shall we satisfy the long-term educational needs of engineers?”
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education (not just the undergraduate curriculum) organized to support today’s graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
But, as we said earlier, engineering is changing
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT
Discrete mathematics, not continuous math, is the underpinning of IT
It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast
Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields
More knowledge of the full spectrum will be fundamental
Engineering is global and is performed in a holistic business context
The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them
They too are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Moore's Law 1 - Linewidths
One of the key drivers behind the industry's ability to double transistor counts every 18 to 24 months is the continuous reduction in linewidths. Shrinking linewidths not only enable more components to fit onto an IC (typically 2x per linewidth generation) but also lower costs (typically 30% per linewidth generation).
Moore's Law 1 - Die size
Shrinking linewidths have slowed the rate of growth in die size to 1.14x per year, versus 1.38x to 1.58x per year for transistor counts, and since the mid-nineties accelerating linewidth shrinks have halted and even reversed the growth in die sizes.
Moore's Law in Action
The number of transistors on chip doubles annually
Moore's Law 1 - Microprocessor
Moore's Law 1 - Capacity of Single-Chip DRAM
Improving frequency via pipelining
Process technology and microarchitecture innovations enable the frequency to double every process generation.
The figure presents the contribution of both: as the process improves, the frequency increases and the average amount of work done in each pipeline stage decreases.
Process Complexity
Shrinking linewidths isn't free. Linewidth shrinks require process modifications to deal with a variety of issues that come up from shrinking the devices, leading to increasing complexity in the processes being used.
Moore's Law 2 (Rock's Law)
In 1996 Intel augmented Moore's law (the number of transistors on a processor doubles approximately every 18 months) with Moore's law 2.
Law 2 says that as the sophistication of chips increases, the cost of fabrication rises exponentially.
The cost of semiconductor tools doubles every four years. By this logic, chip fabrication plants, or fabs, were supposed to cost $5 billion each by the late 1990s and $10 billion by now.
Moore's Law 2 (Rock's Law) - continued
For example, in 1986 Intel manufactured the 386, which had 250,000 transistors, in fabs costing $200 million. In 1996 the Pentium processor, which had 6 million transistors, needed a $2 billion facility.
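The two fab-cost data points quoted on this slide can be compared with the four-year doubling rule. The sketch below is only illustrative arithmetic; the function name `doubling_time` is an assumption made for this example, and the dollar figures are the ones quoted on the slide.

```python
import math


def doubling_time(years, cost_start, cost_end):
    """Doubling period implied by exponential growth between two costs."""
    return years * math.log(2) / math.log(cost_end / cost_start)


# Figures quoted on the slide: $200 million (1986) and $2 billion (1996).
implied_years = doubling_time(1996 - 1986, 200e6, 2e9)

# What a strict "doubles every four years" rule would give for 1996.
projected_1996 = 200e6 * 2 ** ((1996 - 1986) / 4)

print(f"implied doubling time: {implied_years:.2f} years")
print(f"four-year rule, 1996 cost: ${projected_1996 / 1e9:.2f} billion")
```

Under these numbers the quoted example grows somewhat faster than one doubling every four years, which is consistent with the rule being a rough guide rather than an exact law.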
Moore's Law 2 (Rock's Law): The Cost of Semiconductor Tools Doubles Every Four Years
Machrone's Law
The PC you want to buy will always be $5000
Metcalfe's Law
A network's value grows proportionately to the number of its users squared
Wirth's Law
Software is slowing faster than hardware is accelerating
Performance and new technology generation
According to Moore's law, each new generation has approximately doubled logic circuit density and increased performance by about 40%, while quadrupling memory capacity.
The increase in components per chip comes from the following key factors:
The factor of two in component density comes from a $2^{0.5}$ shrink in each lithography dimension ($2^{0.5}$ in x and $2^{0.5}$ in y). An additional factor of $2^{0.5}$ comes from an increase in chip area. A final factor of $2^{0.5}$ comes from device and circuit cleverness.
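As a quick check of how these per-generation factors combine, the following sketch multiplies them out. It assumes each factor is exactly the square root of two; the variable names are illustrative only.

```python
import math

# Per-generation factors named on the slide, each taken as 2**0.5.
shrink_x = math.sqrt(2)     # lithography shrink in x
shrink_y = math.sqrt(2)     # lithography shrink in y
area_growth = math.sqrt(2)  # increase in chip area
cleverness = math.sqrt(2)   # device and circuit cleverness

density_gain = shrink_x * shrink_y
components_gain = density_gain * area_growth * cleverness

print(f"density gain per generation:    {density_gain:.2f}x")
print(f"components gain per generation: {components_gain:.2f}x")
```

The lithography shrink alone accounts for the doubling of density, and the two further factors bring the total to the quadrupling quoted for memory capacity.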
Development in ICs
Semiconductor Industry Association Roadmap Summary for high-end Processors

Specification / year           1997     1999     2001     2003     2006     2009     2012
Feature size (micron)          0.25     0.18     0.15     0.13     0.1      0.07     0.05
Supply voltage (V)             1.8-2.5  1.5-1.8  1.2-1.5  1.2-1.5  0.9-1.2  0.6-0.9  0.5-0.6
Transistors/chip (millions)    11       21       40       76       200      520      1400
DRAM bits/chip (mega)          167      1070     1700     4290     17200    68700    275000
Die size (mm^2)                300      340      385      430      520      620      750
Global clock freq (MHz)        750      1200     1400     1600     2000     2500     3000
Local clock freq (MHz)         750      1250     1500     2100     3500     6000     10000
Maximum power/chip (W)         70       90       110      130      160      170      175
[Figure: total transistors per chip (millions, 0-1600) versus technology (micron) and year, 1997-2012]
Clock Frequency Versus Year for Various Representative Machines
Limiting in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range.
For higher frequency clocking (>10 GHz), new ideas and new ways of designing digital systems are needed.
[Figure: global and local clock frequency (MHz, 0-12000) versus technology (micron) and year, 1997-2012]
Intel's Microprocessors Clock Frequency
Technology Trends - Example
As an illustration of just how quickly computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly.
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987 and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l.
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in about 39 minutes, and to make the one-way trip on less than 10 liters of gasoline.
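The compound-growth arithmetic in this example can be reproduced with a few lines of code. This is a minimal sketch: the helper name `improve` and the average earth-moon distance of 384,400 km are assumptions introduced here, and the results differ slightly from the slide because the slide rounds the growth factors before multiplying.

```python
def improve(value, rate_per_year, years):
    """Compound a value at a fixed yearly improvement rate."""
    return value * (1 + rate_per_year) ** years


speed_1977_kmh, economy_1977_kml = 150.0, 10.0

speed_1987 = improve(speed_1977_kmh, 0.35, 10)
economy_1987 = improve(economy_1977_kml, 0.35, 10)
speed_2000 = improve(speed_1987, 0.50, 13)
economy_2000 = improve(economy_1987, 0.50, 13)

EARTH_MOON_KM = 384_400  # assumed average distance, not from the slides
minutes_to_moon = EARTH_MOON_KM / speed_2000 * 60
liters_one_way = EARTH_MOON_KM / economy_2000

print(f"1987: {speed_1987:.0f} km/h, {economy_1987:.0f} km/l")
print(f"2000: {speed_2000:.0f} km/h, {economy_2000:.0f} km/l")
print(f"moon: {minutes_to_moon:.1f} min, {liters_one_way:.1f} l one way")
```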
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law.
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with >10^11 components are expected to be available.
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm.
Ideally, processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7.
With such scaling, typical improvement figures are the following:
• 1.4 - 1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry
It is characterized by growth of 1000 times in frequency (from 1 MHz to 1 GHz) and in integration (from ~10K to ~10M devices) in 25 years.
Microarchitecture attempts to increase both IPC and frequency.
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction and out-of-order execution can increase instructions per cycle (IPC).
Pipelining, as a microarchitecture idea, helps to increase frequency.
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program.
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages.
With frequencies higher than 1 GHz, more than 20 pipeline stages are used.
[Figure: relative improvement (0-20) in frequency, CPI, performance and power versus pipeline depth]
Performance of memory and CPU
Memory in a computer system is hierarchically organized.
In 1980 microprocessors were often designed without caches.
Nowadays microprocessors often come with two levels of caches.
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987.
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed.
Relative processor/memory speed
Type of Memories
MOS memories
• RAMs (DRAM, SRAM): VOLATILE - power off, contents lost
• ROMs (ROM, EPROM, EEPROM, FLASH): NON VOLATILE - power off, contents kept
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 - 4.5 CPU cycles per instruction retired.
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed.
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse.
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width and two instruction streams.
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy.
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns.
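The pin-bandwidth figures in this anecdote follow from simple multiplication, sketched below. The function name `bandwidth_gbytes_per_s` is illustrative, and a gigabyte is taken as 10^9 bytes so that bytes per nanosecond equal GBytes/s.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer, ns_per_transfer):
    """Bytes per nanosecond is numerically equal to GBytes/s (10**9 bytes)."""
    return bytes_per_transfer / ns_per_transfer


today = bandwidth_gbytes_per_s(16, 8)

# Twice the clock rate, twice the issue width, two instruction streams.
clock_factor, issue_factor, stream_factor = 2, 2, 2
future = today * clock_factor * issue_factor * stream_factor

print(f"today:  {today:.0f} GBytes/s")
print(f"future: {future:.0f} GBytes/s")
print(f"32 bytes every 2 ns: {bandwidth_gbytes_per_s(32, 2):.0f} GBytes/s")
```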
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance.
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components.
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data.
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory.
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip.
The first level is ~32 - 128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses.
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.
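To see what these hit rates and latencies mean together, the sketch below estimates an average memory access time. It assumes a simple serial model in which each miss adds the latency of the next level, takes the upper end of each quoted latency range, and uses the figure of about 100 cycles for main memory; the function name and the model itself are assumptions made for illustration, not something stated on the slides.

```python
def average_access_cycles(l1_hit, l1_cycles, l2_hit, l2_cycles, mem_cycles):
    """Average access time when each miss falls through to the next level."""
    l1_miss = 1.0 - l1_hit
    l2_miss = 1.0 - l2_hit
    return l1_cycles + l1_miss * (l2_cycles + l2_miss * mem_cycles)


amat = average_access_cycles(
    l1_hit=0.95, l1_cycles=3,   # first level: ~95% of accesses, 2-3 cycles
    l2_hit=0.50, l2_cycles=10,  # second level: ~50% of L1 misses, 6-10 cycles
    mem_cycles=100,             # main memory: about 100 cycles
)
no_cache = 100.0

print(f"average access time with two cache levels: {amat:.1f} cycles")
print(f"speedup over going to memory every time:   {no_cache / amat:.1f}x")
```

Even with a modest second-level hit rate, the hierarchy brings the average access cost far below the raw main-memory latency under this model.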
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles.
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
As conclusion concerning memory - problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems: caches, busses, bandwidth and latency.
As conclusion concerning memory - problems continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture-level and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM) - a new type of memory
Magnetic RAM - MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes.
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
The 10 Gbit devices consist of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter on a silicon wafer.
MRAM is nonvolatile and has the fast read-and-write performance of static RAM (SRAM).
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule (nm, 0-120) and density (Gb, log scale 0.1-100) of DRAM and FLASH, 2003-2010; design rules 100 nm, 90 nm, 70 nm and 55 nm; densities from 512 Mb to 16 Gb; FLASH variants SLC, MLC (2 bits/cell) and MLC (3 bits/cell)]
Surpassing the Prediction from Moore's Law - DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
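The difference between the two doubling periods compounds quickly. The sketch below compares them over an arbitrary five-year window; the window length and the function name `growth` are assumptions chosen for illustration.

```python
def growth(years, doubling_period_years):
    """Density multiplier after a number of years at a fixed doubling period."""
    return 2 ** (years / doubling_period_years)


YEARS = 5  # illustrative window, not taken from the slides

moore_growth = growth(YEARS, 1.5)  # doubling every 1.5 years
flash_growth = growth(YEARS, 1.0)  # doubling every year

print(f"doubling every 1.5 years: {moore_growth:.1f}x in {YEARS} years")
print(f"doubling every year:      {flash_growth:.0f}x in {YEARS} years")
```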
[Figure: memory density (Gb, log scale 0.1-100) versus year, 2000-2010, for DRAM and FLASH (MLC 2 bits/cell, MLC 3 bits/cell), compared with the Moore's Law trend; FLASH follows a 2-fold density increase per year]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore's Law.
Overall memory prediction roadmap - continued
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
• During 1995, the energy consumption of all PC machines installed in the USA was 60 × 10^6 MWh
• During 2000, the energy consumption of all PC machines installed in the USA was 10% of the total energy production
• During 2015, one expects that the energy consumption of all PC machines will be 15% greater than in 1995, or 69 × 10^6 MWh
Power consumption
• battery operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
Typical Low-Power Applications
"CMOS circuits dissipate little power by nature. So believed circuit designers" (Kuroda-Sakurai 95)
"By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced"
[Figure: power (W, log scale 0.01-100) versus year, 1980-1995; trend: x4 every 3 years]
Power dissipation in time
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage (V), power per chip (W) and VDD current (A) versus year, 1998-2014]
International Technology Roadmap for Semiconductors 1999 update sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA) Electronic Industries Association of Japan (EIAJ) Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
Your car starter
Power Consumption - New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits, plus a minor static term, are
$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 - capacitive switching power (dynamic, dominant)
P2 - short circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
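A small numeric sketch can make the relative size of these terms concrete. The function below evaluates the switching, short-circuit and leakage terms; every numeric value (activity factor, capacitance, currents, voltage, frequency) is a hypothetical placeholder chosen only for illustration and is not taken from the slides.

```python
def power_components(p_t, c_load, v_dd, f_clk, i_sc, i_leakage):
    """Return (P1, P2, P3): switching, short-circuit and leakage power in watts."""
    p1 = p_t * c_load * v_dd ** 2 * f_clk  # capacitive switching power
    p2 = i_sc * v_dd                       # short-circuit power
    p3 = i_leakage * v_dd                  # leakage power
    return p1, p2, p3


# Hypothetical operating point (all values are assumptions).
params = dict(p_t=0.2, c_load=1e-9, f_clk=500e6, i_sc=0.01, i_leakage=0.005)

p1, p2, p3 = power_components(v_dd=1.8, **params)
p1_half, _, _ = power_components(v_dd=0.9, **params)

print(f"P1 = {p1:.3f} W, P2 = {p2:.3f} W, P3 = {p3:.3f} W")
print(f"P1 at half the supply voltage: {p1_half:.3f} W")
```

Halving the supply voltage in this sketch cuts the switching term to a quarter, which is the quadratic dependence on the supply voltage that makes voltage reduction so attractive.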
Research Efforts in Low-Power Design
$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching-off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Less pipeline stages
• Use double-edge flip-flop
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Term of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V), 0.6-5.0 V]
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Needs for Low-Power
Low-Power Design Techniques
The basic idea is
Decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low-activity conditions for various parts of the circuit and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
a) Reduction in fCLK is an option acceptable when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way of reducing power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product pt CL is called the average switched capacitance per cycle, and it can be reduced at the system, architectural, RTL, circuit or technology level
General Approaches to Reduce Power
Low Power and Low Energy System Design
Design levels, from higher impact and more options (top) downwards:
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology
Multiple Frequency on the Chip as Technique to Reduce Power
Multiple frequencies on the chip is a less aggressive approach that is attracting more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL generating fCLK feeds four local clock generators (PLL 1/DLL 1 to PLL 4/DLL 4), each supplying its own local frequencies (f11, f12; f21; f31, f32; f41)]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation (phase detector, current pump, loop filter, VCO and divider by N, with CLKREF and CLKFB inputs and up/down signals, driving the clock distribution and frequency multiplier of the digital system at a regulated voltage) and DLL-based clock generation (phase detector, current pump, loop filter and a voltage-controlled delay line, VCDL, with delay cells DC1 to DCn and control logic, producing f1, 2f1, ..., nf1)]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating (an enable signal, captured by a latch or DFF, gates the clock to the target flip-flops, which are activated or deactivated) and clock distribution (a PLL clock generator drives Clk to modules A, B, C and D, gated by Enable_A, Enable_B and Enable_C)]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
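The effect of gating can be illustrated with a toy behavioural model. The sketch below assumes a latch-based gate in which the enable is captured while the clock is low and ANDed with the clock; the function names, the half-cycle sampling and the enable pattern are all assumptions made for illustration rather than a description of any particular circuit.

```python
def gated_clock(enable_half_cycles):
    """Return the gated clock waveform, one sample per half cycle.

    Even indices are the low clock phase, odd indices the high phase.
    The enable is captured by a latch that is transparent only while the
    clock is low, so enable glitches during the high phase are ignored.
    """
    latched_enable = 0
    wave = []
    for i, enable in enumerate(enable_half_cycles):
        clock = i % 2
        if clock == 0:
            latched_enable = enable
        wave.append(clock & latched_enable)
    return wave


def rising_edges(wave):
    """Count 0 -> 1 transitions, assuming the signal starts low."""
    previous = [0] + wave[:-1]
    return sum(1 for a, b in zip(previous, wave) if a == 0 and b == 1)


# Module is busy for 3 cycles, idle for 5, then busy for 2 (hypothetical).
enable_per_cycle = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
enable_half_cycles = [e for e in enable_per_cycle for _ in range(2)]
enable_half_cycles[7] = 1  # glitch during the high phase of an idle cycle

free_running = [i % 2 for i in range(len(enable_half_cycles))]
edges_ungated = rising_edges(free_running)
edges_gated = rising_edges(gated_clock(enable_half_cycles))

print(f"clock edges without gating: {edges_ungated}")
print(f"clock edges with gating:    {edges_gated}")
```

In this model the idle module receives no clock edges at all, so its flip-flops and clock wiring do not switch during the idle cycles.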
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management as another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
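[Figure: power manager and power state machine. The power manager consists of an OBSERVER, which receives observations and workload information from the SYSTEM, and a CONTROLLER, which issues commands to it. Power states: RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt) and SLEEP (P = 0.16 mW, wait for wake-up event); transition times marked ~10 µs, ~90 µs and 160 ms]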
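The benefit of moving between power states can be estimated with a weighted average. The sketch below uses the three state powers shown in the power state machine (400 mW, 50 mW and 0.16 mW); the fractions of time spent in each state are hypothetical, and transition time and energy are ignored, which is a simplifying assumption.

```python
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(time_fractions):
    """Time-weighted average power, ignoring the cost of state transitions."""
    total = sum(time_fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(STATE_POWER_MW[s] * f for s, f in time_fractions.items())


# Hypothetical workload: mostly waiting, occasionally running.
managed = average_power_mw({"RUN": 0.1, "IDLE": 0.3, "SLEEP": 0.6})
always_on = average_power_mw({"RUN": 1.0})

print(f"always running: {always_on:.1f} mW")
print(f"power managed:  {managed:.3f} mW")
```

A real power manager would also have to weigh the wake-up delay from the sleep state against the time the system is expected to stay idle.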
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown: Clock, Memory, Control, I/O, Datapath. Dynamic instruction statistics: Arithmetic op, Control Flow, Data Move, Compare op, Logical op, Others]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First generation 1971-78: Behind the power curve
• Second generation 1979-85: Becoming "real" computers
• Third generation 1985-89: Challenging the "establishment"
• Fourth generation 1990-: Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation 1971-78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3500 transistors
  - First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
  - Delivered in 1974
  - 4800 transistors
  - Performance < 0.2 MIPS
• Used in Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
    Homebrew Computer Club
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces PC
  - Based on 8088: 8-bit bus version of 8086
Second Generation 1979-85
Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture, initial 16-bit implementation
  - First flat 32-bit address
    Support for paging
  - General-purpose register architecture
    Loosely based on PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation 1990-
Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5 - same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985-present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors: architectural ideas turn transistors into performance
    Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
  - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches: add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
    early restart after cache miss (1992)
    nonblocking caches: continue during a cache miss (1994)
  - Cache-aware combos: computers, compilers, code writers
    prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock
    superscalar: uses dynamic issue decision (HW driven)
    VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple issue microprocessors
• 1995: sophisticated dynamic schemes
  - determine parallelism dynamically
  - execute instructions out-of-order
  - speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  - On-chip
  - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
• First multiple issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out-of-order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in-flight)
  - Maintain precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Use compiler-centric approach while avoiding disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between Breakdown of tasks between compiler and runtime hardwarecompiler and runtime hardware
Todayrsquos Uniprocessor ILP Todayrsquos Uniprocessor ILP MenuMenu
Software Techniquesndash Static schedulingndash Static issue (ie VLIW)ndash Static branch predictionndash Aliaspointer analysisndash Static speculation
Hardware Techniquesndash Dynamic schedulingndash Dynamic issue (ie superscalar)ndash Dynamic branch predictionndash Dynamic disambiguationndash Dynamic speculation
Wide variety of approaches both hardware and compiler intensive
Software techniques: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware techniques: more stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: the "ILP Mountain" (simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation) and the "Cache Mountain" (simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching)]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970 to 2005, across the eras of bit-level, instruction-level and thread-level (?) parallelism; data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are
a) the simultaneous multithreaded (SMT) processor and
b) the chip multiprocessor (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
They support 2-4 threads, but expect only about a 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education - not just the undergraduate curriculum - organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change - so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
But as we said earlier, engineering is changing
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These too are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
Web-based teaching, distance learning, electronic books and interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Moore's Law 1 - Die size
Shrinking linewidths have slowed the rate of growth in die size to 1.14x per year, versus 1.38 to 1.58x per year for transistor counts, and since the mid-nineties accelerating linewidth shrinks have halted and even reversed the growth in die sizes
Moore's Law in Action
The number of transistors on a chip doubles annually
Moore's Law 1 - Microprocessor
Moore's Law 1 - Capacity of Single-Chip DRAM
Improving frequency via pipelining
Process technology and microarchitecture innovations enable the frequency to double every process generation
The figure presents the contribution of both: as the process improves, the frequency increases and the average amount of work done in each pipeline stage decreases
Process Complexity
Shrinking linewidths isn't free. Linewidth shrinks require process modifications to deal with a variety of issues that come up from shrinking the devices, leading to increasing complexity in the processes being used
Moore's Law 2 (Rock's Law)
In 1996 Intel augmented Moore's law (the number of transistors on a processor doubles approximately every 18 months) with Moore's law 2
Law 2 says that as the sophistication of chips increases, the cost of fabrication rises exponentially
The cost of semiconductor tools doubles every four years. By this logic, chip fabrication plants, or fabs, were supposed to cost $5 billion each by the late 1990s and $10 billion by now
Moore's Law 2 (Rock's Law) - continued
For example, in 1986 Intel manufactured the 386, which counted 250,000 transistors, in fabs costing $200 million. In 1996 a $2 billion facility was needed to produce the Pentium processor, which counted 6 million transistors
Moore's Law 2 (Rock's Law): The Cost of Semiconductor Tools Doubles Every Four Years
Machrone's Law
The PC you want to buy will always be $5000
Metcalfe's Law
A network's value grows proportionately to the number of its users squared
Wirth's Law
Software is slowing faster than hardware is accelerating
Performance and new technology generation
According to Moore's law, each new generation has approximately doubled logic circuit density and increased performance by about 40% while quadrupling memory capacity
The increase in components per chip comes from the following key factors
The factor of two in component density comes from a 2^0.5 shrink in each lithography dimension (2^0.5 in x and 2^0.5 in y). An additional factor of 2^0.5 comes from an increase in chip area. A final factor of 2^0.5 comes from device and circuit cleverness. Together these give 2 x 2^0.5 x 2^0.5 = 4, the quadrupling of memory capacity
Development in ICs
Semiconductor Industry Association Roadmap Summary for high-end Processors
Specification / year | 1997 | 1999 | 2001 | 2003 | 2006 | 2009 | 2012
Feature size (micron) | 0.25 | 0.18 | 0.15 | 0.13 | 0.1 | 0.07 | 0.05
Supply voltage (V) | 1.8-2.5 | 1.5-1.8 | 1.2-1.5 | 1.2-1.5 | 0.9-1.2 | 0.6-0.9 | 0.5-0.6
Transistors/chip (millions) | 11 | 21 | 40 | 76 | 200 | 520 | 1400
DRAM bits/chip (mega) | 167 | 1070 | 1700 | 4290 | 17200 | 68700 | 275000
Die size (mm2) | 300 | 340 | 385 | 430 | 520 | 620 | 750
Global clock freq (MHz) | 750 | 1200 | 1400 | 1600 | 2000 | 2500 | 3000
Local clock freq (MHz) | 750 | 1250 | 1500 | 2100 | 3500 | 6000 | 10000
Maximum power/chip (W) | 70 | 90 | 110 | 130 | 160 | 170 | 175
[Figure: total transistors per chip (millions, 0 to 1600) versus technology (0.25 to 0.05 micron) and year (1997 to 2012)]
Clock Frequency Versus Year for Various Representative Machines
Limits in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range
For higher-frequency clocking (> 10 GHz), new ideas and new ways of designing digital systems are needed
[Figure: global and local clock frequency (MHz, 0 to 12000) versus technology (0.25 to 0.05 micron) and year (1997 to 2012)]
Intel's Microprocessors Clock Frequency
Technology Trends - Example
As an illustration of just how quickly computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987 and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in about 39 min and to make the one-way trip on less than 10 liters of gasoline
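The compound-growth arithmetic in this example can be checked with a short script. This is a minimal sketch; the function and variable names are illustrative and not part of the original slides. Because it does not round the 1987 values before compounding further, its year-2000 figures differ slightly from the rounded ones quoted above.

```python
def compound_factor(rate_per_year: float, years: int) -> float:
    """Total improvement factor after `years` at a fixed annual rate."""
    return (1.0 + rate_per_year) ** years


# Starting point taken from the example: an average car in 1977.
TOP_SPEED_1977_KMH = 150.0
FUEL_ECONOMY_1977_KML = 10.0

# 35% per year for 1977-1987, then 50% per year for 1987-2000.
factor_1987 = compound_factor(0.35, 1987 - 1977)
factor_2000 = compound_factor(0.50, 2000 - 1987)

speed_1987 = TOP_SPEED_1977_KMH * factor_1987
economy_1987 = FUEL_ECONOMY_1977_KML * factor_1987
speed_2000 = speed_1987 * factor_2000
economy_2000 = economy_1987 * factor_2000

print(f"1987: factor {factor_1987:.1f}, {speed_1987:.0f} km/h, {economy_1987:.0f} km/l")
print(f"2000: factor {factor_2000:.1f}, {speed_2000:.0f} km/h, {economy_2000:.0f} km/l")
```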
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law
The National Technology Roadmap for Semiconductors (NTRS) and most recently the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm
Ideally, process technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7
With such scaling, typical improvement figures are the following
• 1.4-1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
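A rough sanity check of these figures can be sketched from the 0.7 scale factor alone. The first-order model below is an assumption added for illustration and is not stated in the slides: gate delay and load capacitance are taken to track the linear dimension, and switching energy to track C x V^2. It lands close to the quoted "two times smaller" and "1.4-1.5 times faster", and in the neighbourhood of the quoted "three times lower switching power".

```python
SCALE = 0.7                # linear shrink per generation (from the text)
VOLTAGE_REDUCTION = 1.35   # supply voltage divisor per generation (from the text)

# Area of a device scales with the square of its linear dimensions.
area_ratio = SCALE ** 2

# Assumption: gate delay tracks the linear dimension, so speed scales as 1/SCALE.
speedup = 1.0 / SCALE

# Assumption: load capacitance tracks the linear dimension and the switching
# energy per event is proportional to C * V^2.
capacitance_ratio = SCALE
energy_ratio = capacitance_ratio / VOLTAGE_REDUCTION ** 2

print(f"transistor area: {1 / area_ratio:.2f}x smaller")
print(f"transistor speed: {speedup:.2f}x faster")
print(f"switching energy per event: {1 / energy_ratio:.2f}x lower")
```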
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and in integration (from ~10K to ~10M devices) in 25 years
Microarchitecture attempts to increase both IPC and frequency
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction and out-of-order execution can increase instructions per cycle (IPC)
Pipelining, as a microarchitecture idea, helps to increase frequency
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages
With frequencies higher than 1 GHz, more than 20 pipeline stages are used
[Figure: relative improvement in frequency, CPI, performance and power as a function of pipeline depth]
Performance of memory and CPU
Memory in a computer system is hierarchically organized
In 1980 microprocessors were often designed without caches
Nowadays microprocessors often come with two levels of caches
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved by 35% per year until 1986 and by 55% per year since 1987
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed
Relative processor/memory speed
Types of Memories
MOS memories
- RAMs (DRAM, SRAM): volatile, contents lost at power off
- ROMs (ROM, EPROM, EEPROM, FLASH): non-volatile, contents kept at power off
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2-4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up - 32 bytes every 2 ns
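The pin-bandwidth figures follow from simple multiplication, as the sketch below shows. The helper name is illustrative, and 1 GByte is taken as 10^9 bytes, which is an assumption about the units the slides intend.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer: int, period_ns: float) -> float:
    """Bytes per nanosecond equal GBytes per second (with 1 GByte = 1e9 bytes)."""
    return bytes_per_transfer / period_ns


# Today's chip: 16 bytes every 8 ns.
today = bandwidth_gbytes_per_s(16, 8.0)

# Future chip: twice the clock rate, twice the issue width, two instruction streams.
future = today * 2 * 2 * 2

# The same requirement expressed as a transfer: 32 bytes every 2 ns.
as_transfer = bandwidth_gbytes_per_s(32, 2.0)

print(f"today: {today:.0f} GBytes/s")
print(f"future: {future:.0f} GBytes/s (32 bytes every 2 ns = {as_transfer:.0f} GBytes/s)")
```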
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
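The effect of the two cache levels can be illustrated with the standard average-access-time calculation. This is a sketch under stated assumptions: the slides give ranges, so the values below (2-cycle L1, 8-cycle L2, exactly 50% L2 hit rate, 100-cycle memory) are illustrative picks from those ranges, and the function name is hypothetical.

```python
def average_access_cycles(l1_hit_rate, l1_cycles, l2_hit_rate, l2_cycles, memory_cycles):
    """Average cycles per memory access for a two-level cache hierarchy."""
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    # Every access pays the L1 latency; L1 misses also pay the L2 latency;
    # L2 misses additionally pay the main-memory latency.
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * memory_cycles)


# Illustrative values picked from the ranges quoted on the slides.
with_l2 = average_access_cycles(0.95, 2, 0.50, 8, 100)

# Same L1, but every L1 miss goes straight to main memory (no L2).
l1_only = average_access_cycles(0.95, 2, 0.0, 0, 100)

print(f"L1 + L2: {with_l2:.1f} cycles per access")
print(f"L1 only: {l1_only:.1f} cycles per access")
print("no cache: 100 cycles per access")
```

Even with a 95% first-level hit rate, the rare trips to main memory account for a large share of the average, which is why the structure of the hierarchy matters so much.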
As conclusion concerning memory - problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems: caches, buses, bandwidth and latency
As conclusion concerning memory - problems continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System-level parameters most affect performance
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM) - a new type of memory
Magnetic RAM (MRAM) or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm - just a few thousand atoms - in diameter on a silicon wafer
MRAM is nonvolatile and has the fast read-and-write performance of static RAM (SRAM)
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule (nm, 0 to 120) and density (Gb, log scale 0.1 to 100) of DRAM and FLASH for 2003, 2005 and 2010; design-rule labels 100 nm, 90 nm, 70 nm and 55 nm; density labels from 512 Mb to 16 Gb, with SLC, MLC (2 bits/cell) and MLC (3 bits/cell) FLASH variants]
Surpassing the Prediction from Moore's Law - DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year
[Figure: density (Gb, log scale 0.1 to 100) versus year, 2000 to 2010, for DRAM and FLASH (MLC 2 bits/cell and MLC 3 bits/cell) against the Moore's Law trend; FLASH follows a two-fold density increase per year]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap - continued
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
• During 1995, the energy consumption of all PCs installed in the USA was 60 x 10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 x 10^6 MWh
Power consumption
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers" (Kuroda-Sakurai '95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced"
[Figure: power (W, log scale 0.01 to 100) versus year, 1980 to 1995; power grows about x4 every 3 years]
Power dissipation in time
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage (V, 0 to 2.5), power per chip (W) and VDD current (A) versus year, 1998 to 2014]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), Electronic Industries Association of Japan (EIAJ), Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
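The power equation can be evaluated directly to see how strongly the supply voltage dominates. This is a minimal sketch: the function name and the operating point (activity 0.2, 10 nF of switched load, 500 MHz) are hypothetical values chosen only for illustration, and the short-circuit, leakage and static terms are left at zero in the example so that only the dominant switching term P1 is compared.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc=0.0, i_leak=0.0, p_static=0.0):
    """Average CMOS power in watts: P1 + P2 + P3 + P4."""
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leak * v_dd                      # leakage current power
    return p1 + p2 + p3 + p_static          # p_static plays the role of P4


# Hypothetical operating point, for illustration only.
ACTIVITY = 0.2     # p_t: switching probability per clock cycle
C_LOAD = 10e-9     # C_L: total switched load capacitance in farads
F_CLK = 500e6      # f_clk: clock frequency in hertz

full = average_power(ACTIVITY, C_LOAD, v_dd=2.5, f_clk=F_CLK)
half = average_power(ACTIVITY, C_LOAD, v_dd=1.25, f_clk=F_CLK)

print(f"Vdd = 2.50 V: {full:.2f} W")
print(f"Vdd = 1.25 V: {half:.2f} W, a reduction by a factor of {full / half:.1f}")
```

Halving the supply voltage cuts the switching term by a factor of four, while halving the activity, the load or the clock frequency would only cut it by two.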
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V, 0.6 to 5.0)]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is
decreasing the activity of some parts within the VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product pt CL is called the average switched capacitance per cycle; this capacitance can be reduced at the system, architectural, RTL, circuit or technology level
Low Power and Low Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology (figure annotation: higher impact, more options)
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
This is a less aggressive approach, which attracts more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four local clock domains, each with its own PLL and DLL (PLL 1/DLL 1 to PLL 4/DLL 4), producing local frequencies f11 to f41]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation (phase detector with CLKREF and CLKFB inputs and up/down outputs, current pump, loop filter, VCO, divider by N) with regulated voltage, clock distribution and frequency multiplier feeding the logic of the digital system]
[Figure: DLL-based clock generation (phase detector with CLKREF and CLKFB inputs and up/down outputs, current pump, loop filter, control, VCDL built from delay cells DC1 to DCn) producing f1, 2f1, ..., nf1 for the digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating: a latch combines the clock and the enable signal into a gated clock for the target flip-flops (DFF with D, C and Q), which are thereby activated or deactivated]
[Figure: clock distribution: a PLL clock generator drives Clk to blocks A, B, C and D, with enable signals Enable_A, Enable_B and Enable_C]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
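The saving from clock gating can be illustrated with a simple behavioural count of how many clock cycles actually reach a module. This is a sketch, not a circuit model: the enable trace is a made-up example, and the function name is hypothetical.

```python
def clock_cycles_delivered(enable_trace, gated):
    """Count clock cycles that reach a module over len(enable_trace) cycles.

    Without gating the module is clocked every cycle; with gating the clock
    is suppressed whenever the module's enable signal is low.
    """
    if not gated:
        return len(enable_trace)
    return sum(1 for enabled in enable_trace if enabled)


# Made-up activity pattern: the module is needed in 3 of 10 cycles.
enable_trace = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]

ungated = clock_cycles_delivered(enable_trace, gated=False)
gated = clock_cycles_delivered(enable_trace, gated=True)
saving = 1.0 - gated / ungated

print(f"clocked cycles without gating: {ungated}")
print(f"clocked cycles with gating: {gated}")
print(f"clock activity removed: {saving:.0%}")
```

Since the switching power is proportional to the switching activity, the fraction of clock cycles removed gives a first estimate of the clock-related energy saved in that module.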
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip is a less aggressive approach that is attracting attention
• Modules on the critical paths use the highest voltage level (thus meeting the required timing constraints), while modules on noncritical paths use lower voltages (thus reducing energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
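The effect of assigning lower voltages to noncritical modules can be illustrated with the standard first-order CMOS approximation that switching energy is proportional to C·Vdd². The sketch below is only an illustration: the module names, capacitances, voltages and transition count are hypothetical values chosen for the example, not data from the slides.

```python
def switching_energy(capacitance_f, vdd_v, transitions):
    """First-order dynamic switching energy in joules: E = N * C * Vdd^2."""
    return transitions * capacitance_f * vdd_v ** 2

# Hypothetical modules: (name, switched capacitance in farads, on critical path?)
MODULES = [
    ("alu", 2e-12, True),
    ("decoder", 1e-12, False),
    ("io_ctrl", 3e-12, False),
]

def chip_energy(modules, v_high, v_low, transitions=1_000_000):
    """Total energy when critical modules run at v_high and the rest at v_low."""
    total = 0.0
    for _name, cap, critical in modules:
        vdd = v_high if critical else v_low
        total += switching_energy(cap, vdd, transitions)
    return total

single_supply = chip_energy(MODULES, v_high=2.5, v_low=2.5)  # one voltage everywhere
dual_supply = chip_energy(MODULES, v_high=2.5, v_low=1.8)    # lower voltage off the critical path
saving = 1 - dual_supply / single_supply

print(f"energy saving with two supply voltages: {saving:.1%}")
```

Because energy depends on the square of the voltage, even a modest reduction on noncritical paths gives a noticeable saving, while the critical path keeps its timing.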
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
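[Figure: power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW) and SLEEP (P = 0.16 mW); transition times of ~10 µs, ~90 µs and 160 ms; IDLE waits for an interrupt, SLEEP waits for a wake-up event]
[Figure: power manager, in which an observer collects workload information and observations from the system and a controller issues commands to it]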
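A minimal sketch of how a power state machine translates into energy, using the per-state power values from the figure (the 0.16 mW sleep value assumes a decimal point lost in extraction). The workload traces are hypothetical, and transition times and transition energy are ignored for simplicity.

```python
# Power per state in milliwatts, as read from the power state machine figure
# (the SLEEP value of 0.16 mW is an assumed reading of the extracted "016mW").
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}

def energy_mj(trace):
    """Energy in millijoules for a trace of (state, seconds) pairs.

    Transition costs are ignored in this simplified sketch.
    """
    return sum(STATE_POWER_MW[state] * seconds for state, seconds in trace)

# Hypothetical workload: 1 s of useful work followed by 9 s with nothing to do.
always_on = [("RUN", 10.0)]
idle_when_free = [("RUN", 1.0), ("IDLE", 9.0)]
sleep_when_free = [("RUN", 1.0), ("SLEEP", 9.0)]

for name, trace in [("always on", always_on),
                    ("idle when free", idle_when_free),
                    ("sleep when free", sleep_when_free)]:
    print(f"{name}: {energy_mj(trace):.2f} mJ")
```

A real power manager also has to weigh the wake-up delay against the expected idle period, since a long wake-up time makes the deepest state worthwhile only for long idle intervals.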
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown: slices for clock, memory, control, I/O and datapath (extracted values: 12, 16, 25, 43). Dynamic instruction statistics: slices for compare, logical, data move, control flow, arithmetic and other operations (extracted values: 1513, 51, 23, 43)]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Microprocessors' Generations
• First Generation, 1971-78 – Behind the power curve
• Second Generation, 1979-85 – Becoming "real" computers
• Third Generation, 1985-89 – Challenging the "establishment"
• Fourth Generation, 1990- – Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation, 1971-78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
• Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
– 3,500 transistors
– First microprocessor-based computer (Micral)
  Targeted at laboratory instrumentation
  Mostly sold in Europe
Intel 8080
• 8-bit architecture with 16-bit addresses
– Delivered in 1974
– 4,800 transistors
– Performance < 0.2 MIPS
• Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
  $297, or $395 with case
  256 bytes of memory, expandable to 64K
  Keyboard and floppy
  100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
  Homebrew Computer Club
Intel 8086
• Introduced in 1978
– Performance < 0.5 MIPS
• New 16-bit architecture
– "Assembly language" compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
• In 1981, IBM introduces the PC
– Based on the 8088, an 8-bit bus version of the 8086
Second Generation, 1979-85
Becoming "real" computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
– First 32-bit architecture, initial 16-bit implementation
– First flat 32-bit address
  Support for paging
– General-purpose register architecture
  Loosely based on PDP-11
• First implementation in 1979
– 68,000 transistors
– < 1 MIPS
• Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation, 1985-89
Challenging the "establishment"
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
• Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation, 1990-
Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5 – same basic approach, but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
– True from 1985-present
• Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to high clock rates
– More transistors: architectural ideas turn transistors into performance
  Responsible for about half the yearly performance growth
• Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
– CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching
  First multilevel cache: 1986
  Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
  early restart after cache miss (1992)
  nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
  prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
  superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out-of-order
– speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
– On-chip
– Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
• First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
• Implemented in 1991
– 1.3M transistors
– 50 MIPS
• Used primarily as attached processor (e.g., graphics)
MIPS R10000
• First speculative processor
– Instructions scheduled and executed out-of-order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in-flight)
– Maintain precise state by completing instructions in order
• Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
• Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software Techniques
– Static scheduling
– Static issue (i.e., VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
  Lower hardware complexity; more longer-range analysis; more machine dependence
• Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e., superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
  More stable performance; higher complexity; potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture – ILP and Memory Systems
[Figure: "ILP Mountain" with stages simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling and speculation; "Cache Mountain" with stages simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching and multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970-2005, across the eras of bit-level, instruction-level and thread-level parallelism; data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a (single) processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4)
Support for 2-4 threads, but expect to get only 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model – continued
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
4–8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education – not just the undergraduate curriculum – organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change – so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
But, as we said earlier, engineering is changing
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future – i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global cultural and business contexts, and so must understand them. These too are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that Web-based teaching, distance learning, electronic books, and interactive learning environments will play increasingly significant roles in shaping what we teach, how we teach, and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Moore's Law in Action
The number of transistors on chip doubles annually
Moore's Law 1 – Microprocessor
Moore's Law 1 – Capacity of Single-Chip DRAM
Improving frequency via pipelining
Process technology and microarchitecture innovations enable doubling the frequency every process generation
The figure presents the contribution of both: as the process improves, the frequency increases and the average amount of work done in pipeline stages decreases
Process Complexity
Shrinking linewidths isn't free. Linewidth shrinks require process modifications to deal with a variety of issues that come up from shrinking the devices, leading to increasing complexity in the processes being used
Moore's Law 2 (Rock's Law)
In 1996 Intel augmented Moore's law (the number of transistors on a processor doubles approximately every 18 months) with Moore's law 2
Law 2 says that as the sophistication of chips increases, the cost of fabrication rises exponentially
The cost of semiconductor tools doubles every four years. By this logic, chip fabrication plants, or fabs, were supposed to cost $5 billion each by the late 1990s and $10 billion by now
Moore's Law 2 (Rock's Law) – continued
For example, in 1986 Intel manufactured the 386, which counted 250,000 transistors, in fabs costing $200 million. In 1996 a $2 billion facility was needed to produce the Pentium processor, which counted 6 million transistors
Moore's Law 2 (Rock's Law): The Cost of Semiconductor Tools Doubles Every Four Years
Machrone's Law
The PC you want to buy will always be $5000
Metcalfe's Law
A network's value grows proportionately to the number of its users squared
Wirth's Law
Software is slowing faster than hardware is accelerating
Performance and new technology generation
According to Moore's law, each new generation has approximately doubled logic circuit density and increased performance by about 40% while quadrupling memory capacity
The increase in components per chip comes from the following key factors:
The factor of two in component density comes from a 2^0.5 shrink in each lithography dimension (2^0.5 per x and 2^0.5 per y). An additional factor of 2^0.5 comes from an increase in chip area. A final factor of 2^0.5 comes from device and circuit cleverness
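As a quick check of how these factors combine into the quadrupling of memory capacity, the following sketch multiplies them out. It assumes the garbled "205" in the extracted slide stands for 2^0.5; the variable names are illustrative only.

```python
import math

# Each lithography dimension shrinks by a factor of sqrt(2) (assumed reading of "2^0.5").
shrink_per_dimension = math.sqrt(2)

density_gain = shrink_per_dimension ** 2   # x and y together: factor of 2
area_gain = math.sqrt(2)                   # larger chip area
cleverness_gain = math.sqrt(2)             # device and circuit cleverness

total_per_generation = density_gain * area_gain * cleverness_gain
print(f"components per chip grow by about {total_per_generation:.1f}x per generation")
```

The product is 2 × 2^0.5 × 2^0.5 = 4, which matches the stated quadrupling per generation.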
Development in ICs
Semiconductor Industry Association Roadmap Summary for high-end Processors
Specification / year | 1997 | 1999 | 2001 | 2003 | 2006 | 2009 | 2012
Feature size (micron) | 0.25 | 0.18 | 0.15 | 0.13 | 0.1 | 0.07 | 0.05
Supply voltage (V) | 1.8-2.5 | 1.5-1.8 | 1.2-1.5 | 1.2-1.5 | 0.9-1.2 | 0.6-0.9 | 0.5-0.6
Transistors/chip (millions) | 11 | 21 | 40 | 76 | 200 | 520 | 1400
DRAM bits/chip (mega) | 167 | 1070 | 1700 | 4290 | 17200 | 68700 | 275000
Die size (mm^2) | 300 | 340 | 385 | 430 | 520 | 620 | 750
Global clock freq (MHz) | 750 | 1200 | 1400 | 1600 | 2000 | 2500 | 3000
Local clock freq (MHz) | 750 | 1250 | 1500 | 2100 | 3500 | 6000 | 10000
Maximum power/chip (W) | 70 | 90 | 110 | 130 | 160 | 170 | 175
[Figure: total transistors/chip (millions, 0 to 1600) versus technology (micron) and year, from 0.25 (1997) to 0.05 (2012)]
Clock Frequency Versus Year for Various Representative Machines
Limits in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range
For higher frequency clocking (> 10 GHz), new ideas and new ways of designing digital systems are needed
[Figure: global and local clock frequency (MHz, 0 to 12,000) versus technology (micron) and year, from 0.25 (1997) to 0.05 (2012)]
Intel's Microprocessors Clock Frequency
Technology Trends – Example
As an illustration of just how quickly computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987 and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in about 39 min and to make the one-way trip on less than 10 liters of gasoline
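The compounding in this example can be reproduced with a few lines of code. This is only a sketch: the Earth-Moon distance of 384,400 km is an assumed mean value that the example does not state, and the code keeps unrounded growth factors, so its results come out slightly higher than the figures above, which round the factors to 20.1 and 194.6 first.

```python
def improve(value, yearly_rate, years):
    """Compound a value at a fixed yearly improvement rate."""
    return value * (1 + yearly_rate) ** years

# Starting point from the example: an average car in 1977.
speed_1977_kmh = 150.0
economy_1977_kml = 10.0

# 35% per year for 10 years, then 50% per year for 13 more years.
speed_1987 = improve(speed_1977_kmh, 0.35, 10)
economy_1987 = improve(economy_1977_kml, 0.35, 10)
speed_2000 = improve(speed_1987, 0.50, 13)
economy_2000 = improve(economy_1987, 0.50, 13)

# Assumed mean Earth-Moon distance in km (not given in the example).
EARTH_MOON_KM = 384_400
minutes_to_moon = EARTH_MOON_KM / speed_2000 * 60
liters_one_way = EARTH_MOON_KM / economy_2000

print(f"1987: {speed_1987:.0f} km/h, {economy_1987:.0f} km/l")
print(f"2000: {speed_2000:.0f} km/h, {economy_2000:.0f} km/l")
print(f"Moon: {minutes_to_moon:.1f} min, {liters_one_way:.1f} l one way")
```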
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm
Ideally, processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7
With such scaling, typical improvement figures are the following:
• 1.4 – 1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
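A first-order check of these figures, assuming the scaling factor is 0.7 (decimal point restored) and using the usual approximations that gate delay scales with the linear dimension, area with its square, and switching energy with C·V². These approximations are assumptions of this sketch, not statements from the slides.

```python
SCALE = 0.7            # linear scaling factor per process generation (assumed reading)
VOLTAGE_DROP = 1.35    # operating voltage is 1.35 times lower

speedup = 1 / SCALE                 # faster transistors, first order
area_ratio = SCALE ** 2             # area of a scaled transistor relative to before
voltage_ratio = 1 / VOLTAGE_DROP
# Capacitance scales roughly with the linear dimension, energy with C * V^2.
energy_ratio = SCALE * voltage_ratio ** 2

print(f"speedup: {speedup:.2f}x")
print(f"area: {1 / area_ratio:.2f}x smaller")
print(f"switching energy: {1 / energy_ratio:.2f}x lower")
```

Under these assumptions the speedup and area figures match the list, and the switching energy comes out roughly 2.6 times lower, which is in the neighborhood of the quoted factor of three.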
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry
It is characterized by 1000-fold growth in frequency (from 1 MHz to 1 GHz) and integration (from ~10K to ~10M devices) in 25 years
Microarchitecture attempts to increase both IPC and frequency
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction, and out-of-order execution can increase instructions per cycle (IPC)
Pipelining, as a microarchitecture idea, helps to increase frequency
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages
With frequencies higher than 1 GHz, more than 20 pipeline stages are used
[Figure: relative improvement in frequency, CPI, performance and power versus pipeline depth]
Performance of memory and CPU
Memory in a computer system is hierarchically organized
In 1980 microprocessors were often designed without caches
Nowadays microprocessors often come with two levels of caches
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed
Relative processor/memory speed
Type of Memories
[Figure: classification of MOS memories. RAMs (DRAM, SRAM) are volatile: contents are lost at power off. ROMs (ROM, EPROM, EEPROM, FLASH) are non-volatile: contents are kept at power off]
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 – 4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
An anecdote – continued
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up – 32 bytes every 2 ns
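The pin-bandwidth arithmetic in this anecdote can be checked directly. The sketch below uses only the numbers quoted above; the function name is illustrative.

```python
def bandwidth_gb_per_s(bytes_per_transfer, period_ns):
    """Bytes per nanosecond is numerically the same as GBytes/s (10^9 bytes per second)."""
    return bytes_per_transfer / period_ns

today = bandwidth_gb_per_s(16, 8)        # 16 bytes every 8 ns

# Twice the clock rate, twice the issue width, two instruction streams.
future = today * 2 * 2 * 2

print(f"today: {today} GBytes/s, future: {future} GBytes/s")
print(f"32 bytes every 2 ns: {bandwidth_gb_per_s(32, 2)} GBytes/s")
```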
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~32 – 128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
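These figures can be combined into an average memory access time (AMAT), a standard way to summarize a cache hierarchy. The sketch below picks the upper end of each quoted latency range and the 100-cycle main memory access mentioned for a 1 GHz processor; the exact operating point is an assumption made for illustration.

```python
def amat(l1_cycles, l1_hit, l2_cycles, l2_hit, mem_cycles):
    """Average memory access time in cycles for a two-level cache hierarchy."""
    # Cost of an L1 miss: L2 access time plus main memory time for L2 misses.
    l1_miss_penalty = l2_cycles + (1 - l2_hit) * mem_cycles
    return l1_cycles + (1 - l1_hit) * l1_miss_penalty

# L1: 3 cycles, 95% hits. L2: 10 cycles, catches 50% of L1 misses. Memory: 100 cycles.
with_l2 = amat(3, 0.95, 10, 0.50, 100)

# Same L1 with no second level: every L1 miss goes straight to main memory.
without_l2 = amat(3, 0.95, 0, 0.0, 100)

print(f"AMAT with L2: {with_l2:.1f} cycles, without L2: {without_l2:.1f} cycles")
```

Even a second level that catches only half of the first-level misses shortens the average access noticeably, which is why the hierarchy pays off despite its extra latency.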
Memory Hierarchy Impact on Performance
Off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
As conclusion concerning memory – problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems – caches, busses, bandwidth, and latency
As conclusion concerning memory – problems continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM) – a new type of memory
Magnetic RAM – MRAM, or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm – just a few thousand atoms – in diameter, on a silicon wafer
MRAM is nonvolatile; it has the fast read-and-write performance of static RAM (SRAM)
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM)
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] (100 nm, 90 nm, 70 nm, 55 nm) and density [Gb] (log scale, 0.1 to 100) of FLASH and DRAM versus year, 2003 to 2010; FLASH points are marked SLC, MLC (2 bits/cell) and MLC (3 bits/cell)]
Surpassing the Prediction from Moore's Law - DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year
[Figure: density [Gb] (log scale, 0.1 to 100) of FLASH and DRAM versus year, 2000 to 2010, compared with Moore's Law; FLASH shows a two-fold density increase per year, with MLC (2 bits/cell) and MLC (3 bits/cell) points]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will keep leading overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap (continued)
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
Power consumption
• During 1995 the energy consumption of all PC machines installed in the USA was 60 × 10^6 MWh
• During 2000 the energy consumption of all PC machines installed in the USA was 10% of the total energy production
• For 2015 one expects that the energy consumption of all PC machines will be 15% greater than in 1995, or 69 × 10^6 MWh
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers" (Kuroda-Sakurai 95)
"By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced"
Power dissipation in time
[Figure: power (W), log scale from 0.01 to 100, versus year, 1980 to 1995; the trend line is labeled x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage [V] (0 to 2.5), power per chip [W] (0 to 200) and VDD current [A] (0 to 500) versus year, 1998 to 2014; curves labeled Voltage, Power and Current]
International Technology Roadmap for Semiconductors 1999 update sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA) Electronic Industries Association of Japan (EIAJ) Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
Your car starter
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 - capacitive switching power (dynamic, dominant)
P2 - short circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
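The quadratic effect of the supply voltage can be checked numerically. The sketch below evaluates the switching-power term P_sw = p_t C_L V_dd^2 f_CLK at two supply voltages. The activity factor, capacitance, clock frequency and the two voltage values are hypothetical numbers chosen only for illustration; they do not come from the slides.

```python
def switching_power(p_t, c_load, v_dd, f_clk):
    """Capacitive switching power P_sw = p_t * C_L * V_dd**2 * f_CLK, in watts."""
    return p_t * c_load * v_dd ** 2 * f_clk


# Hypothetical operating point (assumed values, for illustration only).
P_T = 0.1        # switching activity factor
C_LOAD = 1e-9    # total switched capacitance in farads
F_CLK = 100e6    # clock frequency in hertz

p_high = switching_power(P_T, C_LOAD, 5.0, F_CLK)   # assumed 5.0 V supply
p_low = switching_power(P_T, C_LOAD, 3.3, F_CLK)    # assumed 3.3 V supply

# The ratio depends only on the voltages: (3.3 / 5.0) ** 2.
ratio = p_low / p_high
print(f"5.0 V: {p_high:.4f} W, 3.3 V: {p_low:.4f} W, ratio: {ratio:.4f}")
```

Lowering the supply voltage by about a third cuts the switching power to well under half, while the same relative change in C_L, p_t or f_CLK would only give a linear reduction.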
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], with supply-voltage ticks at 0.6, 3.0 and 5.0]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is
Decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low-active conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-active components
General Approaches to Reduce Power
a) Reduction in f_CLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in V_dd is the most effective way to reduce power, since the power is proportional to the square of V_dd. The problem with reducing V_dd is that it leads to an increase in circuit delay
c) The product p_t C_L is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit or technology level
Low Power and Low Energy System Design
[Figure: design levels ordered from higher impact and more options (top) to lower (bottom)]
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology
Multiple Frequency on the Chip as Technique to Reduce Power
Multiple frequency on the chip is a less aggressive approach which attracts more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL generates fCLK and feeds four clock domains; domain i has its own PLL i and DLL i, with output frequencies labeled f11 f12, f21 f21, f31 f32 and f41 f41]
Energy Minimization Using Multiple Frequency
[Figure, PLL based: phase detector (inputs CLKREF and CLKFB, outputs up and down), current pump, loop filter, VCO and divider by N; regulated voltage; clock distribution & frequency multiplier driving the logic of the digital system]
[Figure, DLL based: phase detector (inputs CLKREF and CLKFB, outputs up and down), current pump, loop filter and control driving a VCDL with delay TVCDL built from delay cells DC1, DC2, ... DCn between in and out; outputs f1, 2f1, ... nf1 feed the digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure, clock gating: enable and clock pass through a latch to form the gated clock that drives the target flip-flops (DFF with D, C and Q); waveforms show clock, enable and gated-clock with activated and deactivated intervals]
[Figure, clock distribution: a PLL (Clk generator) distributes Clk to blocks A, B, C and D, gated by Enable_A, Enable_B and Enable_C]
The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
[Figure, power state machine: states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt) and SLEEP (P = 0.16 mW, wait for wake-up event); transitions are labeled ~10 µs, ~90 µs and 160 ms]
[Figure, power manager: an OBSERVER collects workload information (observations) from the SYSTEM and a CONTROLLER issues commands to it]
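The benefit of such a power state machine can be sketched with a time-weighted average. The example below uses the three state powers shown in the figure, reading the label "P=016mW" as 0.16 mW (an assumption about a lost decimal point). The time fractions are hypothetical, and transition time and energy are deliberately ignored.

```python
# Power per state in milliwatts, as read from the figure labels
# (0.16 mW for SLEEP is an assumed reading of "P=016mW").
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(time_fractions):
    """Time-weighted average power; state-transition costs are ignored."""
    total = sum(time_fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(STATE_POWER_MW[state] * frac
               for state, frac in time_fractions.items())


# Hypothetical workload: busy 10% of the time, idle 20%, asleep 70%.
always_on = average_power_mw({"RUN": 1.0})
managed = average_power_mw({"RUN": 0.1, "IDLE": 0.2, "SLEEP": 0.7})
print(f"always running: {always_on:.3f} mW, managed: {managed:.3f} mW")
```

A real power manager also has to weigh the wake-up latency of each state against the expected idle period, which this sketch leaves out.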
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure, power breakdown: pie chart with slices for Clock, Memory, Control, I/O and Datapath]
[Figure, dynamic instruction statistics: pie chart with slices for Arithmetic op, Logical op, Compare op, Data Move, Control Flow and Others]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline - Microprocessors' Generations
• First Generation 1971-78
  - Behind the power curve
• Second Generation 1979-85
  - Becoming "real" computers
• Third Generation 1985-89
  - Challenging the "establishment"
• Fourth Generation 1990-
  - Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation 1971-78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3500 transistors
  - First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• 8-bit processor with 16-bit addressing
  - Delivered in 1974
  - 4800 transistors
  - Performance < 0.2 MIPS
• Used in Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
    Homebrew Computer Club
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
  - Based on the 8088, an 8-bit bus version of the 8086
Second Generation 1979-85
Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture
    Initial 16-bit implementation
  - First flat 32-bit address
    Support for paging
  - General-purpose register architecture
    Loosely based on PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance and rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation 1990-
Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985 to the present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors
    Architectural ideas turn transistors into performance
  - Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
  - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
    Early restart after cache miss (1992)
    Nonblocking caches continue during a cache miss (1994)
  - Cache-aware combos: computers, compilers, code writers
    Prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock
    Superscalar: uses dynamic issue decision (HW driven)
    VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  - determine parallelism dynamically
  - execute instructions out-of-order
  - speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  - On-chip
  - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out-of-order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in-flight)
  - Maintains precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Use compiler-centric approach while avoiding disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software Techniques
  - Static scheduling
  - Static issue (i.e. VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
Hardware Techniques
  - Dynamic scheduling
  - Dynamic issue (i.e. superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
Wide variety of approaches, both hardware and compiler intensive
Software techniques: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware techniques: more stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present
Big Picture - ILP and Memory Systems
[Figure, "ILP Mountain": simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation]
[Figure, "Cache Mountain": simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors (log scale, 1000 to 100,000,000) versus year, 1970 to 2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000; eras marked as bit-level parallelism, instruction-level and thread-level (?)]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
Support for 2-4 threads, but expect to get only a 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
4 - 8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that where you stand depends on where you sit
In this context, starting from our positions and experiences, this is our view concerning the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT
Discrete mathematics, not continuous math, is the underpinning of IT
It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast
Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields
More knowledge of the full spectrum will be fundamental
Engineering is global and is performed in a holistic business context
The engineer must design under constraints that include global, cultural and business contexts, and so must understand them
These too are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Moore's Law 1 - Microprocessor
Moore's Law 1 - Capacity of Single-Chip DRAM
Improving frequency via pipelining
Process technology and microarchitecture innovations enable a doubling of the frequency every process generation
The figure presents the contribution of both: as the process improves, the frequency increases and the average amount of work done in each pipeline stage decreases
Process Complexity
Shrinking linewidths isn't free. Linewidth shrinks require process modifications to deal with a variety of issues that come up from shrinking the devices, leading to increasing complexity in the processes being used
Moore's Law 2 (Rock's Law)
In 1996 Intel augmented Moore's law (the number of transistors on a processor doubles approximately every 18 months) with Moore's law 2
Law 2 says that as the sophistication of chips increases, the cost of fabrication rises exponentially
The cost of semiconductor tools doubles every four years. By this logic, chip fabrication plants, or fabs, were supposed to cost $5 billion each by the late 1990s and $10 billion by now
Moore's Law 2 (Rock's Law) (continued)
For example, in 1986 Intel manufactured the 386, which counted 250,000 transistors, in fabs costing $200 million. In 1996 a $2 billion facility was needed to produce the Pentium processor, which counted 6 million transistors
Moore's Law 2 (Rock's Law)
The Cost of Semiconductor Tools Doubles Every Four Years
Machrone's Law
The PC you want to buy will always be $5000
Metcalfe's Law
A network's value grows proportionately to the number of its users squared
Wirth's Law
Software is slowing faster than hardware is accelerating
Performance and new technology generation
According to Moore's law, each new generation has approximately doubled logic circuit density and increased performance by about 40% while quadrupling memory capacity
The increase in components per chip comes from the following key factors
The factor of two in component density comes from a 2^0.5 shrink in each lithography dimension (2^0.5 per x and 2^0.5 per y)
An additional factor of 2^0.5 comes from an increase in chip area
A final factor of 2^0.5 comes from device and circuit cleverness
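Multiplying the factors listed above shows how a doubling of density and a quadrupling of components per chip fit together. Reading each "205" in the slide as $2^{0.5}$ is an assumption about a lost exponent.

```latex
\begin{align*}
\text{density gain} &= \underbrace{2^{0.5}}_{x\ \text{shrink}} \times \underbrace{2^{0.5}}_{y\ \text{shrink}} = 2 \\
\text{components per chip} &= 2 \times \underbrace{2^{0.5}}_{\text{chip area}} \times \underbrace{2^{0.5}}_{\text{cleverness}} = 4
\end{align*}
```

The second line matches the quadrupling of memory capacity per generation quoted above.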
Development in ICs
Semiconductor Industry Association Roadmap Summary for high-end Processors
Specification / year | 1997 | 1999 | 2001 | 2003 | 2006 | 2009 | 2012
Feature size (micron) | 0.25 | 0.18 | 0.15 | 0.13 | 0.10 | 0.07 | 0.05
Supply voltage (V) | 1.8-2.5 | 1.5-1.8 | 1.2-1.5 | 1.2-1.5 | 0.9-1.2 | 0.6-0.9 | 0.5-0.6
Transistors/chip (millions) | 11 | 21 | 40 | 76 | 200 | 520 | 1400
DRAM bits/chip (mega) | 167 | 1070 | 1700 | 4290 | 17200 | 68700 | 275000
Die size (mm2) | 300 | 340 | 385 | 430 | 520 | 620 | 750
Global clock freq (MHz) | 750 | 1200 | 1400 | 1600 | 2000 | 2500 | 3000
Local clock freq (MHz) | 750 | 1250 | 1500 | 2100 | 3500 | 6000 | 10000
Maximum power/chip (W) | 70 | 90 | 110 | 130 | 160 | 170 | 175
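The roadmap rows can be turned into an implied doubling time. The sketch below uses only the 1997 and 2012 endpoints of the transistors/chip and local clock frequency rows, and assumes steady exponential growth between them, which the intermediate columns only roughly follow.

```python
import math


def doubling_time_years(start_value, end_value, years):
    """Doubling time implied by exponential growth between two data points."""
    return years / math.log2(end_value / start_value)


SPAN_YEARS = 2012 - 1997

# Endpoints taken from the roadmap table (millions of transistors, MHz).
transistor_doubling = doubling_time_years(11, 1400, SPAN_YEARS)
local_clock_doubling = doubling_time_years(750, 10000, SPAN_YEARS)

print(f"transistors/chip double every {transistor_doubling:.2f} years")
print(f"local clock doubles every {local_clock_doubling:.2f} years")
```

Under this assumption the transistor count doubles roughly every two years, while the local clock frequency doubles only about every four.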
[Figure: total transistors/chip. Number of transistors (millions), 0 to 1600, versus technology (micron)/year, from 0.25 micron (1997) to 0.05 micron (2012)]
Clock Frequency Versus Year for Various Representative Machines
Limiting in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range
For higher frequency clocking (gt10GHz) new ideas and new ways of designing digital systems are needed
[Figure: global clock frequency (MHz) and local clock frequency (MHz), 0 to 12000, versus technology (micron)/year, from 0.25 micron (1997) to 0.05 micron (2012)]
Intel's Microprocessors Clock Frequency
Technology Trends - Example
As an illustration of just how quickly computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987 and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
SolutionSolutionIn 1987The span 1977 to 1987 is 10 years so both traits would have improved by factor of (135)10 = 201 giving a top speed of 3015 kmh and fuel economy of 201 kml
In 2000 Thirteen more years elapse this time at a 50 per year improvement rate for a total factor of (15)13 = 1946 over the 1987 values This gives a top speed of 586 719 kmh and fuel economy of 39 1146 kml This is fast enough to cover the distance from the earth to the moon in under 39 min and to make round trip on less than10 liters of gasoline
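The arithmetic in this solution can be checked with a short script. This is a minimal sketch in Python; the names (`compound_factor`, `speed_1977`, and so on) are illustrative and not part of the original slides.

```python
# Compound-growth check for the car analogy (all names are illustrative).

def compound_factor(rate_per_year: float, years: int) -> float:
    """Total improvement factor after `years` of growth at `rate_per_year`."""
    return (1.0 + rate_per_year) ** years

speed_1977 = 150.0    # km/h, starting top speed from the example
economy_1977 = 10.0   # km/l, starting fuel economy from the example

factor_1987 = compound_factor(0.35, 10)   # 35% per year, 1977 to 1987
factor_2000 = compound_factor(0.50, 13)   # 50% per year, 1987 to 2000

speed_1987 = speed_1977 * factor_1987
economy_1987 = economy_1977 * factor_1987
speed_2000 = speed_1987 * factor_2000
economy_2000 = economy_1987 * factor_2000

print(f"1987: factor {factor_1987:.1f}, {speed_1987:.0f} km/h, {economy_1987:.0f} km/l")
print(f"2000: factor {factor_2000:.1f}, {speed_2000:.0f} km/h, {economy_2000:.0f} km/l")
```

The script keeps full precision, so its results differ slightly from the slide values, which round the 1987 factor to 20.1 before multiplying.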
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a “roadmap” based on the idea of Moore's law.
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available.
[Figure: Trends in feature size over time]
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm.
Ideally, processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7.
With such scaling, typical improvement figures are the following:
• 1.4–1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
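A short calculation shows where the "two times smaller" figure comes from: scaling every linear dimension by about 0.7 scales area by about 0.49. The sketch below is illustrative only. The 0.25 micron starting node and the names `SCALE` and `feature_sizes` are assumptions chosen for the example, and real roadmap nodes are rounded values rather than exact multiples of 0.7.

```python
# Ideal 0.7x linear scaling per technology generation (illustrative sketch).

SCALE = 0.7  # linear shrink factor per generation, as stated on the slide

def feature_sizes(start_um: float, generations: int) -> list:
    """Feature sizes (micron) for `generations` successive ideal shrinks."""
    sizes = [start_um]
    for _ in range(generations):
        sizes.append(sizes[-1] * SCALE)
    return sizes

area_ratio = SCALE ** 2          # transistor area shrinks with the square
sizes = feature_sizes(0.25, 4)   # hypothetical starting node of 0.25 micron

print(f"area ratio per generation: {area_ratio:.2f}")
print("ideal nodes (micron):", [round(s, 3) for s in sizes])
```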
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry.
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and integration (from ~10K to 1M devices) in 25 years.
Microarchitecture attempts to increase both IPC and frequency.
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction and out-of-order execution can increase instructions per cycle (IPC).
Pipelining, as a microarchitecture idea, helps to increase frequency.
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program.
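These three levers (dynamic instruction count, IPC and frequency) combine in the classic processor performance equation: execution time = instructions × CPI / frequency, where CPI = 1/IPC. The sketch below assumes Python, and the workload numbers are hypothetical, chosen only to show how a deeper pipeline can win on frequency while losing a little on CPI.

```python
# Processor performance equation (workload numbers are hypothetical).

def execution_time(instructions: float, cpi: float, freq_hz: float) -> float:
    """Execution time in seconds: instruction count x cycles/instr / clock rate."""
    return instructions * cpi / freq_hz

# Baseline design: 1e9 dynamic instructions, CPI of 2.0, 1 GHz clock.
baseline = execution_time(1e9, 2.0, 1.0e9)

# Deeper pipeline: 50% higher frequency, slightly worse CPI (more stall cycles).
deeper_pipeline = execution_time(1e9, 2.2, 1.5e9)

print(f"baseline: {baseline:.3f} s, deeper pipeline: {deeper_pipeline:.3f} s")
```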
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages.
With frequencies higher than 1 GHz, more than 20 pipeline stages are used.
[Figure: Relative improvement in frequency, CPI, performance and power versus pipeline depth]
Performance of memory and CPU
Memory in a computer system is hierarchically organized.
In 1980, microprocessors were often designed without caches.
Nowadays microprocessors often come with two levels of caches.
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987.
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed.
[Figure: Relative processor/memory speed]
Types of Memories
MOS memories
• RAMs (volatile: contents are lost when power is off): DRAM, SRAM
• ROMs (non-volatile: contents are kept when power is off): ROM, EPROM, EEPROM, FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2–4.5 CPU cycles per instruction retired.
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed.
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse.
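The "three out of every four cycles" reading follows directly from the measured CPI. The sketch below is a rough check in Python under one stated assumption: that a cycle which retires anything retires exactly one instruction. Both processors can retire more than one instruction per cycle, so the real idle fraction would be at least this large. The function name `idle_cycle_fraction` is illustrative.

```python
# Fraction of cycles retiring zero instructions, assuming one retire per busy cycle.

def idle_cycle_fraction(cpi: float) -> float:
    """Share of cycles with no retired instruction for a given CPI."""
    return 1.0 - 1.0 / cpi

low = idle_cycle_fraction(4.2)    # lower end of the measured CPI range
high = idle_cycle_fraction(4.5)   # upper end of the measured CPI range

print(f"idle cycles: {low:.1%} to {high:.1%}")
```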
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width and two instruction streams.
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy.
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns.
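The bandwidth figures in this anecdote can be reproduced with a few lines of arithmetic. This is a minimal sketch in Python; the function name `bandwidth_gb_per_s` is illustrative, and it uses the fact that one byte per nanosecond equals one GByte/s.

```python
# Pin-bandwidth arithmetic from the anecdote (names are illustrative).

def bandwidth_gb_per_s(bytes_per_transfer: float, period_ns: float) -> float:
    """Bandwidth in GBytes/s; 1 byte per ns is 1 GByte/s."""
    return bytes_per_transfer / period_ns

today = bandwidth_gb_per_s(16, 8)   # 16 bytes every 8 ns

# Twice the clock rate, twice the issue width, two instruction streams.
future = today * 2 * 2 * 2

# The same requirement expressed as 32 bytes every 2 ns.
future_as_transfers = bandwidth_gb_per_s(32, 2)

print(f"today: {today} GB/s, future: {future} GB/s")
```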
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance.
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components.
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data.
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory.
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip.
The first level is ~32–128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses.
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles.
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
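The effect of a two-level hierarchy can be estimated with the usual average memory access time (AMAT) model. The sketch below assumes Python and picks mid-range values from the figures quoted in the slides (L1: 3 cycles and 95% hit rate; L2: 8 cycles and 50% of L1 misses caught; main memory: 100 cycles). Adding the latencies of successive levels is a modeling assumption, and the function name `amat_cycles` is illustrative.

```python
# Average memory access time for a two-level cache (simple additive model).

def amat_cycles(l1_time, l1_miss_rate, l2_time, l2_miss_rate, mem_time):
    """AMAT = L1 time + L1 miss rate x (L2 time + L2 miss rate x memory time)."""
    return l1_time + l1_miss_rate * (l2_time + l2_miss_rate * mem_time)

# Two-level hierarchy with mid-range values from the slides.
with_l2 = amat_cycles(3, 0.05, 8, 0.5, 100)

# Same L1, but every L1 miss goes straight to main memory.
without_l2 = 3 + 0.05 * 100

print(f"AMAT with L2: {with_l2:.1f} cycles, without L2: {without_l2:.1f} cycles")
```

Under these assumptions the second-level cache cuts the average access time from 8 to about 5.9 cycles, even though it catches only half of the first-level misses.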
As conclusion concerning memory - problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems: caches, busses, bandwidth and latency.
As conclusion concerning memory - problems, continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture-level and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change.
b) Burst width, which refers to data access granularity, can effect a 15% performance change.
c) Magnetic RAM (MRAM), a new type of memory.
Magnetic RAM – MRAM or Nanotech RAM (NRAM)
Based on nanoscale semiconductor technology.
A nanotechnology RAM device consists of tiny carbon nanotubes.
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter on a silicon wafer.
MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM).
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: Design rule (nm) and density (Gb, log scale) of DRAM and FLASH for 2003, 2005 and 2010; design rules of 100 nm, 90 nm, 70 nm and 55 nm; densities up to 16 Gb, with SLC and MLC (2 and 3 bits/cell) FLASH variants]
Surpassing the Prediction from Moore's Law – DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
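The gap between the two doubling periods compounds quickly. This is a small sketch in Python; the six-year horizon and the name `growth_factor` are assumptions chosen for illustration.

```python
# Density growth under two doubling periods (horizon chosen for illustration).

def growth_factor(years: float, doubling_period_years: float) -> float:
    """Total density growth after `years` when density doubles every period."""
    return 2.0 ** (years / doubling_period_years)

years = 6
moore = growth_factor(years, 1.5)   # doubling every 1.5 years
flash = growth_factor(years, 1.0)   # doubling every year

print(f"after {years} years: x{moore:.0f} (1.5-year doubling) vs x{flash:.0f} (yearly doubling)")
```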
[Figure: Density (Gb, log scale) of FLASH and DRAM versus year, 2000 to 2010, compared with the Moore's Law trend; FLASH with MLC (2 and 3 bits/cell) follows a twofold density increase per year]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore's Law.
Overall memory prediction roadmap – cont.
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point?
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh.
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production.
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh.
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
“CMOS circuits dissipate little power by nature. So believed circuit designers.” (Kuroda-Sakurai '95)
“By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced.”
Power dissipation in time
[Figure: Power (W, log scale from 0.01 to 100) versus year, 1980 to 1995; growth of about ×4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: Supply voltage (V), power per chip (W) and VDD current (A) versus year, 1998 to 2014]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA) and the Taiwan Semiconductor Industry Association (TSIA). (Taken from Sakurai's ISSCC 2001 presentation.)
Power Delivery Problem (not just California)
[Figure label: Your car starter]
Power Consumption – New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 – capacitive switching power (dynamic, dominant)
P2 – short-circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
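The terms of this power equation can be evaluated numerically to see why the switching term dominates and why supply voltage matters so much. The sketch below assumes Python; every component value (switching probability, switched capacitance, currents) is hypothetical and chosen only for illustration, and the function name `power_components` is not from the slides.

```python
# CMOS power components: switching, short-circuit, leakage (values hypothetical).

def power_components(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p_static=0.0):
    """Return (P1, P2, P3, P4, total) in watts for the given operating point."""
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leak * v_dd                      # leakage power
    return p1, p2, p3, p_static, p1 + p2 + p3 + p_static

# Hypothetical chip: 20% switching probability, 1 nF switched capacitance,
# 1.5 V supply, 1 GHz clock, 20 mA short-circuit and 10 mA leakage current.
p1, p2, p3, p4, total = power_components(0.2, 1e-9, 1.5, 1e9, 0.02, 0.01)

# Halving the supply voltage cuts the switching term to one quarter.
p1_half, *_ = power_components(0.2, 1e-9, 0.75, 1e9, 0.02, 0.01)

print(f"P1={p1:.3f} W, P2={p2:.3f} W, P3={p3:.3f} W, total={total:.3f} W")
print(f"P1 at half the supply voltage: {p1_half:.4f} W")
```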
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement.
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed.
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: Normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V)]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is decreasing the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product ptCL is called the average switched capacitance per cycle. This capacitance can be reduced at the system, architectural, RTL, circuit or technology level.
Low Power and Low Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology. Higher levels offer higher impact and more options.
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach, which attracts more attention.
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: A central PLL generates fCLK, which feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4) that produce the local clock frequencies]
Energy Minimization Using Multiple Frequency
[Figure: PLL based clock generation: phase detector (up/down outputs), current pump, loop filter, VCO and divider by N, with CLKREF and CLKFB inputs; regulated voltage, clock distribution & frequency multiplier and logic of the digital system]
[Figure: DLL based clock generation: phase detector (up/down outputs), current pump, loop filter, control and a voltage-controlled delay line (VCDL) with delay cells DC1, DC2, ..., DCn, producing f1, 2f1, ..., nf1 for the digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: Clock gating: a latch holds the enable signal, which is combined with the clock to produce the gated clock for the target flip-flops (DFF with D, C and Q); the gated clock toggles only while enable is activated]
[Figure: Clock distribution: a PLL (clock generator) drives Clk to blocks A, B, C and D, gated by Enable_A, Enable_B and Enable_C]
The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
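The saving from clock gating can be illustrated with a cycle-level count of the clock edges that actually reach a module. The sketch below is a behavioral model in Python, not a circuit description: it assumes one rising edge per cycle, ignores the latch that real designs use to keep the gated clock glitch-free, and uses a hypothetical enable trace. The function name `delivered_clock_edges` is illustrative.

```python
# Behavioral model of clock gating: count clock edges reaching a module.

def delivered_clock_edges(enable_trace, gated: bool) -> int:
    """Number of cycles in which the module's flip-flops see a clock edge."""
    edges = 0
    for enable in enable_trace:
        # Without gating every cycle clocks the module; with gating only
        # cycles in which the enable signal is active do.
        if not gated or enable:
            edges += 1
    return edges

# Hypothetical enable trace: the module is needed in 3 of 10 cycles.
enable_trace = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]

ungated = delivered_clock_edges(enable_trace, gated=False)
gated = delivered_clock_edges(enable_trace, gated=True)
saving = 1.0 - gated / ungated

print(f"edges without gating: {ungated}, with gating: {gated}, saving: {saving:.0%}")
```

Since clock-related switching power is proportional to the number of delivered edges, the fraction of suppressed edges is a first-order estimate of the clock power saved in that module.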
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip, as a less aggressive approach, is attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure: Power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt) and SLEEP (P = 0.16 mW, wait for wake-up event); transition times of ~10 µs, ~90 µs and 160 ms]
[Figure: Power manager: an OBSERVER collects workload information and observations from the SYSTEM, and a CONTROLLER issues commands to it]
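The benefit of such a power state machine can be estimated from the state powers shown in the figure (RUN 400 mW, IDLE 50 mW, SLEEP 0.16 mW). The sketch below assumes Python; the time fractions are hypothetical, and the model ignores the time and energy cost of state transitions, so it gives an optimistic estimate. The names `STATE_POWER_MW` and `average_power_mw` are illustrative.

```python
# Average power under dynamic power management (transition costs ignored).

STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}

def average_power_mw(time_fraction: dict) -> float:
    """Time-weighted average power for a given share of time in each state."""
    if abs(sum(time_fraction.values()) - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(STATE_POWER_MW[state] * share
               for state, share in time_fraction.items())

# No power management: the system stays in RUN all the time.
always_run = average_power_mw({"RUN": 1.0})

# Hypothetical workload: busy 20% of the time, idle 30%, asleep 50%.
managed = average_power_mw({"RUN": 0.2, "IDLE": 0.3, "SLEEP": 0.5})

print(f"always RUN: {always_run:.2f} mW, managed: {managed:.2f} mW")
```

In practice the power manager must also weigh the wake-up latency of each state, which is why the long SLEEP-to-RUN transition limits how often the deepest state can be used.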
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: Two pie charts. Power breakdown: clock, memory, control, I/O and datapath. Dynamic instruction statistics: arithmetic, control flow, data move, logical, compare and other operations]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline – Microprocessors' Generations
• First Generation 1971-78
– Behind the power curve
• Second Generation 1979-85
– Becoming “real” computers
• Third Generation 1985-89
– Challenging the “establishment”
• Fourth Generation 1990-
– Architectural and performance leadership
The microprocessor today
When we say “microprocessor” today, we generally mean the shaded area of the figure.
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
• Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
– 3500 transistors
– First microprocessor-based computer (Micral)
   Targeted at laboratory instrumentation
   Mostly sold in Europe
Intel 8080
• Intel's first architecture with 16-bit addressing (8-bit data)
– Delivered in 1974
– 4800 transistors
– Performance < 0.2 MIPS
• Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
   $297, or $395 with case
   256 bytes of memory, expandable to 64K
   Keyboard and floppy
   100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one (Homebrew Computer Club)
Intel 8086
• Introduced in 1978
– Performance < 0.5 MIPS
• New 16-bit architecture
– “Assembly language” compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
– Based on the 8088, the 8-bit bus version of the 8086
Second Generation 1979-85
• Becoming “real” computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
– First 32-bit architecture (initial 16-bit implementation)
– First flat 32-bit address; support for paging
– General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
– 68,000 transistors
– < 1 MIPS
• Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the “establishment”
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
• Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
– True from 1985 to present
• Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to high clock rates
– More transistors: architectural ideas turn transistors into performance
– Responsible for about half the yearly performance growth
• Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
– CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches: add another level of caching
   First multilevel cache: 1986
   Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
   early restart after cache miss (1992)
   nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
   prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
   superscalar: uses dynamic issue decision (HW driven)
   VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out-of-order
– speculative execution depending on branch prediction
• “Off-the-shelf” ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
– On-chip
– Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
• First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
• Implemented in 1991
– 1.3M transistors
– 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
– Instructions scheduled and executed out-of-order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in-flight)
– Maintains precise state by completing instructions in order
• Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
• Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques
– Static scheduling
– Static issue (i.e. VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
• Hardware techniques
– Dynamic scheduling
– Dynamic issue (i.e. superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
• Wide variety of approaches, both hardware and compiler intensive
– Software approaches: lower hardware complexity, more longer-range analysis, more machine dependence
– Hardware approaches: more stable performance, higher complexity, potential clock rate impact
• No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: “ILP Mountain” with stages simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling and speculation; “Cache Mountain” with stages simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching and multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: Performance (log scale, 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: Transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970 to 2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000, with eras of bit-level, instruction-level and thread-level (?) parallelism]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The more pronounced are:
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a (single) processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4).
They support 2-4 threads, but expect to get only a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
• 4–8 general-purpose processing engines on chip, used to execute independent programs
• Explicitly parallel programs (when possible)
• Speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say “microprocessor” tomorrow, we generally mean the shaded area of the figure.
Outline Outline Challenges in EducationChallenges in Education
bullChanges in curriculaChanges in curriculabullFundamentalsFundamentalsbullA sort of the challenge we should acceptA sort of the challenge we should accept
ChalengeChalengess in Education in Education
It has often been said that "Where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view on the question: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education - not just the undergraduate curriculum - organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experience is primarily in information technology (in both academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change - so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT
Discrete mathematics, not continuous mathematics, is the underpinning of IT
It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast
Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields
More knowledge of the full spectrum will be a fundamental
Engineering is global and is performed in a holistic business context
The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them
These too are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
Web-based teaching, distance learning, electronic books, and interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he had never seen a process that could not be sped up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Moore's Law 1 - Capacity: Single-Chip DRAM
Improving frequency via pipelining
Process technology and microarchitecture innovations enable the frequency to double every process generation
The figure presents the contribution of both: as the process improves, the frequency increases and the average amount of work done in each pipeline stage decreases
Process Complexity
Shrinking linewidths isn't free. Linewidth shrinks require process modifications to deal with a variety of issues that come up from shrinking the devices, leading to increasing complexity in the processes being used
Moore's Law 2 (Rock's Law)
In 1996 Intel augmented Moore's law (the number of transistors on a processor doubles approximately every 18 months) with Moore's law 2
Law 2 says that as the sophistication of chips increases, the cost of fabrication rises exponentially
The cost of semiconductor tools doubles every four years. By this logic, chip fabrication plants, or fabs, were supposed to cost $5 billion each by the late 1990s and $10 billion by now
Moore's Law 2 (Rock's Law) - continued
For example, in 1986 Intel manufactured the 386, which had 250,000 transistors, in fabs costing $200 million. In 1996 a $2 billion facility was needed to produce the Pentium processor, which had 6 million transistors
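The two fab costs quoted above can be checked against the "doubles every four years" rule. The following is a minimal sketch, assuming steady exponential growth between 1986 and 1996; the function names are illustrative and not taken from the slides.

```python
import math


def doubling_time(cost_start, cost_end, years):
    """Years needed for the cost to double, assuming steady exponential growth."""
    return years * math.log(2) / math.log(cost_end / cost_start)


def projected_cost(cost_start, years, doubling_years):
    """Cost after `years` if it doubles once every `doubling_years`."""
    return cost_start * 2 ** (years / doubling_years)


# Figures quoted in the slide: $200 million (1986) and $2 billion (1996).
implied = doubling_time(200e6, 2e9, 1996 - 1986)
projected_1996 = projected_cost(200e6, 1996 - 1986, 4.0)

print(f"implied doubling time: {implied:.2f} years")
print(f"1996 cost under a four-year doubling: ${projected_1996 / 1e9:.2f} billion")
```

Under these assumptions the quoted example implies a doubling time somewhat shorter than four years, so the two numbers on the slide should be read as illustrative rather than as an exact fit to the rule.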
Moore's Law 2 (Rock's Law)
The Cost of Semiconductor Tools Doubles Every Four Years
Machrone's Law
The PC you want to buy will always be $5000
Metcalfe's Law
A network's value grows in proportion to the square of the number of its users
Wirth's Law
Software is slowing faster than hardware is accelerating
Performance and new technology generation
According to Moore's law, each new generation has approximately doubled logic circuit density and increased performance by about 40%, while quadrupling memory capacity
The increase in components per chip comes from the following key factors
The factor of two in component density comes from a 2^0.5 shrink in each lithography dimension (2^0.5 in x and 2^0.5 in y)
An additional factor of 2^0.5 comes from an increase in chip area
A final factor of 2^0.5 comes from device and circuit cleverness
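The way these factors combine can be sketched in a few lines. This is only an illustration, reading each factor in the text as the square root of two (2^0.5); the variable names are assumptions, not terms from the slides.

```python
import math

# Per-generation factors, each read from the text as 2^0.5.
shrink_x = math.sqrt(2)    # lithography shrink along x
shrink_y = math.sqrt(2)    # lithography shrink along y
chip_area = math.sqrt(2)   # growth in chip area
cleverness = math.sqrt(2)  # device and circuit cleverness

# The two lithography factors together double the component density.
density_gain = shrink_x * shrink_y

# All four factors together give the growth in components per chip.
components_gain = density_gain * chip_area * cleverness

print(f"density gain per generation: {density_gain:.2f}")
print(f"components-per-chip gain per generation: {components_gain:.2f}")
```

The product of the four factors is four, which matches the quadrupling of memory capacity per generation mentioned above.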
Development in ICs
Semiconductor Industry Association Roadmap Summary for high-end Processors
Specification / year | 1997 | 1999 | 2001 | 2003 | 2006 | 2009 | 2012
Feature size (micron) | 0.25 | 0.18 | 0.15 | 0.13 | 0.1 | 0.07 | 0.05
Supply voltage (V) | 1.8-2.5 | 1.5-1.8 | 1.2-1.5 | 1.2-1.5 | 0.9-1.2 | 0.6-0.9 | 0.5-0.6
Transistors/chip (millions) | 11 | 21 | 40 | 76 | 200 | 520 | 1400
DRAM bits/chip (mega) | 167 | 1070 | 1700 | 4290 | 17200 | 68700 | 275000
Die size (mm^2) | 300 | 340 | 385 | 430 | 520 | 620 | 750
Global clock freq (MHz) | 750 | 1200 | 1400 | 1600 | 2000 | 2500 | 3000
Local clock freq (MHz) | 750 | 1250 | 1500 | 2100 | 3500 | 6000 | 10000
Maximum power/chip (W) | 70 | 90 | 110 | 130 | 160 | 170 | 175
Total transistors/chip
[Figure: number of transistors per chip (millions) versus technology (micron) and year, 1997-2012]
Clock Frequency Versus Year for Various Representative Machines
Limiting in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range
For higher frequency clocking (> 10 GHz), new ideas and new ways of designing digital systems are needed
[Figure: global and local clock frequency (MHz) versus technology (micron) and year, 1997-2012]
Intel's Microprocessors Clock Frequency
Technology Trends - Example
As an illustration of just how quickly computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987 and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in under 39 min and to make the round trip on less than 10 liters of gasoline
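The compound-growth arithmetic in this example can be reproduced with a short sketch. The helper name `compound` is an assumption for illustration; the rates and starting values are the ones stated in the example. Because the slide rounds the growth factors to 20.1 and 194.6, the unrounded results differ slightly from the figures in the text.

```python
def compound(value, rate, years):
    """Value after `years` of growth at a fixed annual `rate` (0.35 means 35%)."""
    return value * (1 + rate) ** years


# 1977 starting point: 150 km/h top speed, 10 km/l fuel economy.
speed_1987 = compound(150, 0.35, 10)
economy_1987 = compound(10, 0.35, 10)

# 1987 to 2000: thirteen more years at 50% per year.
speed_2000 = compound(speed_1987, 0.50, 13)
economy_2000 = compound(economy_1987, 0.50, 13)

print(f"1987: {speed_1987:,.0f} km/h, {economy_1987:,.0f} km/l")
print(f"2000: {speed_2000:,.0f} km/h, {economy_2000:,.0f} km/l")
```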
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm
Ideally, process technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7
With such scaling, typical improvement figures are the following
• 1.4-1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
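A rough sanity check of these scaling figures is sketched below. It assumes that transistor area scales with the square of the linear shrink, that speed scales inversely with the linear dimension, and that switching power follows C * V^2 * f with capacitance proportional to the linear dimension and frequency held constant. These modelling assumptions are illustrative and are not stated in the slides; the decimal values 0.7 and 1.35 are read from the text.

```python
SCALE = 0.7               # linear shrink per generation (from the text)
VOLTAGE_REDUCTION = 1.35  # factor by which the operating voltage drops

# Area scales with the square of the linear dimension.
area_ratio = SCALE ** 2
density_gain = 1 / area_ratio   # roughly "two times smaller transistors"

# Assumption: gate delay tracks the linear dimension.
speed_gain = 1 / SCALE          # close to "1.4-1.5 times faster"

# Assumption: P ~ C * V^2 * f, with C ~ SCALE and f unchanged.
power_ratio = SCALE / VOLTAGE_REDUCTION ** 2
power_reduction = 1 / power_ratio

print(f"density gain:    {density_gain:.2f}x")
print(f"speed gain:      {speed_gain:.2f}x")
print(f"power reduction: {power_reduction:.2f}x")
```

Under this simple model the power reduction comes out somewhat below the "three times" quoted above, which is in the same range but suggests the slide figure includes effects that this sketch does not capture.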
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and in integration (from ~10K to ~10M devices) in 25 years
Microarchitecture attempts to increase both IPC and frequency
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction, and out-of-order execution can increase instructions per cycle (IPC)
Pipelining, as a microarchitectural idea, helps to increase frequency
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program
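The three levers named above (IPC, frequency, and dynamic instruction count) combine in the standard execution-time relation: time = instructions / (IPC x frequency). A minimal sketch follows; the workload numbers are hypothetical and chosen only to show the effect of each lever.

```python
def execution_time(instructions, ipc, frequency_hz):
    """Execution time in seconds: instructions / (IPC * clock frequency)."""
    return instructions / (ipc * frequency_hz)


# Hypothetical baseline: 1e9 dynamic instructions, IPC of 1.0, 1 GHz clock.
base = execution_time(1e9, 1.0, 1e9)

# Each lever applied on its own to the baseline.
better_ipc = execution_time(1e9, 1.5, 1e9)         # microarchitecture raises IPC
higher_freq = execution_time(1e9, 1.0, 2e9)        # deeper pipeline doubles frequency
fewer_instr = execution_time(0.8e9, 1.0, 1e9)      # ISA/compiler cut 20% of instructions

for name, t in [("baseline", base), ("higher IPC", better_ipc),
                ("higher frequency", higher_freq), ("fewer instructions", fewer_instr)]:
    print(f"{name:20s} {t:.3f} s")
```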
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages
With frequencies higher than 1 GHz, more than 20 pipeline stages are used
[Figure: relative improvement in frequency, CPI, performance, and power versus pipeline depth]
Performance of memory and CPU
Memory in a computer system is hierarchically organized
In 1980, microprocessors were often designed without caches
Nowadays microprocessors often come with two levels of caches
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed
Relative processor/memory speed
Types of Memories
MOS memories
• RAMs (volatile: contents lost at power-off): DRAM, SRAM
• ROMs (non-volatile: contents kept at power-off): ROM, EPROM, EEPROM, FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2-4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
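The "three out of every four cycles" remark follows from the measured cycles per instruction. As a hedged sketch, reading the measurements as 4.2 to 4.5 cycles per instruction: if every cycle that retires anything retires at least one instruction, then at most 1/CPI of the cycles retire instructions, so at least 1 - 1/CPI retire none. The function name below is illustrative.

```python
def min_zero_retire_fraction(cpi):
    """Lower bound on the fraction of cycles that retire no instruction.

    Assumes each retiring cycle retires at least one instruction, so at
    most 1/CPI of all cycles can be retiring cycles.
    """
    return 1 - 1 / cpi


low = min_zero_retire_fraction(4.2)
high = min_zero_retire_fraction(4.5)

print(f"CPI 4.2: at least {low:.1%} of cycles retire nothing")
print(f"CPI 4.5: at least {high:.1%} of cycles retire nothing")
```

Both bounds are a little above 75%, consistent with the statement in the anecdote.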
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up - 32 bytes every 2 ns
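The bandwidth figures in this anecdote can be verified directly, reading them as 2 GBytes/s today and 16 GBytes/s in the future. The sketch below uses the fact that one byte per nanosecond equals one GByte/s (with 1 GByte taken as 10^9 bytes); the function name is an assumption for illustration.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer, period_ns):
    """Pin bandwidth in GBytes/s; bytes per nanosecond equals 1e9 bytes per second."""
    return bytes_per_transfer / period_ns


today = bandwidth_gbytes_per_s(16, 8)   # 16 bytes every 8 ns

# Twice the clock rate, twice the issue width, two instruction streams.
future = today * 2 * 2 * 2

# The same figure expressed as 32 bytes every 2 ns.
check = bandwidth_gbytes_per_s(32, 2)

print(f"today:  {today:.0f} GBytes/s")
print(f"future: {future:.0f} GBytes/s (32 bytes every 2 ns gives {check:.0f} GBytes/s)")
```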
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
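The benefit of the second cache level can be estimated with the usual average memory access time (AMAT) formula. The sketch below plugs in values from the ranges quoted above, read as a 95% first-level hit rate and a 50% second-level hit rate on first-level misses: 2 cycles for the first level, 8 cycles for the second (an assumed midpoint of six to ten), and 100 cycles for main memory. The function and parameter names are illustrative.

```python
def amat(l1_time, l1_hit, l2_time, l2_local_hit, mem_time):
    """Average memory access time in cycles for a two-level cache hierarchy."""
    l1_miss = 1 - l1_hit
    l2_miss = 1 - l2_local_hit
    return l1_time + l1_miss * (l2_time + l2_miss * mem_time)


# Two levels: L1 catches 95% of accesses, L2 catches 50% of L1 misses.
two_level = amat(l1_time=2, l1_hit=0.95, l2_time=8, l2_local_hit=0.5, mem_time=100)

# Single level for comparison: every L1 miss goes straight to main memory.
one_level = amat(l1_time=2, l1_hit=0.95, l2_time=0, l2_local_hit=0.0, mem_time=100)

print(f"AMAT with L1 only:   {one_level:.1f} cycles")
print(f"AMAT with L1 and L2: {two_level:.1f} cycles")
```

Even with only half of the first-level misses caught, the second level removes a sizeable part of the average access time in this sketch.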
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
As conclusion concerning memory - problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems - caches, buses, bandwidth, and latency
As conclusion concerning memory - problems, continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture-level and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System-level parameters that most affect performance
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM) - a new type of memory
Magnetic RAM - MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM)
The 10 Gbit devices consist of 10 billion carbon nanotubes that are 1 nm - just a few thousand atoms - in diameter on a silicon wafer
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule (nm) and density (Gb) of DRAM and FLASH for 2003, 2005, and 2010, including SLC and MLC (2 bits/cell and 3 bits/cell) flash]
Surpassing the Prediction from Moore's Law - DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year
[Figure: density (Gb, log scale) versus year, 2000-2010, for FLASH (two-fold density growth per year, with MLC at 2 and 3 bits/cell), DRAM, and the Moore's Law trend]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap - continued
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
Mainframe
Supercomputer
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
• During 1995 the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh
• During 2000 the energy consumption of all PCs installed in the USA was 10% of the total energy production
• During 2015 one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh
Power consumption
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
Typical Low-Power Applications
"CMOS circuits dissipate little power by nature. So believed circuit designers" (Kuroda-Sakurai 95)
"By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced"
Power dissipation in time
[Figure: power (W, log scale from 0.01 to 100) versus year, 1980-1995; trend line: x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage (V), power per chip (W), and VDD current (A) versus year, 1998-2014]
International Technology Roadmap for Semiconductors 1999 update sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA) Electronic Industries Association of Japan (EIAJ) Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
Your car starter
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$
where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
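The power components defined above can be evaluated numerically. The following is a sketch only: the function name and all example values (switching probability, load capacitance, currents) are hypothetical and chosen to show how the capacitive switching term P1 typically dominates.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p_static=0.0):
    """Return (P1, P2, P3, total) in watts for a CMOS circuit.

    P1 = p_t * C_L * Vdd^2 * f_clk  (capacitive switching power)
    P2 = I_sc * Vdd                 (short-circuit power)
    P3 = I_leakage * Vdd            (leakage power)
    p_static is the minor static term P4.
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk
    p2 = i_sc * v_dd
    p3 = i_leak * v_dd
    return p1, p2, p3, p1 + p2 + p3 + p_static


# Hypothetical chip: 20% switching probability, 1 nF total switched load,
# 2.5 V supply, 500 MHz clock, 20 mA short-circuit and 4 mA leakage current.
p1, p2, p3, total = average_power(
    p_t=0.2, c_load=1e-9, v_dd=2.5, f_clk=500e6, i_sc=0.02, i_leak=0.004
)

print(f"P1 (switching):     {p1:.3f} W")
print(f"P2 (short circuit): {p2:.3f} W")
print(f"P3 (leakage):       {p3:.3f} W")
print(f"total:              {total:.3f} W")
```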
Research Efforts in Low-Power Design
$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
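The quadratic effect of the supply voltage, compared with the linear effect of load capacitance and switching activity, can be illustrated as follows. This is a minimal sketch based on the switching-power relation P = p_t * C_L * Vdd^2 * f_clk; the voltages are example values, not figures from the slides.

```python
def relative_switching_power(v_new, v_old, c_ratio=1.0, activity_ratio=1.0):
    """Switching power after a change, relative to before (clock frequency unchanged)."""
    return (v_new / v_old) ** 2 * c_ratio * activity_ratio


# Lowering the supply from 5.0 V to 3.3 V: quadratic saving.
voltage_only = relative_switching_power(3.3, 5.0)

# Halving the load capacitance at the same voltage: linear saving.
capacitance_only = relative_switching_power(5.0, 5.0, c_ratio=0.5)

print(f"5.0 V -> 3.3 V:        {voltage_only:.2%} of the original power")
print(f"half the capacitance:  {capacitance_only:.2%} of the original power")
```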
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V), from 0.6 V to 5.0 V]
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Needs for Low-Power
Low-Power Design Techniques
The basic idea is
Decreasing the activity of some parts within the VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product ptCL is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit, or technology level
General Approaches to Reduce Power
Low Power and Low Energy System Design
(higher levels: higher impact, more options)
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology
Multiple Frequency on the Chip as Technique to Reduce Power
This is a less aggressive approach, and one that is attracting more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL distributes fCLK to four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), each generating its own local frequencies]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation (phase detector, current pump, loop filter, VCO, divider by N, regulated voltage, clock distribution and frequency multiplier) and DLL-based clock generation (phase detector, current pump, loop filter, voltage-controlled delay line producing f1, 2f1, ..., nf1)]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating and clock distribution. A latch and gate combine enable and clock into a gated clock that drives the target flip-flops (activated/deactivated); a PLL clock generator distributes Clk to modules A, B, C, and D, with gating signals Enable_A, Enable_B, and Enable_C]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
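The behaviour of a gated clock can be sketched with a small cycle-level simulation. The model below assumes a common latch-based scheme in which the enable signal is captured while the clock is low and then ANDed with the clock; this is an illustrative assumption about the gating circuit, and the signal traces are made up for the example.

```python
def gated_clock(clock, enable):
    """Simulate a latch-based clock gate over sampled clock/enable traces.

    The latch is transparent while the clock is low, so enable changes
    during the high phase cannot produce glitches on the gated clock.
    """
    latched = 0
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en          # capture enable only while the clock is low
        out.append(clk & latched)  # gate the clock with the latched enable
    return out


def rising_edges(signal):
    """Count 0 -> 1 transitions, i.e. the events that make flip-flops switch."""
    return sum(1 for a, b in zip(signal, signal[1:]) if a == 0 and b == 1)


clock = [0, 1, 0, 1, 0, 1, 0, 1]
enable = [1, 1, 0, 0, 0, 1, 1, 1]
gated = gated_clock(clock, enable)

print("gated clock:", gated)
print("clock edges seen by the module:", rising_edges(gated), "of", rising_edges(clock))
```

Every suppressed edge is a cycle in which the flip-flops of the unused module do not toggle their clock inputs, which is where the energy saving comes from.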
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
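[Figure: power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt), and SLEEP (P = 0.16 mW, wait for wake-up event); transition times of ~10 µs, ~90 µs, and 160 ms]
[Figure: power manager, in which an observer collects workload information and observations from the system, and a controller issues commands to it]
Power Manager, Power State Machine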
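A power manager has to decide when an idle period is long enough to justify entering SLEEP rather than staying in IDLE. The sketch below estimates the break-even idle time from the state powers in the figure, read as 400 mW, 50 mW, and 0.16 mW. It assumes that entering SLEEP takes about 90 µs, that waking up takes 160 ms, and that the system draws RUN power during both transitions; the assignment of transition times and the transition power are assumptions, not facts stated in the slides.

```python
P_RUN = 0.400      # W, RUN state
P_IDLE = 0.050     # W, IDLE state
P_SLEEP = 0.00016  # W, SLEEP state

T_TO_SLEEP = 90e-6  # s, assumed time to enter SLEEP
T_WAKE_UP = 160e-3  # s, assumed time to wake up from SLEEP
OVERHEAD = T_TO_SLEEP + T_WAKE_UP


def energy_idle(t_idle):
    """Energy (J) spent waiting in IDLE for t_idle seconds."""
    return P_IDLE * t_idle


def energy_sleep(t_idle):
    """Energy (J) if the same period is spent in SLEEP, valid for t_idle >= OVERHEAD.

    Assumption: RUN power is drawn during the transitions into and out of SLEEP.
    """
    return P_RUN * OVERHEAD + P_SLEEP * (t_idle - OVERHEAD)


# Idle time at which both options cost the same energy.
break_even = OVERHEAD * (P_RUN - P_SLEEP) / (P_IDLE - P_SLEEP)

print(f"break-even idle time: {break_even:.2f} s")
for t in (0.5, 10.0):
    better = "SLEEP" if energy_sleep(t) < energy_idle(t) else "IDLE"
    print(f"idle for {t:4.1f} s -> prefer {better}")
```

Under these assumptions the long wake-up time dominates, so SLEEP only pays off for idle periods longer than roughly a second.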
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown by clock, memory, control, I/O, and datapath; dynamic instruction statistics by compare, logical, data move, control flow, arithmetic, and other operations]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First Generation 1971-78: behind the power curve
• Second Generation 1979-85: becoming "real" computers
• Third Generation 1985-89: challenging the "establishment"
• Fourth Generation 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose, single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3500 transistors
  - First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• 8-bit architecture with 16-bit addressing
  - Delivered in 1974
  - 4800 transistors
  - Performance < 0.2 MIPS
• Used in the Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
    Homebrew Computer Club
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
  - Based on the 8088, an 8-bit bus version of the 8086
Second Generation 1979-85
• Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture; initial 16-bit implementation
  - First flat 32-bit address; support for paging
  - General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5 - same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985-present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors: architectural ideas turn transistors into performance
    Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction-level parallelism
Memory Hierarchies
Caches hide the latency of DRAM and increase bandwidth
- CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
- On-chip: from 128 bytes (1984) to 100K+ bytes
- Multilevel caches add another level of caching
  First multilevel cache: 1986
  Secondary cache sizes today: 128 KB to 4-16 MB
Trend 2: Advances in caching techniques
- Reduce or hide cache miss latencies
  early restart after cache miss (1992)
  nonblocking caches: continue during a cache miss (1994)
- Cache-aware combos: computers, compilers, code writers
  prefetching: an instruction to bring data into the cache early
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
- Overlapping execution in a pipeline
- Issuing multiple instructions per clock
  superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
- determine parallelism dynamically
- execute instructions out-of-order
- speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
First 64-bit architecture
Integrated caches
- On-chip
- Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
- Deep pipeline
- 1.4M transistors
- Initially 100 MHz
- > 50 MIPS
Intel i860
First multiple-issue microprocessor
- 2 instructions/clock
- Dual issue mode
- Novel push pipeline
- Novel cache bypass
Implemented in 1991
- 1.3M transistors
- 50 MIPS
Used primarily as an attached processor (e.g. graphics)
MIPS R10000
First speculative processor
- Instructions scheduled and executed out-of-order
- Up to 4 instructions can complete per clock
- Window of 32 instructions (up to 32 in-flight)
- Maintains precise state by completing instructions in order
Implemented in 1996
- 6.8M transistors
- 200 MHz
Intel IA-64 and Itanium
EPIC architecture
- Use a compiler-centric approach while avoiding disadvantages
- Parallelism demarcated by the compiler
- Many special instructions & features for exploiting ILP in the compiler
Itanium
- First implementation (2001)
- 25M transistors
- 800 MHz
- 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software Techniques
- Static scheduling
- Static issue (i.e. VLIW)
- Static branch prediction
- Alias/pointer analysis
- Static speculation
Hardware Techniques
- Dynamic scheduling
- Dynamic issue (i.e. superscalar)
- Dynamic branch prediction
- Dynamic disambiguation
- Dynamic speculation
Wide variety of approaches, both hardware and compiler intensive
Software techniques: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware techniques: more stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: the "ILP mountain" (simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation) and the "cache mountain" (simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching)]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be the key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970 to 2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000; eras marked as bit-level parallelism, instruction-level and thread-level]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
They support 2-4 threads, but the expected improvement in throughput is only about 1.3x
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique to obtain more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute
- independent programs
- explicitly parallel programs (when possible)
- speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view concerning the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education - not just the undergraduate curriculum - organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change - so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
But, as we said earlier, engineering is changing
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These too are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
- Web-based teaching
- distance learning
- electronic books and
- interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Improving frequency via pipelining
Process technology and microarchitecture innovations enable the frequency to double every process generation
The figure presents the contribution of both: as the process improves, the frequency increases and the average amount of work done in each pipeline stage decreases
Process Complexity
Shrinking linewidths isn't free. Linewidth shrinks require process modifications to deal with a variety of issues that come up from shrinking the devices, leading to increasing complexity in the processes being used
Moore's Law 2 (Rock's Law)
In 1996 Intel augmented Moore's law (the number of transistors on a processor doubles approximately every 18 months) with Moore's law 2
Law 2 says that as the sophistication of chips increases, the cost of fabrication rises exponentially
The cost of semiconductor tools doubles every four years. By this logic, chip fabrication plants, or fabs, were supposed to cost $5 billion each by the late 1990s and $10 billion by now
Moore's Law 2 (Rock's Law) - continued
For example, in 1986 Intel manufactured the 386, which counted 250,000 transistors, in fabs costing $200 million. In 1996 a $2 billion facility was needed to produce the Pentium processor, which counted 6 million transistors
Moore's Law 2 (Rock's Law): The Cost of Semiconductor Tools Doubles Every Four Years
Machrone's Law
The PC you want to buy will always be $5000
Metcalfe's Law
A network's value grows proportionately to the number of its users squared
Wirth's Law
Software is slowing faster than hardware is accelerating
Performance and new technology generation
According to Moore's law, each new generation has approximately doubled logic circuit density and increased performance by about 40% while quadrupling memory capacity
The increase in components per chip comes from the following key factors:
- A factor of two in component density comes from a 2^0.5 shrink in each lithography dimension (2^0.5 in x and 2^0.5 in y)
- An additional factor of 2^0.5 comes from an increase in chip area
- A final factor of 2^0.5 comes from device and circuit cleverness
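As a quick check, reading each of the factors above as the square root of two and assuming the three contributions are independent and simply multiply, they reproduce the quadrupling of memory capacity per generation:

```latex
\underbrace{\sqrt{2}\cdot\sqrt{2}}_{\text{lithography shrink in } x \text{ and } y}
\times \underbrace{\sqrt{2}}_{\text{chip area}}
\times \underbrace{\sqrt{2}}_{\text{device and circuit cleverness}}
= 2 \times 2 = 4
```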
Development in ICs
Semiconductor Industry Association Roadmap Summary for high-end Processors
Specification / year | 1997 | 1999 | 2001 | 2003 | 2006 | 2009 | 2012
Feature size (micron) | 0.25 | 0.18 | 0.15 | 0.13 | 0.1 | 0.07 | 0.05
Supply voltage (V) | 1.8-2.5 | 1.5-1.8 | 1.2-1.5 | 1.2-1.5 | 0.9-1.2 | 0.6-0.9 | 0.5-0.6
Transistors/chip (millions) | 11 | 21 | 40 | 76 | 200 | 520 | 1400
DRAM bits/chip (mega) | 167 | 1070 | 1700 | 4290 | 17200 | 68700 | 275000
Die size (mm2) | 300 | 340 | 385 | 430 | 520 | 620 | 750
Global clock freq (MHz) | 750 | 1200 | 1400 | 1600 | 2000 | 2500 | 3000
Local clock freq (MHz) | 750 | 1250 | 1500 | 2100 | 3500 | 6000 | 10000
Maximum power/chip (W) | 70 | 90 | 110 | 130 | 160 | 170 | 175
[Figure: total transistors per chip (millions, 0 to 1600) versus technology (micron) / year, 1997 to 2012]
Clock Frequency Versus Year for Various Representative Machines
Limits in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range
For higher-frequency clocking (> 10 GHz), new ideas and new ways of designing digital systems are needed
[Figure: global and local clock frequency (MHz, 0 to 12000) versus technology (micron) / year, 1997 to 2012]
Intel's Microprocessors Clock Frequency
Technology Trends - Example
As an illustration of just how quickly computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987 and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in under 40 min and to make the round trip on less than 20 liters of gasoline
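The arithmetic of this example can be replayed in a few lines. The sketch below is only a check of the compound-growth calculation; the Earth-Moon distance of 384,400 km is an assumed average value that does not appear on the slide, and the function and variable names are chosen here for illustration. Because the script does not round the 1987 factor to 20.1 first, its results differ slightly from the slide's figures.

```python
# Compound-growth check for the car analogy.
# MOON_DISTANCE_KM is an assumed average Earth-Moon distance (not from the slide).
MOON_DISTANCE_KM = 384_400.0


def compound(value, yearly_rate, years):
    """Grow `value` by `yearly_rate` (0.35 means 35%) for `years` years."""
    return value * (1.0 + yearly_rate) ** years


# 1977 -> 1987 at 35% per year
speed_1987 = compound(150.0, 0.35, 10)     # km/h
economy_1987 = compound(10.0, 0.35, 10)    # km/l

# 1987 -> 2000 at 50% per year
speed_2000 = compound(speed_1987, 0.50, 13)
economy_2000 = compound(economy_1987, 0.50, 13)

minutes_to_moon = MOON_DISTANCE_KM / speed_2000 * 60.0
round_trip_litres = 2.0 * MOON_DISTANCE_KM / economy_2000

print(f"1987: {speed_1987:.0f} km/h, {economy_1987:.0f} km/l")
print(f"2000: {speed_2000:.0f} km/h, {economy_2000:.0f} km/l")
print(f"Earth to Moon: {minutes_to_moon:.1f} min")
print(f"Round trip: {round_trip_litres:.1f} litres")
```

Under these assumptions the one-way trip takes a little over 39 minutes and the round trip needs a little under 20 litres.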
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm
Ideally, processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7
With such scaling, typical improvement figures are the following:
• 1.4-1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
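The improvement figures above can be roughly reproduced from the scaling factor. The sketch below reads the slide as a 0.7 linear shrink and a 1.35 times lower supply voltage, and it makes two assumptions that are not stated on the slide: load capacitance scales linearly with dimension, and switching energy per transition is C * V^2. All names are illustrative.

```python
# Ideal-scaling sketch. Assumptions (not from the slide): load capacitance
# scales linearly with dimension, and switching energy per transition is C * V^2.
DIMENSION_SCALE = 0.7        # linear shrink per process generation
VOLTAGE_REDUCTION = 1.35     # supply voltage is divided by this factor

speedup = 1.0 / DIMENSION_SCALE                            # shorter devices switch faster
area_ratio = DIMENSION_SCALE ** 2                          # new area / old area
energy_ratio = DIMENSION_SCALE / VOLTAGE_REDUCTION ** 2    # new C*V^2 / old C*V^2

print(f"transistor speedup: {speedup:.2f}x")
print(f"transistor area:    {area_ratio:.2f} of the old area")
print(f"switching energy:   {1.0 / energy_ratio:.1f}x lower")
```

Under these assumptions the speedup is about 1.43 and the area is halved, in line with the slide. The switching energy falls about 2.6 times, which is in the neighbourhood of the threefold figure quoted, though not identical to it.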
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry
It is characterized by 1000-fold growth in frequency (from 1 MHz to 1 GHz) and in integration (from ~10K to 10M devices) in 25 years
Microarchitecture attempts to increase both IPC and frequency
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction and out-of-order execution can increase instructions per cycle (IPC)
Pipelining, as a microarchitecture idea, helps to increase frequency
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program
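The three levers named above (dynamic instruction count, IPC and frequency) combine in the standard processor performance equation. This is a textbook relation rather than something stated on the slides, and the symbols are labels chosen here: N_instr is the number of dynamic instructions, CPI = 1/IPC is cycles per instruction, and f is the clock frequency.

```latex
T_{\text{exec}} = \frac{N_{\text{instr}} \times \text{CPI}}{f}
               = \frac{N_{\text{instr}}}{\text{IPC} \times f}
```

Read this way, the ISA and compiler act on N_instr, caches, branch prediction and out-of-order execution act on IPC, and pipelining and process technology act on f.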
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages
With frequencies higher than 1 GHz, more than 20 pipeline stages are used
[Figure: relative improvement in frequency, CPI, performance and power versus pipeline depth]
Performance of memory and CPU
Memory in a computer system is hierarchically organized
In 1980, microprocessors were often designed without caches
Nowadays, microprocessors often come with two levels of caches
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved by 35% per year until 1986 and by 55% per year since 1987
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed
Relative processor/memory speed
Types of Memories
MOS memories
- RAMs (volatile: contents lost at power off): DRAM, SRAM
- ROMs (non-volatile: contents kept at power off): ROM, EPROM, EEPROM, FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2-4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
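The step from the measured cycles per instruction to "three out of every four cycles" can be made explicit. The sketch below is a simple bound, not taken from the benchmark study: every cycle that retires anything retires at least one instruction, so at most 1/CPI of all cycles can be productive. The function name is illustrative.

```python
def min_idle_fraction(cpi):
    """Lower bound on the share of cycles that retire zero instructions.

    A cycle that retires anything retires at least one instruction, so at
    most 1/cpi of the cycles can be non-idle.
    """
    return 1.0 - 1.0 / cpi


for cpi in (4.2, 4.5):
    print(f"CPI {cpi}: at least {min_idle_fraction(cpi):.0%} of cycles retire nothing")
```

Both processors can retire more than one instruction in a cycle, so the real share of empty cycles is at least this large.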
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns
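The bandwidth figures in this anecdote follow from simple unit arithmetic, sketched below. One byte per nanosecond is one gigabyte per second (taking 1 GByte as 10^9 bytes, an assumption about the slide's units), and the three doublings are assumed to multiply independently, as the text says. The names are illustrative.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer, ns_per_transfer):
    """Bytes per nanosecond equal gigabytes per second (1 GByte = 10**9 bytes)."""
    return bytes_per_transfer / ns_per_transfer


today = bandwidth_gbytes_per_s(16, 8)      # 16 bytes every 8 ns
# twice the clock rate x twice the issue width x two instruction streams
future = today * 2 * 2 * 2

print(f"today:  {today:.0f} GBytes/s")
print(f"future: {future:.0f} GBytes/s")
print(f"32 bytes every 2 ns: {bandwidth_gbytes_per_s(32, 2):.0f} GBytes/s")
```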
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
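To see what these numbers mean for a typical access, the sketch below applies the usual average-memory-access-time model to them. The model and the function name are assumptions added here: each level's access time is paid by every request that reaches it, the second level is taken to catch exactly 50% of first-level misses (the slide says "over 50%"), and main memory costs about 100 cycles.

```python
def average_access_cycles(l1_cycles, l1_hit_rate, l2_cycles, l2_hit_rate, memory_cycles):
    """Average cycles per access for a two-level cache in front of main memory."""
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * memory_cycles)


# Fast and slow ends of the ranges quoted on the slide
best = average_access_cycles(2, 0.95, 6, 0.50, 100)
worst = average_access_cycles(3, 0.95, 10, 0.50, 100)
# Same first-level cache with no second level: every miss goes to main memory
no_l2 = 3 + (1.0 - 0.95) * 100

print(f"two levels: {best:.1f} to {worst:.1f} cycles per access")
print(f"first level only: {no_l2:.1f} cycles per access")
```

Under these assumptions an access costs about 5 to 6 cycles on average, against about 8 cycles without the second level.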
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
As conclusion concerning memory - problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems: caches, buses, bandwidth and latency
As conclusion concerning memory - problems (continued)
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM): a new type of memory
Magnetic RAM (MRAM) or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm - just a few thousand atoms - in diameter on a silicon wafer
MRAM is nonvolatile and has the fast read-and-write performance of static RAM (SRAM)
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM)
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule (nm, from 100 nm down to 55 nm) and density (Gb, log scale) of DRAM and FLASH (SLC and MLC with 2 or 3 bits/cell) for 2003, 2005 and 2010]
Surpassing the Prediction from Moore's Law - DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates the doubling of NAND Flash memory density every year
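The difference between the two doubling periods compounds quickly. The sketch below takes the Moore's Law doubling period as 1.5 years (18 months) and the new growth model's as 1 year, and compares them over an illustrative ten-year span. The span and the names are choices made here, not figures from the slide.

```python
def growth_factor(years, doubling_period_years):
    """Density multiplier after `years` when density doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


moore = growth_factor(10, 1.5)   # doubling every 18 months
flash = growth_factor(10, 1.0)   # doubling every year

print(f"Moore's Law over 10 years: about {moore:.0f}x")
print(f"yearly doubling over 10 years: {flash:.0f}x")
print(f"ratio: about {flash / moore:.0f}x")
```

Over a decade the yearly-doubling curve ends up roughly ten times above the 18-month curve.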
[Figure: density (Gb, log scale, 0.1 to 100) versus year, 2000 to 2010, for DRAM and FLASH (MLC with 2 or 3 bits/cell) compared with Moore's Law; FLASH shows a two-fold density increase per year]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep leading overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap - continued
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production
• During 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers" (Kuroda-Sakurai '95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced"
Power dissipation in time
[Figure: power (W, log scale from 0.01 to 100) versus year, 1980 to 1995; power grows about x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage (V), power per chip (W) and VDD current (A) versus year, 1998 to 2014]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA) and the Taiwan Semiconductor Industry Association (TSIA) (taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
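The power equation can be evaluated term by term. The sketch below reads it as P_avg = p_t C_L Vdd^2 f_clk + I_sc Vdd + I_leakage Vdd + P4. The function name, the argument names and every numeric value in the example call are hypothetical, chosen only to show the switching term dominating; none of them come from the slides.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leakage, p_static=0.0):
    """Split average CMOS power (watts) into the four terms of the equation above."""
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leakage * v_dd                   # leakage current power
    total = p1 + p2 + p3 + p_static
    return {"switching": p1, "short_circuit": p2, "leakage": p3,
            "static": p_static, "total": total}


# Hypothetical operating point: activity 0.1, 1 nF switched load, 2.5 V,
# 100 MHz, 2 mA short-circuit current, 0.1 mA leakage current.
example = average_power(p_t=0.1, c_load=1e-9, v_dd=2.5, f_clk=100e6,
                        i_sc=2e-3, i_leakage=1e-4)

for name, watts in example.items():
    print(f"{name:>13}: {watts * 1000:.2f} mW")
```

With these made-up values the switching term accounts for more than 90% of the total, which is the situation the "dominant" label describes.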
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
- supply voltage
- load capacitance
- switching activity
Reducing the supply voltage brings a quadratic improvement
Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
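The difference between a quadratic and a linear lever is easy to see numerically. The sketch below assumes the switching-power relation P = p_t C_L Vdd^2 f_clk at a fixed clock frequency. The function name and the example reductions (5.0 V to 3.0 V, and a 40% smaller load) are illustrative choices, not figures from the slides.

```python
def switching_power_ratio(voltage_scale=1.0, capacitance_scale=1.0, activity_scale=1.0):
    """New/old switching power for P = p_t * C_L * Vdd**2 * f_clk at fixed f_clk."""
    return activity_scale * capacitance_scale * voltage_scale ** 2


# Supply voltage lowered from 5.0 V to 3.0 V (scale 0.6): quadratic effect
by_voltage = switching_power_ratio(voltage_scale=3.0 / 5.0)
# Load capacitance reduced by the same 40%: linear effect
by_capacitance = switching_power_ratio(capacitance_scale=0.6)

print(f"voltage x0.6     -> power x{by_voltage:.2f}")
print(f"capacitance x0.6 -> power x{by_capacitance:.2f}")
```

The same 40% reduction leaves 36% of the original power when applied to the supply voltage, but 60% when applied to the load capacitance.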
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V), from 0.6 V to 5.0 V]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is
decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
General Approaches to Reduce Power
a) Reduction in fCLK is an option that is acceptable when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way to reduce power, since power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product pt CL is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit or technology level
Low Power and Low Energy System Design
The design of low power circuits can be tackled at different levels, from system to technology (figure annotation: higher impact, more options).
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention.
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four local clock generators (PLL 1/DLL 1 to PLL 4/DLL 4), which produce the local frequencies f11, f12, f21, f31, f32 and f41]
Energy Minimization Using Multiple Frequency
[Figure, PLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO and divider by N, with regulated voltage, clock distribution & frequency multiplier, and the logic of the digital system]
[Figure, DLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter and control of a VCDL built from delay cells DC1 to DCn (delay TVCDL), producing f1, 2f1, ..., nf1 for the digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure, clock gating: a latch samples enable and gates clock to produce the gated-clock that drives the target flip-flops (DFF with D, C, Q)]
[Figure, clock distribution: a PLL (Clk generator) distributes Clk to modules A, B, C and D through Enable_A, Enable_B and Enable_C; modules are shown as activated or deactivated]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
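As a rough behavioural illustration of clock gating, the sketch below models a latch-based gate in software and counts how many clock edges actually reach the target flip-flops. The function names and the sample waveforms are hypothetical; this is a simulation sketch under those assumptions, not a circuit description from the slides.

```python
def gated_clock_trace(clock, enable):
    """Latch-based clock gating: the enable is latched while the clock is low
    and ANDed with the clock, so the gated clock cannot change mid-pulse."""
    latched = 0
    gated = []
    for clk, en in zip(clock, enable):
        if clk == 0:  # the latch is transparent only during the low phase
            latched = en
        gated.append(clk & latched)
    return gated

def rising_edges(signal):
    """Count 0 -> 1 transitions, i.e. the edges seen by the flip-flops."""
    previous = [0] + signal[:-1]
    return sum(1 for prev, cur in zip(previous, signal) if prev == 0 and cur == 1)

clock = [0, 1, 0, 1, 0, 1, 0, 1]
enable = [1, 1, 0, 0, 0, 1, 1, 1]  # hypothetical module activity
gated = gated_clock_trace(clock, enable)
print(rising_edges(clock), rising_edges(gated))
```

In this trace only two of the four clock edges reach the module, so the clocked flip-flops switch half as often while the module is idle.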
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip, a less aggressive approach, is attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
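The energy argument for multiple supply voltages can be sketched with the switching-energy approximation E = C · Vdd². The module list, capacitances and the two supply levels below are hypothetical values chosen only to show the calculation.

```python
def switching_energy(capacitance, vdd):
    """Energy per switching event in joules: E = C * Vdd^2."""
    return capacitance * vdd ** 2

# Hypothetical modules: (name, switched capacitance in farads, on critical path?)
modules = [("alu", 2e-12, True), ("decoder", 1e-12, False), ("io", 1e-12, False)]
V_HIGH, V_LOW = 2.5, 1.8  # assumed supply levels in volts

# Everything at the high supply versus only the critical path at the high supply.
single_supply = sum(switching_energy(c, V_HIGH) for _, c, _ in modules)
dual_supply = sum(switching_energy(c, V_HIGH if critical else V_LOW)
                  for _, c, critical in modules)
saving = 1 - dual_supply / single_supply
print(f"energy saving: {saving:.1%}")
```

With these assumed numbers, moving the two noncritical modules to the lower supply saves about a quarter of the switching energy while the critical-path module keeps its timing.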
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure, power manager: an observer collects workload information and observations from the system; a controller issues commands to the system]
[Figure, power state machine: states RUN (P = 400 mW), IDLE (P = 50 mW) and SLEEP (P = 0.16 mW); RUN goes to IDLE on "wait for interrupt" and to SLEEP on "wait for wake-up event"; transition times of ~10 µs, ~90 µs and 160 ms]
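A minimal sketch of what such a power state machine buys, using the state powers shown in the figure (400 mW, 50 mW, 0.16 mW). The time fractions are a hypothetical workload, and transition costs are ignored in this simplified model.

```python
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}  # values from the figure

def average_power_mw(time_fraction):
    """Time-weighted average power; fractions must sum to 1 (transitions ignored)."""
    assert abs(sum(time_fraction.values()) - 1.0) < 1e-9
    return sum(STATE_POWER_MW[state] * frac for state, frac in time_fraction.items())

always_on = average_power_mw({"RUN": 1.0})
# Hypothetical workload: busy 20% of the time, idle 30%, asleep 50%.
managed = average_power_mw({"RUN": 0.2, "IDLE": 0.3, "SLEEP": 0.5})
print(always_on, managed)
```

A real power manager also has to weigh the wake-up latency of each state against the expected idle time, which this sketch leaves out.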
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure, power breakdown: pie chart over Clock, Memory, Control, I/O and Datapath; slice values shown: 12, 16, 25, 43]
[Figure, dynamic instruction statistics: pie chart over Compare op, Logical op, Data Move, Control Flow, Arithmetic op and Others]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First generation: 1971-78
  - Behind the power curve
• Second generation: 1979-85
  - Becoming "real" computers
• Third generation: 1985-89
  - Challenging the "establishment"
• Fourth generation: 1990-
  - Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation: 1971-78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3,500 transistors
  - First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
  - Delivered in 1974
  - 4,800 transistors
  - Performance < 0.2 MIPS
• Used in Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
    Homebrew Computer Club
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981, IBM introduces PC
  - Based on 8088: 8-bit bus version of 8086
Second Generation: 1979-85
Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture, initial 16-bit implementation
  - First flat 32-bit address
    Support for paging
  - General-purpose register architecture
    Loosely based on PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation: 1985-89
Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation: 1990-
Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985 to the present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors
    Architectural ideas turn transistors into performance
  - Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
  - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
    early restart after cache miss (1992)
    nonblocking caches: continue during a cache miss (1994)
  - Cache-aware combos: computers, compilers, code writers
    prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock
    superscalar: uses dynamic issue decision (HW driven)
    VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  - determine parallelism dynamically
  - execute instructions out-of-order
  - speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  - On-chip
  - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out-of-order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in-flight)
  - Maintains precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Use compiler-centric approach while avoiding disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25 M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software Techniques
  - Static scheduling
  - Static issue (i.e. VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
• Hardware Techniques
  - Dynamic scheduling
  - Dynamic issue (i.e. superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
Wide variety of approaches, both hardware and compiler intensive
• Software techniques: lower hardware complexity, more longer-range analysis, more machine dependence
• Hardware techniques: more stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present
Big Picture - ILP and Memory Systems
[Figure, "ILP Mountain": simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation]
[Figure, "Cache Mountain": simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (0.1 to 100, logarithmic scale) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (1,000 to 100,000,000, logarithmic scale) versus year (1970 to 2005) for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000, across the eras of bit-level, instruction-level and thread-level (?) parallelism]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The more pronounced are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP).
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyper-Threading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4).
They support 2-4 threads, but only about a 1.3X improvement in throughput should be expected.
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
• 4-8 general-purpose processing engines on chip, used to execute
  - independent programs
  - explicitly parallel programs (when possible)
  - speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view on the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching,
• distance learning,
• electronic books, and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Process Complexity
Shrinking linewidths isn't free. Linewidth shrinks require process modifications to deal with a variety of issues that come up from shrinking the devices, leading to increasing complexity in the processes being used.
Moore's Law 2 (Rock's Law)
In 1996 Intel augmented Moore's law (the number of transistors on a processor doubles approximately every 18 months) with Moore's law 2.
Law 2 says that as the sophistication of chips increases, the cost of fabrication rises exponentially.
The cost of semiconductor tools doubles every four years. By this logic, chip fabrication plants, or fabs, were supposed to cost $5 billion each by the late 1990s and $10 billion by now.
Moore's Law 2 (Rock's Law) - continued
For example, in 1986 Intel manufactured the 386, which counted 250,000 transistors, in fabs costing $200 million. In 1996, a $2 billion facility was needed to produce the Pentium processor, which counted 6 million transistors.
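The two fab costs quoted above can be turned into an implied doubling period, assuming purely exponential growth between the two data points. The helper name is illustrative; the dollar figures are the ones in the text.

```python
import math

def implied_doubling_period(cost_start, cost_end, years):
    """Years per doubling implied by exponential growth between two data points."""
    return years * math.log(2) / math.log(cost_end / cost_start)

# Fab costs quoted in the text: $200 million (1986) and $2 billion (1996).
period = implied_doubling_period(200e6, 2e9, 1996 - 1986)
print(f"implied doubling period: {period:.2f} years")
```

A tenfold increase over ten years corresponds to a doubling roughly every three years, so this particular example grows somewhat faster than the four-year rule of thumb.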
Moore's Law 2 (Rock's Law)
The Cost of Semiconductor Tools Doubles Every Four Years
Machrone's Law
The PC you want to buy will always be $5000.
Metcalfe's Law
A network's value grows proportionately to the number of its users squared.
Wirth's Law
Software is slowing faster than hardware is accelerating.
Performance and new technology generation
According to Moore's law, each new generation has approximately doubled logic circuit density and increased performance by about 40%, while quadrupling memory capacity.
The increase in components per chip comes from the following key factors:
• A factor of two in component density comes from a 2^0.5 shrink in each lithography dimension (2^0.5 per x and 2^0.5 per y).
• An additional factor of 2^0.5 comes from an increase in chip area.
• A final factor of 2^0.5 comes from device and circuit cleverness.
Together these give 2 x 2^0.5 x 2^0.5 = 4, the quadrupling per generation.
Development in ICs
Semiconductor Industry Association Roadmap Summary for high-end Processors
Specification / year | 1997 | 1999 | 2001 | 2003 | 2006 | 2009 | 2012
Feature size (micron) | 0.25 | 0.18 | 0.15 | 0.13 | 0.1 | 0.07 | 0.05
Supply voltage (V) | 1.8-2.5 | 1.5-1.8 | 1.2-1.5 | 1.2-1.5 | 0.9-1.2 | 0.6-0.9 | 0.5-0.6
Transistors/chip (millions) | 11 | 21 | 40 | 76 | 200 | 520 | 1400
DRAM bits/chip (mega) | 167 | 1070 | 1700 | 4290 | 17200 | 68700 | 275000
Die size (mm2) | 300 | 340 | 385 | 430 | 520 | 620 | 750
Global clock freq (MHz) | 750 | 1200 | 1400 | 1600 | 2000 | 2500 | 3000
Local clock freq (MHz) | 750 | 1250 | 1500 | 2100 | 3500 | 6000 | 10000
Maximum power/chip (W) | 70 | 90 | 110 | 130 | 160 | 170 | 175
Total transistors/chip
[Figure: number of transistors per chip (millions, 0 to 1600) versus technology (micron) and year, from 0.25 micron in 1997 to 0.05 micron in 2012]
Clock Frequency Versus Year for Various Representative Machines
Limiting in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range.
For higher frequency clocking (> 10 GHz), new ideas and new ways of designing digital systems are needed.
[Figure: global and local clock frequency (MHz, 0 to 12,000) versus technology (micron) and year, from 0.25 micron in 1997 to 0.05 micron in 2012]
Intel's Microprocessors Clock Frequency
Technology Trends - Example
As an illustration of just how quickly computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly.
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987, and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l.
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in under 39 min and to make the round trip on less than 10 liters of gasoline.
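The compound-growth arithmetic in this example can be reproduced in a few lines. The function name is illustrative, and the results differ slightly from the figures above because the text rounds the growth factors to 20.1 and 194.6 before multiplying.

```python
def compound(value, rate, years):
    """Value after compounding a yearly improvement rate for a number of years."""
    return value * (1 + rate) ** years

speed_1987 = compound(150, 0.35, 10)             # km/h
economy_1987 = compound(10, 0.35, 10)            # km/l
speed_2000 = compound(speed_1987, 0.50, 13)      # km/h
economy_2000 = compound(economy_1987, 0.50, 13)  # km/l

print(round(speed_1987), round(economy_1987))
print(round(speed_2000), round(economy_2000))
```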
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law.
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available.
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm.
Ideally, process technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7.
With such scaling, typical improvement figures are the following:
• 1.4-1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry.
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and integration (from ~10K to 1M devices) in 25 years.
Microarchitecture attempts to increase both IPC and frequency.
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction, and out-of-order execution can increase instructions per cycle (IPC).
Pipelining, as a microarchitecture idea, helps to increase frequency.
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program.
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages.
With frequencies higher than 1 GHz, more than 20 pipeline stages are used.
[Figure: relative improvement (0 to 20) in frequency, CPI, performance and power versus pipeline depth]
Performance of memory and CPU
Memory in a computer system is hierarchically organized.
In 1980, microprocessors were often designed without caches.
Nowadays, microprocessors often come with two levels of caches.
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved by 35% per year until 1986 and by 55% per year since 1987.
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed.
Relative processor/memory speed
Types of Memories
MOS memories
• RAMs (volatile: contents lost at power off)
  - DRAM
  - SRAM
• ROMs (non-volatile: contents kept at power off)
  - ROM
  - EPROM
  - EEPROM
  - FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2-4.5 CPU cycles per instruction retired.
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed.
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse.
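The "three out of four cycles" figure in this anecdote can be checked from the measured cycles per instruction, under the simplifying assumption (made here for illustration only) that every productive cycle retires exactly one instruction.

```python
def zero_retire_fraction(cpi):
    """Fraction of cycles retiring nothing, assuming one instruction per busy cycle."""
    return 1 - 1 / cpi

low = zero_retire_fraction(4.2)
high = zero_retire_fraction(4.5)
print(f"{low:.0%} to {high:.0%} of cycles retire no instruction")
```

Because both processors can retire more than one instruction in a cycle, the real share of empty cycles would be at least this large.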
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams.
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy.
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns.
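The pin-bandwidth arithmetic above can be verified directly. The function name is illustrative; the byte counts and periods are the ones in the text, with 1 GByte taken as 10^9 bytes.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer, period_ns):
    """Bytes per nanosecond is numerically equal to GBytes/s (10^9 bytes)."""
    return bytes_per_transfer / period_ns

today = bandwidth_gbytes_per_s(16, 8)          # 16 bytes every 8 ns
future = today * 2 * 2 * 2                     # clock rate x issue width x streams
check = bandwidth_gbytes_per_s(32, 2)          # 32 bytes every 2 ns
print(today, future, check)
```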
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance.
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components.
Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data.
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory.
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip.
The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses.
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.
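To see what these hit rates mean in practice, the sketch below computes an average memory access time for a two-level hierarchy. It picks the upper end of the quoted access times (3 and 10 cycles), the quoted hit rates (95% and 50%), and a main-memory access of about 100 cycles; the formula is the usual textbook approximation and the function name is illustrative.

```python
def average_access_cycles(l1_hit, l1_cycles, l2_hit, l2_cycles, memory_cycles):
    """Average memory access time for a two-level cache hierarchy, in cycles."""
    l1_miss = 1 - l1_hit
    l2_miss = 1 - l2_hit
    return l1_cycles + l1_miss * (l2_cycles + l2_miss * memory_cycles)

amat = average_access_cycles(0.95, 3, 0.50, 10, 100)
print(f"average access time: {amat:.1f} cycles")
```

With these numbers the average access costs about 6 cycles, compared with roughly 100 cycles if every access went to main memory.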
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles.
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
As conclusion concerning memory - problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems: caches, buses, bandwidth, and latency.
As conclusion concerning memory - problems continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture-level and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change.
b) Burst width, which refers to data access granularity, can effect a 15% performance change.
c) Magnetic RAM (MRAM), a new type of memory.
Magnetic RAM (MRAM) or Nanotech RAM (NRAM)
Based on nanoscale semiconductor technology.
A nanotechnology RAM device consists of tiny carbon nanotubes.
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes, each 1 nm in diameter (just a few thousand atoms), on a silicon wafer.
MRAM is nonvolatile and has the fast read-and-write performance of static RAM (SRAM).
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule (nm, 0 to 120) and density (Gb, log scale) of FLASH and DRAM versus year, 2003 to 2010. Plot labels include design rules of 100 nm, 90 nm, 70 nm, and 55 nm, densities up to 16 Gb, and SLC, MLC (2 bits/cell), and MLC (3 bits/cell) devices.]
Surpassing the Prediction from Moore’s Law: DRAM vs MRAM
The famous Moore’s Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
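To see how quickly the two growth models diverge, the following minimal sketch compares them. It assumes the Moore's Law figure on this slide is a doubling period of 1.5 years; the function name and the sampled time spans are illustrative only.

```python
def density_growth(years, doubling_period_years):
    """Growth factor after `years` when density doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


for years in (3, 6, 9):
    moore = density_growth(years, 1.5)   # assumed Moore's Law pace
    flash = density_growth(years, 1.0)   # "twofold per year" growth model
    print(f"after {years} years: x{moore:.0f} (1.5-year doubling) vs x{flash:.0f} (yearly doubling)")
```

Over nine years the yearly-doubling model ends up eight times ahead of the 1.5-year model.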
[Figure: density (Gb, log scale from 0.1 to 100) versus year, 2000 to 2010, for FLASH (twofold density per year, with MLC 2 bits/cell and MLC 3 bits/cell devices) and DRAM, compared with the Moore’s Law trend.]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep leading the overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore’s Law.
Overall memory prediction roadmap (continued)
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
Mainframe
Supercomputer
Outline
• Technology Trends
• Process Technology Challenges: Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
• During 1995, the energy consumption of all PC machines installed in the USA was 60 × 10^6 MWh
• During 2000, the energy consumption of all PC machines installed in the USA was 10% of the total energy production
• During 2015, one expects that the energy consumption of all PC machines will be 15% greater than in 1995, or 69 × 10^6 MWh
Power consumption
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
“CMOS circuits dissipate little power by nature. So believed circuit designers.” (Kuroda-Sakurai, ’95)
“By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced.”
Power dissipation in time
[Figure: power (W, log scale from 0.01 to 100) versus year, 1980 to 1995; the trend grows ×4 every 3 years.]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage (V), power per chip (W), and VDD current (A) versus year, 1998 to 2014.]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA), and the Taiwan Semiconductor Industry Association (TSIA).
(Taken from Sakurai’s ISSCC 2001 presentation)
Power Delivery Problem (not just California)
Your car starter
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
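The power expression can be evaluated term by term. The following is a minimal sketch with hypothetical values: the function name and the operating point (activity factor, capacitance, currents) are assumptions chosen only for illustration, not data from the slides.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leakage, p_static=0.0):
    """Return (P1, P2, P3, P4, total) in watts for a digital CMOS circuit."""
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leakage * v_dd                   # leakage current power
    p4 = p_static                           # other static dissipation
    return p1, p2, p3, p4, p1 + p2 + p3 + p4


# Hypothetical operating point: 10% activity, 1 nF switched load,
# 1.8 V supply, 100 MHz clock, 1 mA short-circuit and 0.1 mA leakage current
p1, p2, p3, p4, total = average_power(
    p_t=0.1, c_load=1e-9, v_dd=1.8, f_clk=100e6, i_sc=1e-3, i_leakage=1e-4)
print(f"switching share of total power: {p1 / total:.0%}")
```

With these assumed numbers the switching term dominates, which matches the slide's label of P1 as the dominant component.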
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
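The different leverage of the three knobs can be checked numerically. This sketch assumes the switching-power expression p_t · C_L · V_dd² · f_CLK used elsewhere in the deck; the baseline values and helper names are hypothetical.

```python
def switching_power(p_t, c_load, v_dd, f_clk):
    """Capacitive switching power: p_t * C_L * Vdd^2 * f_clk."""
    return p_t * c_load * v_dd ** 2 * f_clk


def relative_saving(baseline, changed):
    """Fraction of the baseline power that is saved."""
    return 1.0 - changed / baseline


base = dict(p_t=0.2, c_load=2e-9, v_dd=3.0, f_clk=50e6)  # hypothetical baseline
p_base = switching_power(**base)
for knob in ("v_dd", "c_load", "p_t"):
    halved = dict(base, **{knob: base[knob] / 2})
    saving = relative_saving(p_base, switching_power(**halved))
    print(f"halving {knob}: {saving:.0%} less switching power")
```

Halving the supply voltage removes three quarters of the switching power, while halving the load capacitance or the switching activity removes one half.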
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V), from 0.6 V to 5.0 V.]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is decreasing the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.

a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way of reducing power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product ptCL is called the average switched capacitance per cycle, and the main directions for reducing this capacitance are at the system, architectural, RTL, circuit, or technology level.
General Approaches to Reduce Power
Low Power and Low Energy System Design
The design of low power circuits can be tackled at different levels, from system to technology. The higher levels offer higher impact and more options.
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention.
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), each delivering its own clock frequencies (f11, f12; f21, f21; f31, f32; f41, f41).]
Energy Minimization Using Multiple Frequency
[Figure: two clock-generation schemes feeding a digital system. PLL based: phase detector (comparing CLKREF and CLKFB, with up/down outputs), current pump, loop filter, VCO, and divider by N, together with regulated voltage, clock distribution and frequency multiplier, and logic. DLL based: phase detector, current pump, loop filter, and control of a voltage-controlled delay line (VCDL, delay TVCDL) built from delay cells DC1 to DCn, producing f1, 2f1, ..., nf1.]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating and clock distribution. A latch samples the enable signal and combines it with the clock to form the gated clock driving the target flip-flops (DFF with D, C, and Q pins), shown in activated and deactivated intervals. A PLL clock generator distributes Clk to modules A, B, C, and D through Enable_A, Enable_B, and Enable_C.]
- The use of gated clocks is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
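The effect of suppressing the clock can be illustrated with a cycle-level model. This is a simplified sketch, not the circuit from the slide: the gated clock is modelled as clock AND enable, the enable is assumed stable for the whole cycle (the role of the latch), and the activity trace is hypothetical.

```python
def clock_pulses_delivered(enable_per_cycle):
    """Number of clock pulses that reach the target flip-flops.

    One entry per clock cycle; a pulse is delivered only in cycles
    where the enable signal is asserted.
    """
    return sum(1 for enabled in enable_per_cycle if enabled)


def clock_activity_saving(enable_per_cycle):
    """Fraction of clock pulses suppressed by gating."""
    cycles = len(enable_per_cycle)
    return 1.0 - clock_pulses_delivered(enable_per_cycle) / cycles


# Hypothetical module that is busy in 2 out of every 8 cycles
trace = [True, True, False, False, False, False, False, False] * 100
print(f"clock pulses suppressed: {clock_activity_saving(trace):.0%}")
```

Because clock power scales with the number of pulses delivered to the flip-flops, the suppressed fraction is a first-order estimate of the clock energy saved in that module.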
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
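A rough estimate of the benefit follows from switching energy being proportional to C·V². The sketch below is illustrative only: the module list, capacitances, and voltage levels are hypothetical, and the cost of level converters between voltage domains is ignored.

```python
def switching_energy(modules, v_high, v_low=None):
    """Relative switching energy, proportional to the sum of C * V^2.

    modules: list of (capacitance, on_critical_path) pairs.
    If v_low is None, every module runs at v_high (single supply).
    """
    total = 0.0
    for capacitance, on_critical_path in modules:
        v = v_high if (on_critical_path or v_low is None) else v_low
        total += capacitance * v ** 2
    return total


# Hypothetical design: a quarter of the switched capacitance is on critical paths
modules = [(1.0, True), (3.0, False)]
single = switching_energy(modules, v_high=1.8)
dual = switching_energy(modules, v_high=1.8, v_low=1.2)
print(f"energy saved with two supplies: {1 - dual / single:.0%}")
```

The critical-path modules keep their speed at the high voltage, while the saving comes entirely from the noncritical modules.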
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure: power manager and power state machine. The power manager consists of an observer and a controller; it receives observations and workload information and sends commands to the system. The power state machine has the states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt), and SLEEP (P = 0.16 mW, wait for wake-up event), with state transitions labelled ~10 µs, ~90 µs, and 160 ms.]
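The benefit of moving between power states can be estimated from the time spent in each state. The sketch below reads the state powers from the figure as 400 mW (RUN), 50 mW (IDLE), and 0.16 mW (SLEEP); the workload split is hypothetical, and the time and energy cost of state transitions is ignored.

```python
# State powers in mW, as read from the power state machine figure
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(time_fraction):
    """Average power for the given fraction of time spent in each state."""
    if abs(sum(time_fraction.values()) - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(STATE_POWER_MW[state] * frac for state, frac in time_fraction.items())


# Hypothetical workload: busy 10% of the time, lightly idle 30%, asleep 60%
always_on = average_power_mw({"RUN": 1.0})
managed = average_power_mw({"RUN": 0.1, "IDLE": 0.3, "SLEEP": 0.6})
print(f"{always_on:.0f} mW always running vs {managed:.1f} mW with power management")
```

A real power manager also has to weigh the wake-up latency from SLEEP against the expected idle time, which this simple average does not capture.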
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown among clock, memory, control, I/O, and datapath (shares shown: 12, 16, 25, 43). Dynamic instruction statistics among compare, logical, data move, control flow, arithmetic, and other operations (shares shown: 15, 13, 5, 1, 23, 43).]
Architecture Trade-offs: Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges: Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline: Microprocessors’ Generations
• First Generation 1971-78
  - Behind the power curve
• Second Generation 1979-85
  - Becoming “real” computers
• Third Generation 1985-89
  - Challenging the “establishment”
• Fourth Generation 1990-
  - Architectural and performance leadership
The microprocessor today
When we say “microprocessor” today, we generally mean the shaded area of the figure.
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3500 transistors
  - First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• Intel’s first 16-bit architecture
  - Delivered in 1974
  - 4800 transistors
  - Performance < 0.2 MIPS
• Used in Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
    Homebrew Computer Club
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - “Assembly language” compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
  - Based on the 8088, an 8-bit bus version of the 8086
Second Generation 1979-85
• Becoming “real” computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture, initial 16-bit implementation
  - First flat 32-bit address
    Support for paging
  - General-purpose register architecture
    Loosely based on PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the “establishment”
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985 to the present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors
    Architectural ideas turn transistors into performance
    Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
  - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches: add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
    Early restart after cache miss (1992)
    Nonblocking caches: continue during a cache miss (1994)
  - Cache-aware combos: computers, compilers, code writers
    Prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock
    Superscalar: uses dynamic issue decision (HW driven)
    VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  - Determine parallelism dynamically
  - Execute instructions out of order
  - Speculative execution depending on branch prediction
• “Off-the-shelf” ILP techniques yielded 20-year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  - On-chip
  - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as attached processor (e.g., graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out of order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in flight)
  - Maintains precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Uses a compiler-centric approach while avoiding its disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today’s Uniprocessor ILP Menu
• Software Techniques
  - Static scheduling
  - Static issue (i.e., VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
  Lower hardware complexity; more longer-range analysis; more machine dependence
• Hardware Techniques
  - Dynamic scheduling
  - Dynamic issue (i.e., superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
  More stable performance; higher complexity; potential clock rate impact
Wide variety of approaches, both hardware and compiler intensive
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two “mountains” of techniques. ILP mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale from 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers, and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale from 1,000 to 100,000,000) versus year, 1970 to 2005, for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000, with successive eras of bit-level, instruction-level, and thread-level parallelism.]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel’s processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most pronounced are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a (single) processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today’s processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
They support 2-4 threads, but expect to get only a 1.3X improvement in throughput.
Chip Multiprocessor
• Several processor cores in one die
• Shared L2 caches
• Chip communication to build multichip modules with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
• 4-8 general-purpose processing engines on chip, used to execute independent programs
  - Explicitly parallel programs (when possible)
  - Speculatively parallel threads
• Special-purpose processing units (e.g., DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures

| Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor |
|---|---|---|---|
| Number of CPUs | 1 | 1 | 8 |
| CPU issue width | 12 | 12 | 2 per CPU |
| Number of threads | 1 | 8 | 1 per CPU |
| Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU |
| Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU |
| Instruction window size | 256 | 256 | 32 per CPU |
| Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096 |
| Return stack size | 64 entries | 64 entries | 8 x 8 entries |
| Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank |
| I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU |
| I and D cache associativities | 4-way | 4-way | 4-way |
| I and D cache line sizes (bytes) | 32 | 32 | 32 |
| I and D cache access times (cycles) | 2 | 2 | 1 |
| Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks |
| Secondary cache size (Mbytes) | 8 | 8 | 8 |
| Secondary cache associativity | 4-way | 4-way | 4-way |
| Secondary cache line size (bytes) | 32 | 32 | 32 |
| Secondary cache access time (cycles) | 5 | 5 | 7 |
| Secondary cache occupancy per access (cycles) | 1 | 1 | 1 |
| Memory organization (no. of banks) | 4 | 4 | 4 |
| Memory access time (cycles) | 50 | 50 | 50 |
| Memory occupancy per access (cycles) | 13 | 13 | 13 |
The microprocessor tomorrow
When we say “microprocessor” tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges: Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that “where you stand depends on where you sit.”
In this context, starting from our positions and experiences, this is our view concerning the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today’s graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now: some examples
• Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
• Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
• Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
• Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. They, too, are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Moore’s Law 2 (Rock’s Law)
In 1996 Intel augmented Moore’s law (the number of transistors on a processor doubles approximately every 18 months) with Moore’s law 2.
Law 2 says that as the sophistication of chips increases, the cost of fabrication rises exponentially.
The cost of semiconductor tools doubles every four years. By this logic, chip fabrication plants, or fabs, were supposed to cost $5 billion each by the late 1990s and $10 billion by now.
Moore’s Law 2 (Rock’s Law) (continued)
For example, in 1986 Intel manufactured the 386, which counted 250,000 transistors, in fabs costing $200 million. In 1996, a $2 billion facility was needed to produce the Pentium processor, which counted 6 million transistors.
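The fab-cost example can be checked against the "doubles every four years" rule. The sketch below is only arithmetic on the two data points quoted on the slide ($200 million in 1986, $2 billion in 1996); the helper names are hypothetical.

```python
import math


def implied_doubling_years(cost_start, cost_end, years):
    """Doubling period implied by exponential growth between two costs."""
    return years * math.log(2) / math.log(cost_end / cost_start)


def growth_from_doubling(years, doubling_years):
    """Growth factor after `years` for a given doubling period."""
    return 2.0 ** (years / doubling_years)


observed = implied_doubling_years(200e6, 2e9, 10)
print(f"doubling period implied by the 386 and Pentium fabs: {observed:.1f} years")
print(f"growth over 10 years at a four-year doubling: x{growth_from_doubling(10, 4):.1f}")
```

The two fab costs on the slide imply a doubling period of roughly three years, so this particular example grew somewhat faster than the four-year rule would predict.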
Moore’s Law 2 (Rock’s Law)
The cost of semiconductor tools doubles every four years
Machrone’s Law
The PC you want to buy will always be $5000
Metcalfe’s Law
A network’s value grows proportionately to the number of its users squared
Wirth’s Law
Software is slowing faster than hardware is accelerating
Performance and new technology generation
According to Moore’s law, each new generation has approximately doubled logic circuit density and increased performance by about 40%, while quadrupling memory capacity.
The increase in components per chip comes from the following key factors:
• The factor of two in component density comes from a $2^{0.5}$ shrink in each lithography dimension ($2^{0.5}$ per x and $2^{0.5}$ per y).
• An additional factor of $2^{0.5}$ comes from an increase in chip area.
• A final factor of $2^{0.5}$ comes from device and circuit cleverness.
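Multiplying the individual factors shows how they add up to the quadrupling mentioned above. This sketch assumes each factor on the slide is the square root of two; the variable names are illustrative.

```python
import math

shrink_x = math.sqrt(2)       # lithography shrink in the x dimension
shrink_y = math.sqrt(2)       # lithography shrink in the y dimension
chip_area = math.sqrt(2)      # increase in chip area
cleverness = math.sqrt(2)     # device and circuit cleverness

density_factor = shrink_x * shrink_y
components_factor = density_factor * chip_area * cleverness

print(f"component density per generation: x{density_factor:.2f}")
print(f"components per chip per generation: x{components_factor:.2f}")
```

The lithography shrink alone doubles density, and the two remaining factors together contribute the second doubling in components per chip.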
Development in ICs
Semiconductor Industry Association Roadmap Summary for high-end Processors

| Specification / year | 1997 | 1999 | 2001 | 2003 | 2006 | 2009 | 2012 |
|---|---|---|---|---|---|---|---|
| Feature size (micron) | 0.25 | 0.18 | 0.15 | 0.13 | 0.1 | 0.07 | 0.05 |
| Supply voltage (V) | 1.8-2.5 | 1.5-1.8 | 1.2-1.5 | 1.2-1.5 | 0.9-1.2 | 0.6-0.9 | 0.5-0.6 |
| Transistors/chip (millions) | 11 | 21 | 40 | 76 | 200 | 520 | 1400 |
| DRAM bits/chip (mega) | 167 | 1070 | 1700 | 4290 | 17200 | 68700 | 275000 |
| Die size (mm2) | 300 | 340 | 385 | 430 | 520 | 620 | 750 |
| Global clock freq (MHz) | 750 | 1200 | 1400 | 1600 | 2000 | 2500 | 3000 |
| Local clock freq (MHz) | 750 | 1250 | 1500 | 2100 | 3500 | 6000 | 10000 |
| Maximum power/chip (W) | 70 | 90 | 110 | 130 | 160 | 170 | 175 |
[Figure: total transistors per chip (millions, axis 0 to 1600) versus technology (micron) and year, 1997 (0.25) to 2012 (0.05)]
Clock Frequency Versus Year for Various Representative Machines
Limits in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range.
For higher frequency clocking (>10 GHz), new ideas and new ways of designing digital systems are needed.
[Figure: global and local clock frequency (MHz, axis 0 to 12000) versus technology (micron) and year, 1997 (0.25) to 2012 (0.05)]
Intel's Microprocessors Clock Frequency
Technology Trends - Example
As an illustration of just how quickly computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly.
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987, and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l.
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in about 39 min and to make the one-way trip on less than 10 liters of gasoline.
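The compound-growth arithmetic of this example can be reproduced with a few lines of Python. This is only a sketch; the helper name `improve` is an assumption, and because the slide rounds the growth factors (20.1 and 194.6) before multiplying, the unrounded results differ slightly from the slide's figures.

```python
def improve(value, rate, years):
    """Compound `value` at `rate` per year for `years` years."""
    return value * (1.0 + rate) ** years


speed_1977, economy_1977 = 150.0, 10.0   # km/h and km/l, from the slide

# 35% per year for the 10 years from 1977 to 1987
speed_1987 = improve(speed_1977, 0.35, 10)
economy_1987 = improve(economy_1977, 0.35, 10)

# 50% per year for the 13 years from 1987 to 2000
speed_2000 = improve(speed_1987, 0.50, 13)
economy_2000 = improve(economy_1987, 0.50, 13)

print(f"1987: {speed_1987:,.0f} km/h, {economy_1987:,.0f} km/l")
print(f"2000: {speed_2000:,.0f} km/h, {economy_2000:,.0f} km/l")
```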
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law.
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available.
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm.
Ideally, processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7.
With such scaling, typical improvement figures are the following:
• 1.4-1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry.
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and 1000 times in integration (from ~10K to ~10M devices) in 25 years.
Microarchitecture attempts to increase both IPC and frequency
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction, and out-of-order execution can increase instructions per cycle (IPC).
Pipelining, as a microarchitecture idea, helps to increase frequency.
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program.
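The three levers named above (IPC, frequency, and dynamic instruction count) combine in the usual execution-time relation, time = instructions / (IPC × frequency). The sketch below illustrates it; the function name and all numeric values are hypothetical and not taken from the slides.

```python
def execution_time(instructions, ipc, frequency_hz):
    """Seconds needed to run a program: instructions / (IPC * frequency)."""
    return instructions / (ipc * frequency_hz)


# Hypothetical baseline: 1e9 dynamic instructions, IPC of 1, 1 GHz clock.
baseline = execution_time(1e9, ipc=1.0, frequency_hz=1e9)

# Each lever improved on its own (illustrative numbers only).
better_ipc = execution_time(1e9, ipc=1.5, frequency_hz=1e9)        # caches, prediction
better_clock = execution_time(1e9, ipc=1.0, frequency_hz=2e9)      # deeper pipeline
fewer_instructions = execution_time(0.8e9, ipc=1.0, frequency_hz=1e9)  # ISA, compiler

for name, t in [("baseline", baseline), ("higher IPC", better_ipc),
                ("higher frequency", better_clock),
                ("fewer instructions", fewer_instructions)]:
    print(f"{name:20s} {t:.3f} s")
```

In practice the levers interact: a deeper pipeline raises frequency but tends to lower IPC, which is why they are treated separately here only as a first approximation.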
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages.
With frequencies higher than 1 GHz, more than 20 pipeline stages are used.
[Figure: relative improvement in frequency, CPI, performance, and power versus pipeline depth]
Performance of memory and CPU
Memory in a computer system is hierarchically organized.
In 1980, microprocessors were often designed without caches.
Nowadays, microprocessors often come with two levels of caches.
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987.
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed.
Relative processor/memory speed
Types of Memories
MOS memories
• RAMs (DRAM, SRAM): VOLATILE, contents lost at power off
• ROMs (ROM, EPROM, EEPROM, FLASH): NON VOLATILE, contents kept at power off
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2-4.5 CPU cycles per instruction retired.
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed.
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
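The "three out of every four cycles" figure follows directly from the measured cycles per instruction, reading the slide's values as 4.2 to 4.5. A minimal sketch, assuming only that at least one instruction retires in every productive cycle (the function name is illustrative):

```python
def min_idle_cycle_fraction(cpi):
    """Lower bound on the fraction of cycles that retire no instruction.

    With an average of `cpi` cycles per retired instruction, at most
    1/cpi of the cycles can be productive if each productive cycle
    retires exactly one instruction. A wider machine that retires
    several instructions in one cycle leaves even more cycles empty.
    """
    return 1.0 - 1.0 / cpi


for cpi in (4.2, 4.5):
    share = min_idle_cycle_fraction(cpi)
    print(f"CPI {cpi}: at least {share:.0%} of cycles retire nothing")
```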
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams.
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy.
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns.
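The pin-bandwidth numbers in this anecdote can be checked with a short calculation. The sketch below reads the slide's figures as 2 GBytes/s today and 16 GBytes/s in the future; the helper name is an assumption, and 1 GByte is taken as 10^9 bytes so that bytes per nanosecond equal GBytes/s.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer, period_ns):
    """Pin bandwidth in GBytes/s; bytes per nanosecond equals 1e9 bytes/s."""
    return bytes_per_transfer / period_ns


today = bandwidth_gbytes_per_s(16, 8)   # 16 bytes every 8 ns

# Twice the clock rate, twice the issue width, two instruction streams.
future = today * 2 * 2 * 2

print(f"today:  {today:.0f} GBytes/s")
print(f"future: {future:.0f} GBytes/s")
print(f"32 bytes every 2 ns: {bandwidth_gbytes_per_s(32, 2):.0f} GBytes/s")
```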
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance.
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components.
Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data.
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory.
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip.
The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses.
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.
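To see what these hit rates and latencies mean together, one can estimate an average access time. The sketch below is a simplified model, assuming the levels are probed one after another, a first-level hit rate of 95%, a second-level hit rate of 50%, and roughly 100 cycles for main memory; the function name and the specific cycle counts picked from the quoted ranges are assumptions.

```python
def average_access_cycles(l1_cycles, l1_hit_rate, l2_cycles, l2_hit_rate,
                          memory_cycles):
    """Average cycles per access for a two-level cache in front of DRAM.

    Every access pays the L1 latency; L1 misses also pay the L2 latency;
    L2 misses additionally pay the main-memory latency.
    """
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * memory_cycles)


with_l2 = average_access_cycles(2, 0.95, 8, 0.50, 100)
# Same model without a useful second level (it never hits and costs nothing).
without_l2 = average_access_cycles(2, 0.95, 0, 0.0, 100)

print(f"two cache levels: {with_l2:.1f} cycles per access")
print(f"one cache level:  {without_l2:.1f} cycles per access")
```

Even with 95% of accesses caught by the first level, the rare trips to main memory dominate the average, which is the point the slides make about the memory hierarchy.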
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles.
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
Conclusion concerning memory - problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems: caches, buses, bandwidth, and latency.
Conclusion concerning memory - problems (continued)
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM), a new type of memory
Magnetic RAM - MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology.
The nanotechnology RAM device consists of tiny carbon nanotubes.
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm - just a few thousand atoms - in diameter, on a silicon wafer.
MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM).
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule (nm, axis 0 to 120) and density (Gb, log axis 0.1 to 100) of DRAM and FLASH, SLC and MLC (2 and 3 bits/cell), for 2003, 2005 and 2010]
Surpassing the Prediction from Moore's Law - DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
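The difference between the two doubling periods compounds quickly. The sketch below compares them over a six-year span, reading the slide's Moore's Law period as 1.5 years; the span and the function name are illustrative assumptions.

```python
def density_growth(years, doubling_period_years):
    """Growth factor after `years` when density doubles every period."""
    return 2.0 ** (years / doubling_period_years)


years = 6
moore = density_growth(years, 1.5)   # doubling every 1.5 years
flash = density_growth(years, 1.0)   # doubling every year (NAND Flash model)

print(f"after {years} years: Moore's Law {moore:.0f}x, yearly doubling {flash:.0f}x")
```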
[Figure: density (Gb, log axis 0.1 to 100) versus year, 2000 to 2010, for FLASH (MLC with 2 and 3 bits/cell; two-fold density per year) and DRAM, compared with the Moore's Law trend]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore's Law.
Overall memory prediction roadmap - cont.
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
• During 1995, the energy consumption of all PC machines installed in the USA was 60×10^6 MWh
• During 2000, the energy consumption of all PC machines installed in the USA was 10% of the total energy production
• During 2015, one expects that the energy consumption of all PC machines will be 15% greater than in 1995, or 69×10^6 MWh
Power consumption
Typical Low-Power Applications
• battery operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai 95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: power (W, log axis 0.01 to 100) versus year, 1980 to 1995; trend x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage (V), power per chip (W), and VDD current (A) versus year, 1998 to 2014]
International Technology Roadmap for Semiconductors 1999 update sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA) Electronic Industries Association of Japan (EIAJ) Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption - New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 - capacitive switching power (dynamic - dominant)
P2 - short circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
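The terms P1 to P4 can be evaluated numerically as a sanity check. The sketch below assumes the usual forms P1 = pt·CL·Vdd²·fCLK, P2 = Isc·Vdd and P3 = Ileakage·Vdd; the function name and every numeric value are hypothetical and chosen only for illustration.

```python
def power_terms(p_t, c_load, v_dd, f_clk, i_sc, i_leakage, p_static=0.0):
    """Return (P1, P2, P3, P4) in watts for a digital CMOS circuit.

    P1: capacitive switching power = p_t * C_L * Vdd^2 * f_CLK
    P2: short circuit power        = I_sc * Vdd
    P3: leakage current power      = I_leakage * Vdd
    P4: other static dissipation, passed in directly
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk
    p2 = i_sc * v_dd
    p3 = i_leakage * v_dd
    return p1, p2, p3, p_static


# Hypothetical chip: activity 0.2, 1 nF total load, 1.5 V supply, 1 GHz clock.
terms = power_terms(p_t=0.2, c_load=1e-9, v_dd=1.5, f_clk=1e9,
                    i_sc=0.01, i_leakage=0.005)
print("P1..P4 [W]:", [round(p, 4) for p in terms])
print("P_avg  [W]:", round(sum(terms), 4))

# Halving Vdd cuts the switching term P1 to one quarter.
p1_full = power_terms(0.2, 1e-9, 1.5, 1e9, 0.0, 0.0)[0]
p1_half = power_terms(0.2, 1e-9, 0.75, 1e9, 0.0, 0.0)[0]
print("P1 ratio after halving Vdd:", round(p1_half / p1_full, 2))
```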
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V), 0.6 to 5.0 V]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is:
Decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps:
a) identifying idle or low-active conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-active components
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product ptCL is called the average switched capacitance per cycle, and this capacitance can be reduced at the system, architectural, RTL, circuit, or technology level
General Approaches to Reduce Power
Low Power and Low Energy System Design
[Diagram annotation: higher impact, more options]
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
The design of low power circuits can be tackled at different levels from system to technology
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that attracts more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL (fCLK) distributing the clock to four local PLL/DLL blocks (PLL 1/DLL 1 to PLL 4/DLL 4) with local frequencies f11, f12, f21, f31, f32, f41]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation (phase detector, current pump, loop filter, VCO, divider by N, regulated voltage, clock distribution & frequency multiplier, digital system) and DLL-based clock generation (phase detector, current pump, loop filter, VCDL, control logic, outputs f1, 2f1 ... nf1, digital system)]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating (a latch and an enable signal produce a gated clock for the target flip-flops) and clock distribution (a PLL clock generator drives blocks A, B, C and D, with signals Enable_A, Enable_B and Enable_C marking blocks as activated or deactivated)]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
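The effect of suppressing the clock to an unused module can be illustrated with a small behavioral model. This is only a sketch, not a circuit description: it assumes that the clock-related switching energy of a module is proportional to the number of clock edges it receives, and the activity pattern and function name are hypothetical.

```python
def clock_edges_delivered(enable_per_cycle, gated):
    """Count the clock edges that reach a module's flip-flops.

    `enable_per_cycle` holds one boolean per clock cycle: True when the
    module has work to do. Without gating every edge is delivered; with
    gating only the edges of enabled cycles get through.
    """
    if not gated:
        return len(enable_per_cycle)
    return sum(1 for enabled in enable_per_cycle if enabled)


# Hypothetical module that is active in 3 out of 8 cycles.
activity = [True, False, False, True, False, False, False, True]

ungated = clock_edges_delivered(activity, gated=False)
gated = clock_edges_delivered(activity, gated=True)

print(f"edges without gating: {ungated}")
print(f"edges with gating:    {gated}")
print(f"clock activity saved: {1 - gated / ungated:.0%}")
```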
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
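The saving from a second, lower supply can be estimated from the quadratic dependence of switching energy on Vdd. The sketch below assumes energy per switching event of C·Vdd² and ignores the overhead of level converters between voltage domains; the module list, capacitances, and voltages are hypothetical.

```python
def switching_energy(capacitance_f, v_dd):
    """Energy drawn per full switching event of a load: C * Vdd^2 (joules)."""
    return capacitance_f * v_dd ** 2


# (name, switched capacitance in farads, on critical path?) - hypothetical
MODULES = [
    ("multiplier", 2.0e-9, True),
    ("control", 1.0e-9, False),
    ("io", 1.0e-9, False),
]


def total_energy(modules, v_high, v_low):
    """Critical modules run at v_high, all other modules at v_low."""
    return sum(switching_energy(c, v_high if critical else v_low)
               for _name, c, critical in modules)


single_supply = total_energy(MODULES, v_high=1.8, v_low=1.8)
dual_supply = total_energy(MODULES, v_high=1.8, v_low=1.2)

print(f"single supply: {single_supply * 1e9:.2f} nJ")
print(f"dual supply:   {dual_supply * 1e9:.2f} nJ")
print(f"energy saved:  {1 - dual_supply / single_supply:.0%}")
```

The critical-path module keeps its full voltage, so timing is unchanged in this model, while the noncritical modules account for the whole saving.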
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
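[Figure: Power State Machine with states RUN (P = 400 mW), IDLE (P = 50 mW) and SLEEP (P = 0.16 mW); transition labels: ~10 µs, ~90 µs, 160 ms, "wait for interrupt", "wait for wake-up event"]
[Figure: Power Manager, in which an OBSERVER receives observations and workload information from the SYSTEM and a CONTROLLER issues commands to it]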
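The benefit of moving a system between such power states can be estimated from the per-state power figures. The sketch below reads the three values in the figure as 400 mW (RUN), 50 mW (IDLE) and 0.16 mW (SLEEP), ignores the energy and time spent in transitions, and uses a hypothetical split of time between the states.

```python
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(time_share):
    """Average power for the given share of time spent in each state.

    `time_share` maps state names to relative time weights; the weights
    are normalized, so they do not have to sum to one.
    """
    total = sum(time_share.values())
    return sum(STATE_POWER_MW[state] * share
               for state, share in time_share.items()) / total


# Hypothetical workload: busy 10% of the time, idle 20%, asleep 70%.
managed = average_power_mw({"RUN": 10, "IDLE": 20, "SLEEP": 70})
always_on = average_power_mw({"RUN": 100})

print(f"always running: {always_on:.1f} mW")
print(f"power managed:  {managed:.1f} mW")
```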
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: pie chart of the power breakdown (clock, memory, control, I/O, datapath) and pie chart of dynamic instruction statistics (compare op, logical op, data move, control flow, arithmetic op, others)]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First Generation 1971-78: behind the power curve
• Second Generation 1979-85: becoming "real" computers
• Third Generation 1985-89: challenging the "establishment"
• Fourth Generation 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
• Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
– 3500 transistors
– First microprocessor-based computer (Micral)
   Targeted at laboratory instrumentation
   Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
– Delivered in 1974
– 4800 transistors
– Performance < 0.2 MIPS
• Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
   $297, or $395 with case
   256 bytes of memory, expandable to 64K
   Keyboard and floppy
   100-line bus becomes S-100, first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
   Homebrew Computer Club
Intel 8086
• Introduced in 1978
– Performance < 0.5 MIPS
• New 16-bit architecture
– "Assembly language" compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
– Based on 8088, the 8-bit bus version of 8086
Second Generation 1979-85
• Becoming "real" computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
– First 32-bit architecture (initial 16-bit implementation)
– First flat 32-bit address (support for paging)
– General-purpose register architecture (loosely based on PDP-11)
• First implementation in 1979
– 68,000 transistors
– < 1 MIPS
• Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the "establishment"
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
• Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
– True from 1985 to the present
• Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to high clock rates
– More transistors
• Architectural ideas turn transistors into performance
– Responsible for about half the yearly performance growth
• Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
– CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching
   First multilevel cache: 1986
   Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
   early restart after cache miss (1992)
   nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
   prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
   superscalar: uses dynamic issue decision (HW driven)
   VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple issue microprocessors
• 1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out-of-order
– speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
– On-chip
– Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
• First multiple issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
• Implemented in 1991
– 1.3M transistors
– 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
– Instructions scheduled and executed out-of-order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in-flight)
– Maintain precise state by completing instructions in order
• Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
• Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software Techniques
– Static scheduling
– Static issue (i.e. VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
   Lower hardware complexity; more longer-range analysis; more machine dependence
• Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e. superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
   More stable performance; higher complexity; potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture - ILP and Memory Systems
[Figure: "ILP Mountain" with labels simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation; "Cache Mountain" with labels simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log axis 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers, and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistor count (log axis 1000 to 100,000,000) versus year, 1970 to 2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000, with eras labelled bit-level parallelism, instruction-level, and thread-level]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a (single) processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
Support for 2-4 threads, but expect to get only a 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures

Characteristic                                            Superscalar   Simultaneous multithreading   Chip multiprocessor
Number of CPUs                                            1             1                             8
CPU issue width                                           12            12                            2 per CPU
Number of threads                                         1             8                             1 per CPU
Architecture registers (for integer and floating point)   32            32 per thread                 32 per CPU
Physical registers (for integer and floating point)       32 + 256      256 + 256                     32 + 32 per CPU
Instruction window size                                   256           256                           32 per CPU
Branch predictor table size (entries)                     32768         32768                         8x4096
Return stack size                                         64 entries    64 entries                    8x8 entries
Instruction (I) and data (D) cache organization           1x8 banks     1x8 banks                     1 bank
I and D cache sizes                                       128 kbytes    128 kbytes                    16 kbytes per CPU
I and D cache associativities                             4-way         4-way                         4-way
I and D cache line sizes (bytes)                          32            32                            32
I and D cache access times (cycles)                       2             2                             1
Secondary cache organization                              1x8 banks     1x8 banks                     1x8 banks
Secondary cache size (Mbytes)                             8             8                             8
Secondary cache associativity                             4-way         4-way                         4-way
Secondary cache line size (bytes)                         32            32                            32
Secondary cache access time (cycles)                      5             5                             7
Secondary cache occupancy per access (cycles)             1             1                             1
Memory organization (no. of banks)                        4             4                             4
Memory access time (cycles)                               50            50                            50
Memory occupancy per access (cycles)                      13            13                            13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit"
In this context, starting from our positions and experience, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduate for the next 40 years?
We think not on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT
Discrete mathematics, not continuous math, is the underpinning of IT
It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast
Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields
More knowledge of the full spectrum will be fundamental
Engineering is global and is performed in a holistic business context
The engineer must design under constraints that include global, cultural and business contexts, and so must understand them
They too are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Moore's Law 2 (Rock's Law) - continue
For example: in 1986 Intel manufactured the 386, which counted 250 000 transistors, in fabs costing $200 million. In 1996 a $2 billion facility was needed to produce the Pentium processor, which counted 6 million transistors
Moore's Law 2 (Rock's Law)
The Cost of Semiconductor Tools Doubles Every Four Years
Machrone's Law
The PC you want to buy will always be $5000
Metcalfe's Law
A network's value grows proportionately to the number of its users squared
Wirth's Law
Software is slowing faster than hardware is accelerating
Performance and new technology generation
According to Moore's law, each new generation has approximately doubled logic circuit density and increased performance by about 40% while quadrupling memory capacity
The increase in components per chip comes from the following key factors
• The factor of two in component density comes from a 2^0.5 shrink in each lithography dimension (2^0.5 per x and 2^0.5 per y)
• An additional factor of 2^0.5 comes from an increase in chip area
• A final factor of 2^0.5 comes from device and circuit cleverness
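The factors listed above can be multiplied out as a quick check. The sketch below assumes the slide's factors are each 2^0.5 (the square root of two); the variable and function names are illustrative only.

```python
import math

# Per-generation growth factors, read from the slide as powers of 2^0.5.
lithography_density = math.sqrt(2) * math.sqrt(2)  # shrink in x times shrink in y
chip_area_growth = math.sqrt(2)                    # larger die
design_cleverness = math.sqrt(2)                   # device and circuit improvements


def components_per_chip_growth() -> float:
    """Combined growth in components per chip for one technology generation."""
    return lithography_density * chip_area_growth * design_cleverness


print(f"density factor from lithography alone: {lithography_density:.2f}")
print(f"combined components-per-chip factor:   {components_per_chip_growth():.2f}")
```

Under these assumptions lithography alone gives the factor of two in density, and the three contributions together give roughly a factor of four, which matches the quadrupling of memory capacity per generation mentioned in the same slide.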
Development in ICs
Semiconductor Industry Association Roadmap Summary for high-end Processors
| Specification / year | 1997 | 1999 | 2001 | 2003 | 2006 | 2009 | 2012 |
| Feature size (micron) | 0.25 | 0.18 | 0.15 | 0.13 | 0.1 | 0.07 | 0.05 |
| Supply voltage (V) | 1.8-2.5 | 1.5-1.8 | 1.2-1.5 | 1.2-1.5 | 0.9-1.2 | 0.6-0.9 | 0.5-0.6 |
| Transistors/chip (millions) | 11 | 21 | 40 | 76 | 200 | 520 | 1400 |
| DRAM bits/chip (mega) | 167 | 1070 | 1700 | 4290 | 17200 | 68700 | 275000 |
| Die size (mm2) | 300 | 340 | 385 | 430 | 520 | 620 | 750 |
| Global clock freq (MHz) | 750 | 1200 | 1400 | 1600 | 2000 | 2500 | 3000 |
| Local clock freq (MHz) | 750 | 1250 | 1500 | 2100 | 3500 | 6000 | 10000 |
| Maximum power/chip (W) | 70 | 90 | 110 | 130 | 160 | 170 | 175 |
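As a rough illustration of what the roadmap row for transistors per chip implies, the sketch below derives an average doubling time from the first and last entries (11 million in 1997, 1400 million in 2012). The helper name is hypothetical and the calculation assumes smooth exponential growth between the two endpoints.

```python
import math


def doubling_time_years(start_value: float, end_value: float, years: float) -> float:
    """Average doubling time for exponential growth between two data points."""
    doublings = math.log2(end_value / start_value)
    return years / doublings


# Roadmap endpoints for transistors per chip, in millions.
roadmap_doubling = doubling_time_years(11, 1400, 2012 - 1997)
print(f"implied doubling time: {roadmap_doubling:.2f} years")
```

With these two endpoints the implied doubling time comes out a little above two years, in line with the 22 to 24 month figure quoted earlier in the talk.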
[Figure: total transistors per chip (millions) versus technology (micron) and year, 1997-2012]
Clock Frequency Versus Year for Various Representative Machines
Limits in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range
For higher frequency clocking (>10 GHz) new ideas and new ways of designing digital systems are needed
[Figure: global and local clock frequency (MHz) versus technology (micron) and year, 1997-2012]
Intel's Microprocessors Clock Frequency
Technology Trends - Example
As an illustration of just how computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987 and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586 719 km/h and a fuel economy of 39 114.6 km/l. This is fast enough to cover the distance from the earth to the moon in under 39 min and to make the round trip on less than 10 liters of gasoline
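The compound-growth arithmetic of this example can be reproduced with a few lines of code. This is only a sketch: the function name is made up for illustration, the rates are read as 35% and 50% per year, and because the code does not round the intermediate factors its results differ slightly from the rounded figures on the slide.

```python
def improve(value: float, yearly_rate: float, years: int) -> float:
    """Apply a constant yearly improvement rate for a number of years."""
    return value * (1.0 + yearly_rate) ** years


# 1977 starting point: 150 km/h top speed, 10 km/l fuel economy.
speed_1987 = improve(150.0, 0.35, 10)
economy_1987 = improve(10.0, 0.35, 10)

# 1987 to 2000: thirteen more years at 50% per year.
speed_2000 = improve(speed_1987, 0.50, 13)
economy_2000 = improve(economy_1987, 0.50, 13)

print(f"1987: {speed_1987:,.0f} km/h, {economy_1987:,.0f} km/l")
print(f"2000: {speed_2000:,.0f} km/h, {economy_2000:,.0f} km/l")
```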
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law
The National Technology Roadmap for Semiconductors (NTRS) and most recently the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm
Ideally, processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7
With such scaling, typical improvement figures are the following
• 1.4-1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and in integration (from ~10K to ~10M devices) in 25 years
Microarchitecture attempts to increase both IPC and frequency
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction and out-of-order execution can increase instructions per cycle (IPC)
Pipelining, as a microarchitecture idea, helps to increase frequency
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program
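The three levers named above (dynamic instruction count, IPC and frequency) combine in the usual execution-time relation: time = instructions / (IPC x frequency). The sketch below illustrates that relation; the function name and all workload numbers are hypothetical and chosen only to show how each lever acts.

```python
def execution_time_s(instruction_count: float, ipc: float, frequency_hz: float) -> float:
    """Execution time = dynamic instructions / (instructions per cycle * cycles per second)."""
    cycles = instruction_count / ipc
    return cycles / frequency_hz


# Hypothetical baseline: 1e9 dynamic instructions, IPC of 1.0, 1 GHz clock.
baseline = execution_time_s(1e9, 1.0, 1e9)

# Microarchitecture raises IPC, pipelining raises frequency,
# ISA and compiler reduce the dynamic instruction count.
better_ipc = execution_time_s(1e9, 2.0, 1e9)
higher_clock = execution_time_s(1e9, 1.0, 2e9)
fewer_instructions = execution_time_s(0.8e9, 1.0, 1e9)

print(baseline, better_ipc, higher_clock, fewer_instructions)
```

The point of writing it this way is that the three factors multiply, so a gain in one (for example frequency through deeper pipelining) is only useful if it does not cost as much in another (for example IPC).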
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages
With frequencies higher than 1 GHz, more than 20 pipeline stages are used
[Figure: relative improvement in frequency, CPI, performance and power versus pipeline depth]
Performance of memory and CPU
Memory in a computer system is hierarchically organized
In 1980 microprocessors were often designed without caches
Nowadays microprocessors often come with two levels of caches
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed
Relative processor/memory speed
Types of Memories
MOS memories
• RAMs (DRAM, SRAM): VOLATILE, power off - contents lost
• ROMs (ROM, EPROM, EEPROM, FLASH): NON VOLATILE, power off - contents kept
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2-4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
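The "three out of every four cycles" remark follows from the measured cycles per instruction, reading the measurement as 4.2 to 4.5 CPI. A minimal sketch, assuming that every cycle which retires anything retires exactly one instruction (a simplifying assumption of this example; both processors can retire more than one per cycle, which would only raise the idle share):

```python
def idle_cycle_fraction(cpi: float, retired_per_busy_cycle: float = 1.0) -> float:
    """Fraction of cycles that retire nothing, for a given cycles-per-instruction figure."""
    busy_fraction = 1.0 / (cpi * retired_per_busy_cycle)
    return 1.0 - busy_fraction


low = idle_cycle_fraction(4.2)
high = idle_cycle_fraction(4.5)
print(f"idle cycles: {low:.0%} to {high:.0%}")
```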
An anecdote - continue
If a CPU chip today needs to move 2 GBytes/s (say 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns
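The bandwidth figures in this anecdote can be checked directly, since one byte per nanosecond equals one GByte/s. The helper name below is illustrative only.

```python
def pin_bandwidth_gbytes_per_s(bytes_per_transfer: float, period_ns: float) -> float:
    """Bandwidth in GBytes/s; bytes per nanosecond is numerically the same unit."""
    return bytes_per_transfer / period_ns


today = pin_bandwidth_gbytes_per_s(16, 8)  # 16 bytes every 8 ns

# Twice the clock rate, twice the issue width, two instruction streams.
future_demand = today * 2 * 2 * 2

# The same demand expressed as a transfer pattern: 32 bytes every 2 ns.
future_pattern = pin_bandwidth_gbytes_per_s(32, 2)

print(today, future_demand, future_pattern)
```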
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger but slower and cheaper cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
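To see what these hit rates buy, the figures can be plugged into the standard average-memory-access-time formula. This is a sketch under stated assumptions: the hit rates are read as 95% and 50%, a main-memory access is taken as about 100 cycles as quoted in the memory-system slide, and the specific cycle counts (2 for the first level, 8 for the second) are picked from within the ranges given.

```python
def average_access_cycles(l1_cycles: float, l1_hit_rate: float,
                          l2_cycles: float, l2_hit_rate: float,
                          memory_cycles: float) -> float:
    """Average access time for a two-level cache in front of main memory."""
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * memory_cycles)


with_caches = average_access_cycles(2, 0.95, 8, 0.50, 100)
without_caches = 100.0  # every access goes to main memory

print(f"average access: {with_caches:.1f} cycles versus {without_caches:.0f} without caches")
```

With these example numbers the average access costs about five cycles instead of a hundred, which is why the hierarchy structure has such a large effect on performance.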
Memory Hierarchy Impact on Performance
Off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger and heuristics are used to make sure the cache contains portions of memory that are most likely to be used in the near future of program execution
As conclusion concerning memory - problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems: caches, busses, bandwidth and latency
As conclusion concerning memory - problems continue
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions in addition to technology-level solutions to achieve higher performance, the gap might begin to close
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System level parameters most affect performance
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM) - new type of memory
Magnetic RAM - MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
The 10 Gbit devices consist of 10 billion carbon nanotubes that are 1 nm - just a few thousand atoms - in diameter on a silicon wafer
MRAM is nonvolatile; it has the fast read-and-write performance of static RAM (SRAM)
MRAM as Universal Memory
MRAM can replace many other types of memory including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM)
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule (nm) and density (Gb) of FLASH (SLC, MLC 2 bits/cell, MLC 3 bits/cell) and DRAM, 2003-2010]
Surpassing the Prediction from Moore's Law - DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates the doubling of NAND Flash memory density every year
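The difference between the two doubling periods compounds quickly. The sketch below compares them over an arbitrary six-year span, reading the Moore's Law figure as a doubling every 1.5 years; the function name and the choice of six years are assumptions of this example.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Density growth after a number of years for a given doubling period."""
    return 2.0 ** (years / doubling_period_years)


span_years = 6
moore_growth = growth_factor(span_years, 1.5)  # doubling every 1.5 years
flash_growth = growth_factor(span_years, 1.0)  # doubling every year

print(f"after {span_years} years: x{moore_growth:.0f} (Moore) versus x{flash_growth:.0f} (NAND Flash)")
```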
[Figure: density (Gb) of FLASH (MLC 2 bits/cell and 3 bits/cell, 2-fold density per year) and DRAM compared with Moore's Law, 2000-2010]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will keep leading the overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap - cont
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
Power consumption
• During 1995 the energy consumption of all PC machines installed in the USA was 60·10^6 MWh
• During 2000 the energy consumption of all PC machines installed in the USA was 10% of the total energy production
• For 2015 one expects that the energy consumption of all PC machines will be 15% greater than in 1995, or 69·10^6 MWh
Typical Low-Power Applications
• battery operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers" (Kuroda-Sakurai 95)
"By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced"
Power dissipation in time
[Figure: power (W) versus year, 1980-1995, rising about x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage (V), power per chip (W) and VDD current (A) versus year, 1998-2014]
International Technology Roadmap for Semiconductors 1999 update, sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA), Electronic Industries Association of Japan (EIAJ), Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: Your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 - capacitive switching power (dynamic - dominant)
P2 - short circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce Switching Activity
• Conditional clock
• Conditional precharge
• Switching-off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flop
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
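The quadratic effect of the supply voltage can be made concrete with the switching-power term, activity x load capacitance x Vdd squared x clock frequency. The sketch below is illustrative only: the function name and every operating-point value (activity 0.2, 1 nF of switched load, 2.5 V, 100 MHz) are hypothetical.

```python
def switching_power_w(activity: float, load_capacitance_f: float,
                      vdd_v: float, f_clk_hz: float) -> float:
    """Capacitive switching power: activity * C_L * Vdd^2 * f_clk."""
    return activity * load_capacitance_f * vdd_v ** 2 * f_clk_hz


# Hypothetical operating point.
base = switching_power_w(0.2, 1e-9, 2.5, 100e6)

half_vdd = switching_power_w(0.2, 1e-9, 1.25, 100e6)   # supply voltage halved
half_load = switching_power_w(0.2, 0.5e-9, 2.5, 100e6)  # load capacitance halved
half_activity = switching_power_w(0.1, 1e-9, 2.5, 100e6)  # switching activity halved

print(f"base: {base:.3f} W")
print(f"half Vdd: x{half_vdd / base:.2f}, half load: x{half_load / base:.2f}, "
      f"half activity: x{half_activity / base:.2f}")
```

Halving the capacitance or the activity halves the power, while halving the supply voltage cuts it to a quarter, which is why voltage reduction is the most effective of the three.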
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V)]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is
Decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low-activity conditions for various parts of the circuit and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
General Approaches to Reduce Power
a) Reduction in fCLK is an option acceptable when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way for power reduction since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product pt CL is called the average switched capacitance per cycle, and the main directions for reducing this capacitance are at the system, architectural, RTL, circuit or technology level
Low Power and Low Energy System Design
The design of low power circuits can be tackled at different levels, from system to technology (higher impact and more options towards the system level)
• System Level: design partitioning, power down
• Algorithm Level: complexity, concurrency, locality, regularity, data representation
• Architecture Level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit Level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/Device Level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a main PLL generating fCLK drives four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4) that produce the local frequencies f11 to f41]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based and DLL-based clock generation for a digital system; blocks include phase detector, current pump, loop filter, VCO, divider by N, VCDL and clock distribution with frequency multiplier]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating (a latch and the enable signal produce the gated clock for the target flip-flops) and clock distribution (a PLL clock generator feeds modules A, B, C and D through Enable_A, Enable_B and Enable_C; modules are activated or deactivated)]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
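A small behavioural model shows why suppressing the clock saves switching activity. This is a simplified sketch, not a circuit description: the clock and enable are sampled lists of 0/1 values, the gate is a plain AND, the latch that normally filters glitches on the enable signal is not modeled, and the waveforms are made up for illustration.

```python
def gated_clock(clock: list[int], enable: list[int]) -> list[int]:
    """AND each clock sample with the enable signal (idealized gate, no latch modeled)."""
    return [c & e for c, e in zip(clock, enable)]


def rising_edges(signal: list[int]) -> int:
    """Count 0 -> 1 transitions, i.e. the clock edges the target flip-flops see."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)


# Eight clock cycles, two samples per cycle (low, high).
clock = [0, 1] * 8
# The module is needed only during the first two and the last two cycles.
enable = [1] * 4 + [0] * 8 + [1] * 4

gated = gated_clock(clock, enable)
print(f"edges without gating: {rising_edges(clock)}, with gating: {rising_edges(gated)}")
```

In this toy waveform the gated module sees half as many clock edges, so the switching activity of its clock network and flip-flops drops accordingly.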
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltage on the chip, as a less aggressive approach, is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
[Figure: power manager (observer and controller exchanging observations and commands with the system, driven by workload information) and power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt) and SLEEP (P = 0.16 mW, wait for wake-up event); transition times of about 10 µs, 90 µs and 160 ms]
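The benefit of such a power state machine can be estimated from the state powers shown in the figure (RUN 400 mW, IDLE 50 mW, SLEEP 0.16 mW). The sketch below computes the time-weighted average power for a given residency in each state. The function name and the workload split are hypothetical, and transition times and transition energy are ignored for simplicity.

```python
# State powers in milliwatts, taken from the power state machine figure.
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(time_in_state_s: dict[str, float]) -> float:
    """Time-weighted average power; transition costs are not modeled."""
    total_time = sum(time_in_state_s.values())
    energy_mj = sum(STATE_POWER_MW[state] * t for state, t in time_in_state_s.items())
    return energy_mj / total_time


# Hypothetical workload: busy 10% of the time.
always_on = average_power_mw({"RUN": 10.0})
idle_when_free = average_power_mw({"RUN": 1.0, "IDLE": 9.0})
sleep_when_free = average_power_mw({"RUN": 1.0, "SLEEP": 9.0})

print(f"{always_on:.1f} mW, {idle_when_free:.1f} mW, {sleep_when_free:.1f} mW")
```

A real power manager also has to weigh the wake-up latency of the deeper state against the expected idle time, which this sketch deliberately leaves out.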
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: pie charts of the power breakdown (clock, memory, control, I/O, datapath) and of dynamic instruction statistics (arithmetic, data move, control flow, compare, logical and other operations)]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First generation: 1971-78
  - Behind the power curve
• Second generation: 1979-85
  - Becoming "real" computers
• Third generation: 1985-89
  - Challenging the "establishment"
• Fourth generation: 1990-
  - Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation: 1971-78
• Getting enough bits and transistors
• Transistor counts < 50000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3500 transistors
  - First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
  - Delivered in 1974
  - 4800 transistors
  - Performance < 0.2 MIPS
• Used in Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
    Homebrew Computer Club
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces PC
  - Based on 8088, the 8-bit bus version of 8086
Second Generation: 1979-85
• Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs and PCs based on microprocessors
• Transistors > 50000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture (initial 16-bit implementation)
  - First flat 32-bit address; support for paging
  - General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
  - 68000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation: 1985-89
• Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125000 transistors
  - 5-8 MIPS
Fourth Generation: 1990-
• Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985-present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors
• Architectural ideas turn transistors into performance
  - Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
  - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
    early restart after cache miss (1992)
    nonblocking caches: continue during a cache miss (1994)
  - Cache aware combos: computers, compilers, code writers
    prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  – Overlapping execution in a pipeline
  – Issuing multiple instructions per clock
    superscalar: uses dynamic issue decision (HW driven); VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  – determine parallelism dynamically
  – execute instructions out of order
  – speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  – On-chip
  – Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  – Deep pipeline
  – 1.4M transistors
  – Initially 100 MHz
  – > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  – 2 instructions/clock
  – Dual issue mode
  – Novel push pipeline
  – Novel cache bypass
• Implemented in 1991
  – 1.3M transistors
  – 50 MIPS
• Used primarily as an attached processor (e.g., graphics)
MIPS R10000
• First speculative processor
  – Instructions scheduled and executed out of order
  – Up to 4 instructions can complete per clock
  – Window of 32 instructions (up to 32 in flight)
  – Maintains precise state by completing instructions in order
• Implemented in 1996
  – 6.8M transistors
  – 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  – Uses a compiler-centric approach while avoiding its disadvantages
  – Parallelism demarcated by the compiler
  – Many special instructions & features for exploiting ILP in the compiler
• Itanium
  – First implementation (2001)
  – 25M transistors
  – 800 MHz
  – 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques (lower hardware complexity, more longer-range analysis, more machine dependence)
  – Static scheduling
  – Static issue (i.e., VLIW)
  – Static branch prediction
  – Alias/pointer analysis
  – Static speculation
• Hardware techniques (more stable performance, higher complexity, potential clock rate impact)
  – Dynamic scheduling
  – Dynamic issue (i.e., superscalar)
  – Dynamic branch prediction
  – Dynamic disambiguation
  – Dynamic speculation
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: "ILP mountain" (simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation) and "cache mountain" (simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching)]
My view:
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers, and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970 to 2005, for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000; successive eras are labeled bit-level, instruction-level, and thread-level parallelism]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreading (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyper-Threading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
They support 2-4 threads, but expect only about a 1.3x improvement in throughput.
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option when moving to a new process technology, such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa-2010 Microprocessor
4 – 8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic                                           Superscalar   Simultaneous multithreading   Chip multiprocessor
Number of CPUs                                           1             1                             8
CPU issue width                                          12            12                            2 per CPU
Number of threads                                        1             8                             1 per CPU
Architecture registers (integer and floating point)      32            32 per thread                 32 per CPU
Physical registers (integer and floating point)          32 + 256      256 + 256                     32 + 32 per CPU
Instruction window size                                  256           256                           32 per CPU
Branch predictor table size (entries)                    32,768        32,768                        8 x 4,096
Return stack size                                        64 entries    64 entries                    8 x 8 entries
Instruction (I) and data (D) cache organization          1 x 8 banks   1 x 8 banks                   1 bank
I and D cache sizes                                      128 Kbytes    128 Kbytes                    16 Kbytes per CPU
I and D cache associativities                            4-way         4-way                         4-way
I and D cache line sizes (bytes)                         32            32                            32
I and D cache access times (cycles)                      2             2                             1
Secondary cache organization                             1 x 8 banks   1 x 8 banks                   1 x 8 banks
Secondary cache size (Mbytes)                            8             8                             8
Secondary cache associativity                            4-way         4-way                         4-way
Secondary cache line size (bytes)                        32            32                            32
Secondary cache access time (cycles)                     5             5                             7
Secondary cache occupancy per access (cycles)            1             1                             1
Memory organization (no. of banks)                       4             4                             4
Memory access time (cycles)                              50            50                            50
Memory occupancy per access (cycles)                     13            13                            13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing, and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now: some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current, cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching,
• distance learning,
• electronic books, and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Moore's Law 2 (Rock's Law)
The cost of semiconductor tools doubles every four years.
Machrone's Law
The PC you want to buy will always be $5000.
Metcalfe's Law
A network's value grows in proportion to the square of the number of its users.
Wirth's Law
Software is slowing faster than hardware is accelerating.
Performance and new technology generation
According to Moore's law, each new generation has approximately doubled logic circuit density and increased performance by about 40%, while quadrupling memory capacity.
The increase in components per chip comes from the following key factors:
• A factor of two in component density comes from a 2^0.5 shrink in each lithography dimension (2^0.5 in x and 2^0.5 in y).
• An additional factor of 2^0.5 comes from an increase in chip area.
• A final factor of 2^0.5 comes from device and circuit cleverness.
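The factors above multiply rather than add. A minimal sketch of that arithmetic follows, assuming each quoted factor is the square root of two (the decimal point in "2^0.5" appears to have been lost in the slide text); the function and variable names are illustrative only.

```python
import math


def components_per_chip_growth():
    """Combine the per-generation factors quoted on the slide."""
    shrink_x = math.sqrt(2)     # lithography shrink in x
    shrink_y = math.sqrt(2)     # lithography shrink in y
    chip_area = math.sqrt(2)    # larger die
    cleverness = math.sqrt(2)   # device and circuit cleverness

    density = shrink_x * shrink_y              # components per unit area
    total = density * chip_area * cleverness   # components per chip
    return density, total


density, total = components_per_chip_growth()
print(f"density factor: {density:.2f}, components-per-chip factor: {total:.2f}")
```

Under these assumptions the density factor is two and the overall factor is four, which matches the doubling of logic density and quadrupling of memory capacity stated above.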
Development in ICs
Semiconductor Industry Association Roadmap Summary for high-end Processors
Specification / year            1997     1999     2001     2003     2006     2009     2012
Feature size (micron)           0.25     0.18     0.15     0.13     0.10     0.07     0.05
Supply voltage (V)              1.8-2.5  1.5-1.8  1.2-1.5  1.2-1.5  0.9-1.2  0.6-0.9  0.5-0.6
Transistors/chip (millions)     11       21       40       76       200      520      1400
DRAM bits/chip (mega)           167      1070     1700     4290     17200    68700    275000
Die size (mm^2)                 300      340      385      430      520      620      750
Global clock freq. (MHz)        750      1200     1400     1600     2000     2500     3000
Local clock freq. (MHz)         750      1250     1500     2100     3500     6000     10000
Maximum power/chip (W)          70       90       110      130      160      170      175
[Figure: total transistors per chip (millions, 0 to 1600) versus technology (0.25 to 0.05 micron) and year (1997 to 2012)]
Clock Frequency Versus Year for Various Representative Machines
Limits in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range.
For higher-frequency clocking (> 10 GHz), new ideas and new ways of designing digital systems are needed.
[Figure: global and local clock frequency (MHz, 0 to 12,000) versus technology (0.25 to 0.05 micron) and year (1997 to 2012)]
Intel's Microprocessors: Clock Frequency
Technology Trends: Example
As an illustration of just how quickly computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly.
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987, and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l.
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in about 39 min, and to make the round trip on less than 20 liters of gasoline.
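The compound-growth arithmetic in this example can be checked with a short script. This is only a sketch: the earth-moon distance of 384,400 km is an assumed mean value that does not appear on the slide, and the results differ slightly from the slide's figures because the slide rounds the intermediate factors.

```python
def improve(value, yearly_rate, years):
    """Compound a starting value at a fixed yearly improvement rate."""
    return value * (1.0 + yearly_rate) ** years


EARTH_MOON_KM = 384_400  # assumed mean distance, not taken from the slide

# 1977 -> 1987: ten years at 35% per year
speed_1987 = improve(150.0, 0.35, 10)    # km/h
economy_1987 = improve(10.0, 0.35, 10)   # km/l

# 1987 -> 2000: thirteen years at 50% per year
speed_2000 = improve(speed_1987, 0.50, 13)
economy_2000 = improve(economy_1987, 0.50, 13)

minutes_to_moon = EARTH_MOON_KM / speed_2000 * 60
round_trip_liters = 2 * EARTH_MOON_KM / economy_2000

print(f"1987: {speed_1987:,.0f} km/h, {economy_1987:,.0f} km/l")
print(f"2000: {speed_2000:,.0f} km/h, {economy_2000:,.0f} km/l")
print(f"moon in {minutes_to_moon:.1f} min, round trip on {round_trip_liters:.1f} l")
```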
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law.
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available.
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm.
Ideally, process technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7.
With such scaling, typical improvement figures are the following:
• 1.4 – 1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
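A rough sketch of where these figures come from, assuming a linear scale factor of 0.7 and a 1.35x voltage reduction per generation. It also assumes that switched capacitance tracks the linear dimension and that switching energy per event is C·V^2; both are simplifications that the slide does not state.

```python
def scale_generation(linear=0.7, voltage_drop=1.35):
    """Idealized one-generation scaling estimates (simplified model)."""
    area = linear ** 2                        # transistor footprint
    capacitance = linear                      # assumed: C tracks linear size
    energy = capacitance / voltage_drop ** 2  # C * V^2 per switching event
    return area, energy


area, energy = scale_generation()
print(f"area factor: {area:.2f} (about {1 / area:.1f}x smaller)")
print(f"switching energy factor: {energy:.2f} (about {1 / energy:.1f}x lower)")
```

Under this simplified model the area halves, and the energy per switching event drops by roughly 2.6x, the same order as the "three times" quoted above.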
Process Technology and Microprocessors
Process technology is the most important technology driving the microprocessor industry.
It is characterized by 1000-fold growth in frequency (from 1 MHz to 1 GHz) and in integration (from ~10K to ~10M devices) in 25 years.
Microarchitecture attempts to increase both IPC and frequency.
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction, and out-of-order execution can increase instructions per cycle (IPC).
Pipelining, as a microarchitecture idea, helps to increase frequency.
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program.
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages.
With frequencies higher than 1 GHz, more than 20 pipeline stages are used.
[Figure: relative improvement in frequency, CPI, performance, and power versus pipeline depth]
Performance of memory and CPU
Memory in a computer system is hierarchically organized.
In 1980, microprocessors were often designed without caches.
Nowadays, microprocessors often come with two levels of caches.
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986, and 55% per year since 1987.
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed.
Relative processor/memory speed
Types of Memories
MOS memories
– RAMs: DRAM, SRAM (volatile: contents lost at power-off)
– ROMs: ROM, EPROM, EEPROM, FLASH (non-volatile: contents kept at power-off)
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 – 4.5 CPU cycles per instruction retired.
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed.
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse.
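The "three out of every four cycles" remark follows from the measured cycles per instruction. A small sketch, assuming the measurements are 4.2 to 4.5 CPI: since every cycle that retires something retires at least one instruction, the number of such cycles cannot exceed the instruction count, so at least 1 - 1/CPI of all cycles retire nothing. The function name is illustrative.

```python
def min_idle_fraction(cpi):
    """Lower bound on the fraction of cycles that retire zero instructions.

    Cycles that retire anything number at most one per instruction,
    so at least (1 - 1/CPI) of all cycles retire nothing.
    """
    return 1.0 - 1.0 / cpi


for cpi in (4.2, 4.5):
    print(f"CPI {cpi}: at least {min_idle_fraction(cpi):.0%} of cycles retire nothing")
```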
An anecdote (continued)
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams.
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy.
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns.
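The bandwidth figures in this anecdote can be reproduced directly. The sketch below assumes 1 GByte = 10^9 bytes, so that bytes per nanosecond equals GBytes per second; the helper name is illustrative.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer, ns_per_transfer):
    """Pin bandwidth; bytes per nanosecond equals GBytes/s (10^9 bytes)."""
    return bytes_per_transfer / ns_per_transfer


today = bandwidth_gbytes_per_s(16, 8)      # 16 bytes every 8 ns

# Twice the clock rate, twice the issue width, two instruction streams.
future = today * 2 * 2 * 2

# The same requirement expressed as 32 bytes every 2 ns.
future_check = bandwidth_gbytes_per_s(32, 2)

print(f"today: {today} GB/s, future: {future} GB/s, check: {future_check} GB/s")
```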
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance.
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components.
Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data.
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory.
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip.
The first level is ~32 – 128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses.
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles.
A cache miss that eventually has to go to main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
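To see why the hierarchy matters so much, the cache figures quoted on these slides can be combined into an average access time. This is a minimal sketch assuming a 95% first-level hit rate, a 50% second-level hit rate on first-level misses, and a 100-cycle main memory; it uses the standard average-memory-access-time formula, and the function name is illustrative.

```python
def average_access_cycles(l1_cycles, l1_hit, l2_cycles, l2_hit, mem_cycles):
    """Average memory access time for a two-level cache hierarchy."""
    l1_miss = 1.0 - l1_hit
    l2_miss = 1.0 - l2_hit
    # Every access pays L1; L1 misses pay L2; L2 misses also pay main memory.
    return l1_cycles + l1_miss * (l2_cycles + l2_miss * mem_cycles)


best = average_access_cycles(2, 0.95, 6, 0.50, 100)    # fast end of the ranges
worst = average_access_cycles(3, 0.95, 10, 0.50, 100)  # slow end of the ranges

print(f"average access: {best:.1f} to {worst:.1f} cycles (vs. 100 without caches)")
```

Under these assumptions the average access costs roughly 5 to 6 cycles, even though an access that reaches main memory costs about 100.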
In conclusion: memory problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems: caches, buses, bandwidth, and latency.
In conclusion: memory problems (continued)
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change.
b) Burst width (data access granularity) can effect a 15% performance change.
c) Magnetic RAM (MRAM): a new type of memory.
Magnetic RAM (MRAM) or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology.
A nanotechnology RAM device consists of tiny carbon nanotubes.
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
A 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter on a silicon wafer.
MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM).
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM).
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule (nm, 0 to 120) and density (Gb, log scale, 0.1 to 100) of DRAM and FLASH versus year, 2003 to 2010; design rules of 100 nm, 90 nm, 70 nm, and 55 nm; densities from 512 Mb to 16 Gb; FLASH shown as SLC, MLC (2 bits/cell), and MLC (3 bits/cell)]
Surpassing the Prediction from Moore's Law: DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
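The gap between the two growth models widens quickly. The sketch below compares them, assuming the 1.5-year doubling period commonly attributed to Moore's Law and the one-year doubling claimed here for NAND Flash; the function name is illustrative.

```python
def density_growth(years, doubling_period_years):
    """Growth factor after `years` for a given doubling period."""
    return 2.0 ** (years / doubling_period_years)


for years in (3, 6, 9):
    moore = density_growth(years, 1.5)
    flash = density_growth(years, 1.0)
    print(f"after {years} years: Moore x{moore:.0f}, yearly doubling x{flash:.0f}")
```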
[Figure: density (Gb, log scale, 0.1 to 100) versus year, 2000 to 2010, for DRAM and for FLASH (twofold density increase per year; MLC with 2 bits/cell and 3 bits/cell), compared with the Moore's Law trend]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will continue to lead overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore's Law.
Overall memory prediction roadmap (continued)
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline: Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point?
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60·10^6 MWh.
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production.
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69·10^6 MWh.
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai '95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation over time
[Figure: power (W, log scale, 0.01 to 100) versus year, 1980 to 1995; power grows x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage (V, 0 to 2.5), power per chip (W, 0 to 200), and VDD current (A, 0 to 500) versus year, 1998 to 2014]
International Technology Roadmap for Semiconductors 1999 update sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA) Electronic Industries Association of Japan (EIAJ) Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
Your car starter
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits, plus a minor static term, are

$$P_{avg} = P_1 + P_2 + P_3 + P_4, \qquad P_1 = p_t C_L V_{dd}^2 f_{CLK}, \quad P_2 = I_{sc} V_{dd}, \quad P_3 = I_{leakage} V_{dd}$$

where
P1 – capacitive switching power (dynamic, dominant)
P2 – short-circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
Research Efforts in Low-Power Design
$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement.
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed.
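The quadratic effect of supply voltage can be illustrated with the switching-power expression used in this section. In the sketch below the operating point (activity factor 0.5, 1 nF switched capacitance, 3.3 V, 100 MHz) is purely illustrative and is not taken from the slides.

```python
def switching_power(p_t, c_load, v_dd, f_clk):
    """Capacitive switching power: P = p_t * C_L * Vdd^2 * f_clk (watts)."""
    return p_t * c_load * v_dd ** 2 * f_clk


# Illustrative operating point (assumed values).
base = switching_power(0.5, 1e-9, 3.3, 100e6)

half_voltage = switching_power(0.5, 1e-9, 1.65, 100e6)    # Vdd halved
half_load = switching_power(0.5, 0.5e-9, 3.3, 100e6)      # C_L halved
half_activity = switching_power(0.25, 1e-9, 3.3, 100e6)   # p_t halved

print(f"baseline: {base:.3f} W")
print(f"half Vdd:      x{half_voltage / base:.2f}")
print(f"half C_L:      x{half_load / base:.2f}")
print(f"half activity: x{half_activity / base:.2f}")
```

Halving the load capacitance or the switching activity halves the power, while halving the supply voltage cuts it to a quarter.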
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V), 0.6 to 5.0]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is to decrease the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
The terms of the switching-power expression suggest three options:
a) Reduction in $f_{CLK}$ is an option, acceptable when some components may be idle or low-active during operation.
b) Reduction in $V_{dd}$ is the most effective way to reduce power, since the power is proportional to the square of $V_{dd}$. The problem with reducing $V_{dd}$ is that it leads to an increase in circuit delay.
c) The product $p_t C_L$ is called the average switched capacitance per cycle; the main ways of reducing this capacitance act at the system, architectural, RTL, circuit, or technology level.
General Approaches to Reduce Power
Low Power and Low Energy System Design
(higher levels offer higher impact and more options)
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology.
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach, and one that is attracting more attention.
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL generating fCLK and feeding four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), which produce the local frequencies f11, f12, f21, f31, f32, and f41]
Energy Minimization Using Multiple Frequency
[Figure, PLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO, and divider by N; regulated voltage, clock distribution & frequency multiplier, and logic of the digital system]
[Figure, DLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, and a voltage-controlled delay line (VCDL, delay TVCDL) with delay cells DC1 to DCn, producing f1, 2f1, ..., nf1 for the digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure, clock gating: a latch samples the enable signal and gates the clock, producing the gated clock that drives the target flip-flops (DFF with inputs D and C and output Q); the waveforms show activated and deactivated intervals]
[Figure, clock distribution: a PLL (clock generator) drives Clk to blocks A, B, C, and D, with gating signals Enable_A, Enable_B, and Enable_C]
The use of gated clocks is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
[Figure: power manager and power state machine. Power manager: an observer gathers workload information from observations of the system, and a controller issues commands to it. Power state machine: RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt) and SLEEP (P = 0.16 mW, wait for wake-up event), with transition times of ~10 µs, ~90 µs and 160 ms]
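To see why the power state machine matters, the following minimal sketch computes a time-weighted average power from the three state powers in the figure. It assumes the SLEEP label reads 0.16 mW (the decimal point appears to have been lost), ignores transition time and energy, and uses hypothetical workload shares.

```python
# State powers in mW; 0.16 mW for SLEEP assumes the figure label "016" lost its decimal point.
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}

def average_power_mw(time_share):
    """Time-weighted average power for a dict of state -> fraction of time."""
    total = sum(time_share.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("time shares must sum to 1")
    return sum(STATE_POWER_MW[state] * share for state, share in time_share.items())

# Hypothetical workload that is busy 10% of the time.
idle_only = average_power_mw({"RUN": 0.1, "IDLE": 0.9})
with_sleep = average_power_mw({"RUN": 0.1, "IDLE": 0.1, "SLEEP": 0.8})
print(idle_only, with_sleep)
```

A real power manager must also weigh the wake-up latency of SLEEP against the savings, which this sketch deliberately leaves out.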
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown: clock, memory, control, I/O and datapath (slice labels 12, 16, 25 and 43 percent). Dynamic instruction statistics: compare op, logical op, others, data move, control flow and arithmetic op (slice labels 15, 13, 5, 1, 23 and 43 percent)]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline - Microprocessors' Generations
• First generation 1971-78
  - Behind the power curve
• Second generation 1979-85
  - Becoming "real" computers
• Third generation 1985-89
  - Challenging the "establishment"
• Fourth generation 1990-
  - Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3500 transistors
  - First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
  - Delivered in 1974
  - 4800 transistors
  - Performance < 0.2 MIPS
• Used in Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
    Homebrew Computer Club
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
  - Based on the 8088, the 8-bit bus version of the 8086
Second Generation 1979-85
• Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture (initial 16-bit implementation)
  - First flat 32-bit address
    Support for paging
  - General-purpose register architecture
    Loosely based on PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985-present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors
• Architectural ideas turn transistors into performance
  - Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
  - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128KB to 4-16 MB
• Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
    Early restart after cache miss (1992)
    Nonblocking caches: continue during a cache miss (1994)
  - Cache aware combos: computers, compilers, code writers
    Prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock
    Superscalar: uses dynamic issue decision (HW driven)
    VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple issue microprocessors
• 1995: sophisticated dynamic schemes
  - Determine parallelism dynamically
  - Execute instructions out-of-order
  - Speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  - On-chip
  - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
• First multiple issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out-of-order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in-flight)
  - Maintain precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Use compiler centric approach while avoiding disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques
  - Static scheduling
  - Static issue (i.e. VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
  Lower hardware complexity; more longer-range analysis; more machine dependence
• Hardware techniques
  - Dynamic scheduling
  - Dynamic issue (i.e. superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
  More stable performance; higher complexity; potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture - ILP and Memory Systems
[Figure: two "mountains" of successively harder techniques. ILP mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970 to 2005, for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000, spanning the eras of bit-level, instruction-level and thread-level (?) parallelism]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The more pronounced are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
Support for 2-4 threads, but expect to get only 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique to obtain more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education - not just the undergraduate curriculum - organized to support today's graduate for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change - so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
But, as we said earlier, engineering is changing
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Machrone's Law
The PC you want to buy will always be $5000
Metcalfe's Law
A network's value grows proportionately to the number of its users squared
Wirth's Law
Software is slowing faster than hardware is accelerating
Performance and new technology generation
According to Moore's law, each new generation has approximately doubled logic circuit density and increased performance by about 40%, while quadrupling memory capacity
The increase in components per chip comes from the following key factors:
• The factor of two in component density comes from a 2^0.5 shrink in each lithography dimension (2^0.5 per x and 2^0.5 per y)
• An additional factor of 2^0.5 comes from an increase in chip area
• A final factor of 2^0.5 comes from device and circuit cleverness
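The factors listed above can be multiplied out as a quick check. This is a minimal sketch that reads each "205" in the slide as 2^0.5 (the square root of two); the variable names are illustrative.

```python
import math

shrink_per_dimension = math.sqrt(2)                            # linear shrink in x and in y
density_gain = shrink_per_dimension * shrink_per_dimension     # components per unit area
area_gain = math.sqrt(2)                                       # larger chip area
cleverness_gain = math.sqrt(2)                                 # device and circuit cleverness

total_gain = density_gain * area_gain * cleverness_gain
print(density_gain, total_gain)
```

Under this reading the density doubles and the total component count per chip grows fourfold per generation, which matches the quadrupling of memory capacity quoted in the slide.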
Development in ICs
Semiconductor Industry Association Roadmap Summary for high-end Processors
Specification / year | 1997 | 1999 | 2001 | 2003 | 2006 | 2009 | 2012
Feature size (micron) | 0.25 | 0.18 | 0.15 | 0.13 | 0.1 | 0.07 | 0.05
Supply voltage (V) | 1.8-2.5 | 1.5-1.8 | 1.2-1.5 | 1.2-1.5 | 0.9-1.2 | 0.6-0.9 | 0.5-0.6
Transistors/chip (millions) | 11 | 21 | 40 | 76 | 200 | 520 | 1400
DRAM bits/chip (mega) | 167 | 1070 | 1700 | 4290 | 17200 | 68700 | 275000
Die size (mm2) | 300 | 340 | 385 | 430 | 520 | 620 | 750
Global clock freq (MHz) | 750 | 1200 | 1400 | 1600 | 2000 | 2500 | 3000
Local clock freq (MHz) | 750 | 1250 | 1500 | 2100 | 3500 | 6000 | 10000
Maximum power/chip (W) | 70 | 90 | 110 | 130 | 160 | 170 | 175
total transistors/chip
[Figure: number of transistors per chip (millions, 0 to 1600) versus technology (micron) and year, from 0.25 micron in 1997 to 0.05 micron in 2012]
Clock Frequency Versus Year for Various Representative Machines
Limits in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range
For higher frequency clocking (> 10 GHz), new ideas and new ways of designing digital systems are needed
[Figure: global and local clock frequency (MHz, 0 to 12000) versus technology (micron) and year, from 0.25 micron in 1997 to 0.05 micron in 2012]
Intel's Microprocessors Clock Frequency
Technology Trends - Example
As an illustration of just how quickly computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly.
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987, and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l.
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in under 39 min and to make the round trip on less than 20 liters of gasoline.
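The compound-growth arithmetic of this example can be reproduced with a few lines. This is a minimal sketch that reads the rates as 35% and 50% per year; the helper name is illustrative, and the results differ slightly from the slide because the slide rounds the growth factors to 20.1 and 194.6 before multiplying.

```python
def improve(value, yearly_rate, years):
    """Compound a value at a fixed yearly improvement rate."""
    return value * (1.0 + yearly_rate) ** years

speed_1987 = improve(150.0, 0.35, 10)        # km/h
economy_1987 = improve(10.0, 0.35, 10)       # km/l
speed_2000 = improve(speed_1987, 0.50, 13)   # km/h
economy_2000 = improve(economy_1987, 0.50, 13)  # km/l

print(round(speed_1987), round(economy_1987))
print(round(speed_2000), round(economy_2000))
```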
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm
Ideally, processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7
With such scaling, typical improvement figures are the following:
• 1.4-1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
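A back-of-the-envelope check of these figures is sketched below. It reads the slide's scaling factor as 0.7 and the voltage reduction as 1.35, and it assumes (these assumptions are not stated in the slides) that gate delay tracks the linear dimension, that capacitance scales with the linear dimension, and that switching energy is proportional to C · Vdd².

```python
S = 0.7            # linear scaling factor per generation (read from the slide)
V_DROP = 1.35      # operating voltage reduction factor (read from the slide)

area_ratio = S * S                  # new transistor area / old area
speedup = 1.0 / S                   # assumed: delay tracks linear dimension
# Assumption: capacitance scales with S, switching energy with C * Vdd^2.
energy_ratio = S / (V_DROP ** 2)
energy_reduction = 1.0 / energy_ratio

print(area_ratio, speedup, energy_reduction)
```

Under these assumptions the area roughly halves, transistors get about 1.4 times faster, and switching energy drops by a factor between 2.5 and 3, which is in line with the rounded figures listed above.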
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and integration (from ~10K to ~10M devices) in 25 years
Microarchitecture attempts to increase both IPC and frequency
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction and out-of-order execution can increase instructions per cycle (IPC)
Pipelining, as a microarchitecture idea, helps to increase frequency
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages
With frequencies higher than 1 GHz, more than 20 pipeline stages are used
[Figure: relative improvement versus pipeline depth (axis ticks 0 to 20) for frequency, CPI, performance and power]
Performance of memory and CPU
Memory in a computer system is hierarchically organized
In 1980, microprocessors were often designed without caches
Nowadays microprocessors often come with two levels of caches
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed
Relative processor/memory speed
Type of Memories
MOS memories
• RAMs (volatile: contents are lost at power off)
  - DRAM
  - SRAM
• ROMs (non-volatile: contents are kept at power off)
  - ROM
  - EPROM
  - EEPROM
  - FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2-4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
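The "three out of every four cycles" remark follows directly from the measured CPI. The sketch below reads the measured values as 4.2 to 4.5 cycles per instruction and computes a lower bound on the share of cycles that retire nothing; the function name is illustrative.

```python
def idle_cycle_fraction(cpi):
    """Lower bound on the fraction of cycles that retire zero instructions.

    Over N instructions there are cpi * N cycles, and at most N of them
    can be cycles in which at least one instruction retires.
    """
    return 1.0 - 1.0 / cpi

low = idle_cycle_fraction(4.2)
high = idle_cycle_fraction(4.5)
print(low, high)
```

Both values come out slightly above 0.75, and the true share is higher still if some cycles retire more than one instruction.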
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up - 32 bytes every 2 ns
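The bandwidth figures in this anecdote can be checked with a short calculation. This is a minimal sketch; the function name is illustrative, and it uses the fact that bytes per nanosecond equal gigabytes per second (with 1 GB = 10^9 bytes).

```python
def pin_bandwidth_gb_per_s(bytes_per_transfer, period_ns):
    """Bandwidth in GB/s; bytes per nanosecond equals 10^9 bytes per second."""
    return bytes_per_transfer / period_ns

today = pin_bandwidth_gb_per_s(16, 8)

# Twice the clock rate, twice the issue width, two instruction streams.
future = today * 2 * 2 * 2

# The same requirement expressed as 32 bytes every 2 ns.
future_as_transfers = pin_bandwidth_gb_per_s(32, 2)
print(today, future, future_as_transfers)
```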
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger, but slower and cheaper cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
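To get a feeling for what these numbers mean, the sketch below computes an average memory access time for a two-level hierarchy. The formula is the standard textbook one, not taken from the slides, and the chosen values (2 and 8 cycles, 95% and 50% hit rates, about 100 cycles for main memory) are picked from the ranges quoted in this deck.

```python
def average_access_cycles(l1_cycles, l1_hit_rate, l2_cycles, l2_hit_rate, memory_cycles):
    """Average memory access time in cycles for a two-level cache hierarchy."""
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * memory_cycles)

# Illustrative values taken from the ranges quoted in the slides.
amat = average_access_cycles(2, 0.95, 8, 0.50, 100)
# The same first-level cache backed directly by main memory.
amat_without_l2 = average_access_cycles(2, 0.95, 0, 0.0, 100)
print(amat, amat_without_l2)
```

Even with a 95% first-level hit rate, the rare trips to main memory dominate the average, which is why the second level pays off.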
Memory Hierarchy Impact on Performance
Off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
As conclusion concerning memory - problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems - caches, busses, bandwidth and latency
As conclusion concerning memory - problems continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
The system-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to the data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM) - a new type of memory
Magnetic RAM - MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
MRAM is nonvolatile and has the fast read-and-write performance of static RAM (SRAM)
The 10 Gbit devices consist of 10 billion carbon nanotubes that are 1 nm - just a few thousand atoms - in diameter on a silicon wafer
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule (nm, 0 to 120) and density (Gb, log scale 0.1 to 100) versus year (2003, 2005, 2010) for DRAM and FLASH; design rules of 100, 90, 70 and 55 nm; density labels from 512 Mb to 16 Gb, with SLC and MLC (2 bits/cell and 3 bits/cell) FLASH variants]
Surpassing the Prediction from Moore's Law - DRAM vs MRAM
The famous Moore's Law predicts that the memory density will be doubled in 1.5 years, while the new growth model clearly indicates the doubling of NAND Flash memory density every year
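The difference between the two doubling periods compounds quickly. The sketch below reads the Moore's Law period in the slide as 1.5 years and compares it with yearly doubling over a hypothetical six-year span; the function name and the span are illustrative.

```python
def density_growth(years, doubling_period_years):
    """Growth factor after `years` when density doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)

span_years = 6  # hypothetical comparison window
moore_growth = density_growth(span_years, 1.5)
flash_growth = density_growth(span_years, 1.0)
print(moore_growth, flash_growth, flash_growth / moore_growth)
```

Over six years, yearly doubling yields four times the density that an 18-month doubling period would.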
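[Figure: density (Gb, log scale 0.1 to 100) versus year (2000, 2005, 2010). FLASH, including MLC (2 bits/cell and 3 bits/cell), follows a two-fold density increase per year, compared with DRAM and the Moore's Law trend; labelled densities include 1 Gb, 2 Gb and 4 Gb]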
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will keep leading the overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap - cont.
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh.
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production.
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh.
Power consumption
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai '95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: power (W, log scale 0.01-100) versus year (1980-1995); power dissipation grows by about x4 every 3 years.]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage (V), power per chip (W) and VDD current (A) versus year, 1998-2014.]
International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA) and the Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: Your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are

$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$

where
$P_1$ - capacitive switching power (dynamic, dominant)
$P_2$ - short circuit power (dynamic)
$P_3$ - leakage current power (static)
$P_4$ - static power dissipation (minor)
Research Efforts in Low-Power Design
$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement.
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed.
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V).]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is
decreasing the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product pt CL is called the average switched capacitance per cycle. This capacitance can be reduced at the system, architectural, RTL, circuit or technology level.
Low Power and Low Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology.
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
[Figure label: higher impact, more options]
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that attracts more attention.
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), which generate the local frequencies f11 to f41.]
Energy Minimization Using Multiple Frequency
[Figure: two clock-generation schemes for a digital system. PLL based: phase detector (CLKREF, CLKFB, up/down outputs), current pump, loop filter, VCO and divider by N, followed by clock distribution and frequency multiplier. DLL based: phase detector, current pump, loop filter and a voltage-controlled delay line (VCDL) with delay cells DC1 to DCn, with control logic producing f1, 2f1, ..., nf1.]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating (an enable signal passes through a latch and gates the clock that drives the target flip-flops) and clock distribution (a PLL clock generator drives blocks A, B, C and D, with Enable_A, Enable_B and Enable_C activating or deactivating individual blocks).]
- The use of a gated clock is the most common approach to reduce energy. Unused modules are turned off by suppressing the clock to the module.
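The effect of suppressing the clock to an unused module can be sketched with a small behavioural model. The model below assumes a latch-based gating cell (the enable is sampled while the clock is low and then ANDed with the clock), which is one common arrangement and not necessarily the exact circuit in the figure; the function names and waveforms are illustrative.

```python
def gated_clock(clock, enable):
    """Behavioural latch-based clock gate over sampled 0/1 waveforms."""
    latched = 0
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:           # the latch is transparent while the clock is low
            latched = en
        out.append(clk & latched)
    return out


def rising_edges(signal):
    """Count 0 -> 1 transitions, a proxy for clock-induced switching activity."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)


clock = [0, 1] * 8             # eight clock cycles
enable = [1] * 6 + [0] * 10    # the module is needed only for the first three cycles
gated = gated_clock(clock, enable)
print(rising_edges(clock), "edges without gating,", rising_edges(gated), "with gating")
```

Every suppressed edge is an edge on which the flip-flops of the idle module and their clock wiring do not toggle.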
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip, as a less aggressive approach, is attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure: power manager. An observer collects workload information (observations) from the system, and a controller issues commands to it.]
[Figure: power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW) and SLEEP (P = 0.16 mW). IDLE waits for an interrupt and SLEEP waits for a wake-up event; the transitions are labeled ~10 µs, ~90 µs and 160 ms.]
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown: Clock, Memory, Control, I/O and Datapath (labeled shares 12, 16, 25, 43). Dynamic instruction statistics: Compare op, Logical op, Others, Data Move, Control Flow and Arithmetic op (labeled shares 15, 13, 5, 1, 23, 43).]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First Generation 1971-78
  - Behind the power curve
• Second Generation 1979-85
  - Becoming "real" computers
• Third Generation 1985-89
  - Challenging the "establishment"
• Fourth Generation 1990-
  - Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3500 transistors
  - First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• 8-bit architecture with 16-bit addressing
  - Delivered in 1974
  - 4800 transistors
  - Performance < 0.2 MIPS
• Used in the Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
    Homebrew Computer Club
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981, IBM introduces the PC
  - Based on the 8088, an 8-bit bus version of the 8086
Second Generation 1979-85
• Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture, initial 16-bit implementation
  - First flat 32-bit address; support for paging
  - General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance and rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985 to present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors
• Architectural ideas turn transistors into performance
  - Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
  - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
    early restart after cache miss (1992)
    nonblocking caches: continue during a cache miss (1994)
  - Cache-aware combos: computers, compilers, code writers
    prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock
    superscalar: uses dynamic issue decision (HW driven)
    VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  - determine parallelism dynamically
  - execute instructions out-of-order
  - speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  - On-chip
  - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out-of-order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in-flight)
  - Maintains precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Use compiler-centric approach while avoiding disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques
  - Static scheduling
  - Static issue (i.e. VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
  Lower hardware complexity; more longer-range analysis; more machine dependence
• Hardware techniques
  - Dynamic scheduling
  - Dynamic issue (i.e. superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
  More stable performance; higher complexity; potential clock rate impact
Wide variety of approaches, both hardware and compiler intensive
No clear-cut winners at present
Big Picture - ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1-100) versus year (1965-1995) for supercomputers, mainframes, minicomputers and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistor count (log scale, 1,000-100,000,000) versus year (1970-2005) for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000, spanning the eras of bit-level, instruction-level and thread-level (?) parallelism.]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The more pronounced are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a (single) processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today, many high-end microprocessors are multithreaded (e.g. Intel Pentium 4).
They support 2-4 threads, but one can expect only about a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple and very powerful technique to obtain more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model - continue
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
• 4-8 general-purpose processing engines on chip, used to execute independent programs
• Explicitly parallel programs (when possible)
• Speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8 x 4096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of the challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize a training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education (not just the undergraduate curriculum) organized to support today's graduate for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn.
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Metcalfe's Law
A network's value grows proportionately to the number of its users squared
Wirth's Law
Software is slowing faster than hardware is accelerating
Performance and new technology generation
According to Moore's law, each new generation has approximately doubled logic circuit density and increased performance by about 40%, while quadrupling memory capacity.
The increase in components per chip comes from the following key factors:
• The factor of two in component density comes from a 2^0.5 shrink in each lithography dimension (2^0.5 per x and 2^0.5 per y).
• An additional factor of 2^0.5 comes from an increase in chip area.
• A final factor of 2^0.5 comes from device and circuit cleverness.
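The way these factors combine can be verified with a few lines of arithmetic. The sketch below simply multiplies the factors listed above; the variable and function names are illustrative.

```python
import math


def per_generation_growth():
    """Multiply the per-generation factors: lithography shrink, chip area, cleverness."""
    shrink_x = math.sqrt(2)      # 2^0.5 shrink in the x dimension
    shrink_y = math.sqrt(2)      # 2^0.5 shrink in the y dimension
    chip_area = math.sqrt(2)     # larger chip area
    cleverness = math.sqrt(2)    # device and circuit cleverness
    density = shrink_x * shrink_y                    # component density factor
    components = density * chip_area * cleverness    # components per chip factor
    return density, components


density, components = per_generation_growth()
print(f"density factor: {density:.2f}, components per chip factor: {components:.2f}")
```

The density factor of two matches the doubling of logic density, and the combined factor of four matches the quadrupling of memory capacity per generation.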
Development in ICs
Semiconductor Industry Association Roadmap Summary for high-end Processors
Specification / year | 1997 | 1999 | 2001 | 2003 | 2006 | 2009 | 2012
Feature size (micron) | 0.25 | 0.18 | 0.15 | 0.13 | 0.1 | 0.07 | 0.05
Supply voltage (V) | 1.8-2.5 | 1.5-1.8 | 1.2-1.5 | 1.2-1.5 | 0.9-1.2 | 0.6-0.9 | 0.5-0.6
Transistors/chip (millions) | 11 | 21 | 40 | 76 | 200 | 520 | 1400
DRAM bits/chip (mega) | 167 | 1070 | 1700 | 4290 | 17200 | 68700 | 275000
Die size (mm2) | 300 | 340 | 385 | 430 | 520 | 620 | 750
Global clock freq (MHz) | 750 | 1200 | 1400 | 1600 | 2000 | 2500 | 3000
Local clock freq (MHz) | 750 | 1250 | 1500 | 2100 | 3500 | 6000 | 10000
Maximum power/chip (W) | 70 | 90 | 110 | 130 | 160 | 170 | 175
Total transistors/chip
[Figure: number of transistors per chip (millions, 0-1600) versus technology (micron) and year, from 0.25 micron in 1997 to 0.05 micron in 2012.]
Clock Frequency Versus Year for Various Representative Machines
Limiting in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range.
For higher-frequency clocking (> 10 GHz), new ideas and new ways of designing digital systems are needed.
[Figure: global and local clock frequency (MHz, 0-12,000) versus technology (micron) and year, from 0.25 micron in 1997 to 0.05 micron in 2012.]
Intel's Microprocessors Clock Frequency
Technology Trends - Example
As an illustration of just how quickly computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly.
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987, and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l.
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in about 39 min and to make the round trip on less than 20 liters of gasoline.
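The compound-growth arithmetic of this example can be reproduced with a short script. The function name is illustrative, and the earth-moon distance of 384,400 km is an assumed average value that does not appear in the slides. Because the script does not round the intermediate factors, its results differ slightly from the figures quoted in the solution.

```python
def compound(value, yearly_rate, years):
    """Value after `years` of steady growth at `yearly_rate` (0.35 means 35% per year)."""
    return value * (1.0 + yearly_rate) ** years


EARTH_MOON_KM = 384_400  # assumed average earth-moon distance

speed_1987 = compound(150, 0.35, 10)           # km/h
economy_1987 = compound(10, 0.35, 10)          # km/l
speed_2000 = compound(speed_1987, 0.50, 13)    # km/h
economy_2000 = compound(economy_1987, 0.50, 13)  # km/l

minutes_to_moon = EARTH_MOON_KM / speed_2000 * 60
round_trip_liters = 2 * EARTH_MOON_KM / economy_2000

print(f"1987: {speed_1987:,.0f} km/h, {economy_1987:,.0f} km/l")
print(f"2000: {speed_2000:,.0f} km/h, {economy_2000:,.0f} km/l")
print(f"earth to moon: {minutes_to_moon:.1f} min, round trip: {round_trip_liters:.1f} l")
```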
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law.
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available.
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm.
Ideally, processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7.
With such scaling, typical improvement figures are the following:
• 1.4-1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry.
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and in integration (from ~10K to ~10M devices) in 25 years.
Microarchitecture attempts to increase both IPC and frequency.
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction and out-of-order execution can increase instructions per cycle (IPC).
Pipelining, as a microarchitecture idea, helps to increase frequency.
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program.
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages.
With frequencies higher than 1 GHz, more than 20 pipeline stages are used.
[Figure: relative improvement of frequency, CPI, performance and power versus pipeline depth.]
Performance of memory and CPU
Memory in a computer system is hierarchically organized.
In 1980, microprocessors were often designed without caches.
Nowadays, microprocessors often come with two levels of caches.
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved by 35% per year until 1986 and has improved by 55% per year since 1987
Memory technology improvements aim primarily at increasing DRAM capacity not DRAM speed
Relative processor/memory speed
Types of Memories
MOS memories
• RAMs (volatile: contents are lost when power is off): DRAM, SRAM
• ROMs (non-volatile: contents are kept when power is off): ROM, FLASH, EEPROM, EPROM
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 – 4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only make the memory bottleneck worse
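The "three out of every four cycles" reading of the measured CPI can be checked with a short sketch. It assumes, for simplicity, that a cycle retires either zero instructions or exactly one; since both machines can retire more than one instruction per cycle, the result is a lower bound on the share of empty cycles. The function name is illustrative.

```python
def zero_retire_fraction(cpi):
    """Lower bound on the share of cycles that retire no instruction.

    Assumes each retiring cycle retires exactly one instruction, so the
    share of retiring cycles equals IPC = 1 / CPI.
    """
    if cpi < 1.0:
        raise ValueError("this simple model needs CPI >= 1")
    return 1.0 - 1.0 / cpi


for cpi in (4.2, 4.5):
    print(f"CPI {cpi}: at least {zero_retire_fraction(cpi):.0%} of cycles retire nothing")
```

For a CPI between 4.2 and 4.5 the bound lies between roughly 76% and 78%, which is where the three-out-of-four figure comes from.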
An anecdote (continued)
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns
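The arithmetic behind this anecdote is simple enough to spell out. The sketch below uses only the numbers quoted above; the function name is an illustrative choice.

```python
def pin_bandwidth_gbytes_per_s(bytes_per_transfer, interval_ns):
    """Bytes per nanosecond is numerically equal to GBytes/s (10^9 bytes/s)."""
    return bytes_per_transfer / interval_ns


today = pin_bandwidth_gbytes_per_s(16, 8)   # 16 bytes every 8 ns
# Twice the clock rate, twice the issue width, two instruction streams.
future = today * 2 * 2 * 2
print(today, future, pin_bandwidth_gbytes_per_s(32, 2))
```

The three factors of two multiply the 2 GBytes/s requirement by eight, which is the same rate as moving 32 bytes every 2 ns.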
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~32 – 128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
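To see what these hit rates and latencies mean together, the sketch below estimates an average access time for a two-level hierarchy. It is a minimal model under stated assumptions: lookups are sequential (the L1 time is always paid, the L2 time is paid on an L1 miss), the midpoints of the quoted ranges are used (2.5 and 8 cycles), and main memory costs the roughly 100 cycles mentioned earlier in the deck. The function and parameter names are illustrative.

```python
def average_access_cycles(l1_cycles, l1_hit_rate, l2_cycles, l2_hit_rate, memory_cycles):
    """Average memory access time for a sequential two-level cache lookup."""
    l1_miss = 1.0 - l1_hit_rate
    l2_miss = 1.0 - l2_hit_rate          # local miss rate of the second level
    return l1_cycles + l1_miss * (l2_cycles + l2_miss * memory_cycles)


amat = average_access_cycles(
    l1_cycles=2.5, l1_hit_rate=0.95,
    l2_cycles=8.0, l2_hit_rate=0.50,
    memory_cycles=100.0,
)
print(f"average access time: about {amat:.1f} cycles")
```

With these inputs the average comes out at about 5.4 cycles, even though a single trip to main memory costs 100; this is the effect the hierarchy is designed to achieve.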
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
Conclusion concerning memory problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems: caches, buses, bandwidth, and latency
Conclusion concerning memory problems (continued)
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One can expect that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System-level parameters most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM): a new type of memory
Magnetic RAM – MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter, on a silicon wafer
MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM)
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM)
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] (0 to 120) and density [Gb] (log scale, 0.1 to 100) of FLASH and DRAM for 2003, 2005 and 2010; design rules shown: 100 nm, 90 nm, 70 nm, 55 nm; FLASH shown as SLC and MLC (2 bits/cell and 3 bits/cell)]
Surpassing the Prediction from Moore's Law – DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year
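The difference between the two doubling periods compounds quickly. The sketch below compares them over an arbitrary six-year span; the span and the function name are illustrative choices, not figures from the slides.

```python
def density_growth(years, doubling_period_years):
    """Growth factor after `years` when density doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


span = 6  # hypothetical horizon in years
print("doubling every 1.5 years:", density_growth(span, 1.5))
print("doubling every year:     ", density_growth(span, 1.0))
```

Over six years the 1.5-year doubling gives a 16-fold increase, while yearly doubling gives a 64-fold increase, a fourfold lead for the faster trend.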
[Figure: memory density [Gb] (log scale, 0.1 to 100) versus year (2000 to 2010) for FLASH and DRAM compared with the Moore's Law trend; FLASH follows a two-fold density increase per year, with MLC (2 bits/cell) and MLC (3 bits/cell) points]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap (continued)
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
Mainframe
Supercomputer
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
• During 1995, the energy consumption of all PC machines installed in the USA was 60 × 10^6 MWh
• During 2000, the energy consumption of all PC machines installed in the USA was 10% of the total energy production
• During 2015, one expects that the energy consumption of all PC machines will be 15% greater than in 1995, or 69 × 10^6 MWh
Power consumption
Typical Low-Power Applications
• battery operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai '95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
[Figure: power dissipation in W (log scale, 0.01 to 100) versus year (1980 to 1995); trend: ×4 every 3 years]
Power dissipation in time
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage [V], power per chip [W] and VDD current [A] versus year (1998 to 2014); curves: Current, Power, Voltage]
International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA), and the Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
Your car starter
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are:
$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P_1 – capacitive switching power (dynamic, dominant)
P_2 – short circuit power (dynamic)
P_3 – leakage current power (static)
P_4 – static power dissipation (minor)
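A small numeric sketch can make the relative weight of the terms concrete. It takes P1 as the switching term pt·CL·Vdd²·fCLK, P2 as Isc·Vdd and P3 as Ileakage·Vdd, as listed above. All numeric values below are hypothetical and chosen only to illustrate the calculation; the function name is likewise an assumption.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p_static=0.0):
    """Return (P1, P2, P3, total) in watts for the CMOS power model above."""
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leak * v_dd                      # leakage power
    return p1, p2, p3, p1 + p2 + p3 + p_static


# Hypothetical operating point: 1 nF of total load capacitance, half of it
# switching per cycle, a 2 V supply and a 100 MHz clock.
p1, p2, p3, total = average_power(
    p_t=0.5, c_load=1e-9, v_dd=2.0, f_clk=100e6, i_sc=0.01, i_leak=0.001
)
print(f"P1={p1:.3f} W  P2={p2:.3f} W  P3={p3:.3f} W  total={total:.3f} W")
```

At this made-up operating point the switching term accounts for 0.2 W of a 0.222 W total, consistent with the description of P1 as the dominant component.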
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], from 0.6 V to 5.0 V]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product pt CL is called the average switched capacitance per cycle, and this capacitance can be reduced at the system, architectural, RTL, circuit, or technology level
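The trade-off in item b) can be illustrated numerically. The power side follows directly from the square-law dependence stated above. The delay side needs a model that the slides do not give; the sketch below assumes the commonly used alpha-power approximation, delay proportional to Vdd / (Vdd - Vt)^alpha, with a hypothetical threshold voltage of 0.7 V and alpha = 2. Both values and both function names are assumptions made for illustration.

```python
def relative_power(v_new, v_old):
    """Switching power ratio when only Vdd changes (power ~ Vdd^2)."""
    return (v_new / v_old) ** 2


def relative_delay(v_new, v_old, v_threshold=0.7, alpha=2.0):
    """Gate delay ratio under the assumed alpha-power model."""
    if min(v_new, v_old) <= v_threshold:
        raise ValueError("supply voltage must exceed the threshold voltage")

    def delay(v):
        return v / (v - v_threshold) ** alpha

    return delay(v_new) / delay(v_old)


print(f"power x{relative_power(2.5, 5.0):.2f}")
print(f"delay x{relative_delay(2.5, 5.0):.2f}")
```

Halving the supply from 5 V to 2.5 V cuts switching power to a quarter, while under the assumed model the gate delay grows by a factor of nearly three; this is why voltage reduction is usually paired with parallelism or other architectural compensation.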
General Approaches to Reduce Power
Low Power and Low Energy System Design
Level | Techniques
System level | Design partitioning, Power down
Algorithm level | Complexity, Concurrency, Locality, Regularity, Data representation
Architecture level | Voltage scaling, Parallelism, Instruction set, Signal correlations
Circuit level | Transistor sizing, Logic optimization, Activity-driven power down, Low-swing logic, Adiabatic switching
Process/device level | Threshold reduction, Multi-threshold
(Arrow label in the figure: higher impact, more options)
The design of low power circuits can be tackled at different levels, from system to technology
Multiple Frequency on the Chip as Technique to Reduce Power
Multiple frequencies on the chip is a less aggressive approach, which attracts more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: multiple clock domains on a chip; labels: PLL, fCLK, PLL 1 to PLL 4, DLL 1 to DLL 4, local frequencies f11, f12, f21, f31, f32, f41]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based and DLL-based clock generation. PLL based: phase detector, current pump, loop filter, VCO and divider by N, with CLKREF and CLKFB inputs and up/down signals, feeding a digital system (regulated voltage, clock distribution & frequency multiplier, logic). DLL based: phase detector, current pump, loop filter, control, and a VCDL with delay cells DC1 to DCn producing f1, 2f1, ..., nf1]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating and clock distribution. Clock gating: an enable signal and a latch combine with the clock to produce a gated clock for the target flip-flops (DFF with D, C, Q), shown in activated and deactivated states. Clock distribution: a PLL (clock generator) drives Clk to blocks A, B, C and D through Enable_A, Enable_B and Enable_C]
- The use of a gated clock is the most common approach to reduce energy. Unused modules are turned off by suppressing the clock to the module
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
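[Figure: Power Manager and Power State Machine. Power manager: an observer and a controller use workload information, receive observations from the system and send commands to it. Power state machine: RUN (P = 400 mW), IDLE (P = 50 mW, entered on wait for interrupt) and SLEEP (P = 0.16 mW, left on a wake-up event), with transition times of ~10 µs, ~90 µs and 160 ms]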
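The benefit of such a power state machine depends on how much time the system spends in each state. The sketch below computes the average power from the three state powers in the figure (400 mW, 50 mW, 0.16 mW). The residency fractions are hypothetical, the energy and time cost of state transitions is ignored, and the names are illustrative.

```python
# State powers taken from the power state machine figure, in milliwatts.
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(residency):
    """Average power for a mapping of state name -> fraction of time spent there."""
    if abs(sum(residency.values()) - 1.0) > 1e-9:
        raise ValueError("residency fractions must sum to 1")
    return sum(STATE_POWER_MW[state] * share for state, share in residency.items())


# Hypothetical workload: busy 20% of the time, idle 30%, asleep 50%.
managed = average_power_mw({"RUN": 0.2, "IDLE": 0.3, "SLEEP": 0.5})
always_on = average_power_mw({"RUN": 1.0})
print(f"managed: {managed:.2f} mW, always running: {always_on:.0f} mW")
```

For this made-up workload the managed system averages about 95 mW against 400 mW for one that never leaves RUN. A real power manager would also have to weigh the 160 ms wake-up time before choosing SLEEP.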
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown: Clock, Memory, Control, I/O, Datapath. Dynamic instruction statistics: Arithmetic op, Control Flow, Data Move, Compare op, Logical op, Others]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Microprocessors' Generations
• First generation 1971-78: Behind the power curve
• Second generation 1979-85: Becoming "real" computers
• Third generation 1985-89: Challenging the "establishment"
• Fourth generation 1990-: Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  – Narrow datapaths (= slow performance)
  – Awkward architectures
  – Assembly language + some BASIC
• Processors
  – Intel 4004, 8008, 8080, 8086
  – Zilog Z-80
  – Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  – 3500 transistors
  – First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
  – Delivered in 1974
  – 4800 transistors
  – Performance < 0.2 MIPS
• Used in Altair 8800 system
  – Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, the first microcomputer bus
  – Gates & Allen write BASIC
  – Wozniak builds one
    Homebrew Computer Club
Intel 8086
• Introduced in 1978
  – Performance < 0.5 MIPS
• New 16-bit architecture
  – "Assembly language" compatible with 8080
  – 29000 transistors
  – Includes memory protection, support for FP coprocessor
• In 1981, IBM introduces the PC
  – Based on the 8088, an 8-bit bus version of the 8086
Second Generation 1979-85
• Becoming "real" computers
  – First 32-bit architecture (68000)
  – First virtual memory support
  – Workstations, Macs and PCs based on microprocessors
• Transistors > 50000
• Performance <= 1 MIPS
• Processors
  – Motorola 68000, 68020
  – Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  – First 32-bit architecture
    Initial 16-bit implementation
  – First flat 32-bit address
    Support for paging
  – General-purpose register architecture
    Loosely based on PDP-11
• First implementation in 1979
  – 68000 transistors
  – < 1 MIPS
• Used in
  – Apple Mac
  – Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the "establishment"
  – Microprocessors surpass minicomputers in performance, rival mainframes
  – Implementation technology of choice: all new architectures are microprocessors
  – RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  – MIPS R2000, R3000
  – Sun SPARC
  – HP PA-RISC
MIPS R2000
• Several firsts
  – First RISC microprocessor
  – First microprocessor to provide integrated support for instruction & data cache
  – First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  – 125000 transistors
  – 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
  – First 64-bit architecture
  – First multiple-issue machine
  – First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  – Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  – Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  – True from 1985-present
• Combination of technology and architectural enhancements
  – Technology provides faster transistors and more of them
  – Faster transistors lead to high clock rates
  – More transistors
    Architectural ideas turn transistors into performance
    Responsible for about half the yearly performance growth
• Two key architectural directions
  – Sophisticated memory hierarchies
  – Exploiting instruction level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
  – CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  – On-chip: from 128 bytes (1984) to 100K+ bytes
  – Multilevel caches add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  – Reduce or hide cache miss latencies
    Early restart after cache miss (1992)
    Nonblocking caches: continue during a cache miss (1994)
  – Cache aware combos: computers, compilers, code writers
    Prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  – Overlapping execution in a pipeline
  – Issuing multiple instructions per clock
    Superscalar: uses dynamic issue decision (HW driven)
    VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple issue microprocessors
• 1995: sophisticated dynamic schemes
  – determine parallelism dynamically
  – execute instructions out-of-order
  – speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  – On-chip
  – Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  – Deep pipeline
  – 1.4M transistors
  – Initially 100 MHz
  – > 50 MIPS
Intel i860
• First multiple issue microprocessor
  – 2 instructions/clock
  – Dual issue mode
  – Novel push pipeline
  – Novel cache bypass
• Implemented in 1991
  – 1.3M transistors
  – 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
  – Instructions scheduled and executed out-of-order
  – Up to 4 instructions can complete per clock
  – Window of 32 instructions (up to 32 in-flight)
  – Maintains precise state by completing instructions in order
• Implemented in 1996
  – 6.8M transistors
  – 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  – Use compiler centric approach while avoiding disadvantages
  – Parallelism demarcated by the compiler
  – Many special instructions & features for exploiting ILP in the compiler
• Itanium
  – First implementation (2001)
  – 25 M transistors
  – 800 MHz
  – 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques
  – Static scheduling
  – Static issue (i.e. VLIW)
  – Static branch prediction
  – Alias/pointer analysis
  – Static speculation
  Characteristics: lower hardware complexity; more longer-range analysis; more machine dependence
• Hardware techniques
  – Dynamic scheduling
  – Dynamic issue (i.e. superscalar)
  – Dynamic branch prediction
  – Dynamic disambiguation
  – Dynamic speculation
  Characteristics: more stable performance; higher complexity; potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistor count (log scale, 1000 to 100000000) versus year (1970 to 2005), with eras of bit-level, instruction-level and thread-level parallelism; processors shown: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most pronounced are:
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a (single) processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
They support 2-4 threads, but one can expect only a 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
• 4 – 8 general-purpose processing engines on chip, used to execute independent programs
• Explicitly parallel programs (when possible)
• Speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view concerning the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education (not just the undergraduate curriculum) organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
But, as we said earlier, engineering is changing
What kinds of fundamentals we need now – some examples
• Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental
• Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
• Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental
• Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. They too are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy but it is safe to say that
Web-based teaching, distance learning, electronic books and
interactive learning environments
will play increasingly significant roles in shaping what we teach how we teach and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Wirth's Law
Software is slowing faster than hardware is accelerating
Performance and new technology generation
According to Moore's law each new generation has approximately doubled logic circuit density and increased performance by about 40% while quadrupling memory capacity
The increase in components per chip comes from the following key factors
The factor of two in component density comes from a 2^0.5 shrink in each lithography dimension (2^0.5 in x and 2^0.5 in y). An additional factor of 2^0.5 comes from an increase in chip area. A final factor of 2^0.5 comes from device and circuit cleverness
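The factors listed above multiply out to the quadrupling quoted for memory capacity. A minimal Python sketch of that arithmetic follows; the function and variable names are illustrative and do not come from the slides.

```python
import math


def components_growth_per_generation():
    """Multiply the growth factors named on the slide for one generation."""
    shrink_per_dimension = math.sqrt(2)          # 2^0.5 in x and 2^0.5 in y
    density_gain = shrink_per_dimension ** 2     # both dimensions together: 2x
    chip_area_gain = math.sqrt(2)                # larger die
    cleverness_gain = math.sqrt(2)               # device and circuit cleverness
    return density_gain * chip_area_gain * cleverness_gain


if __name__ == "__main__":
    print(f"components per chip grow by about {components_growth_per_generation():.2f}x")
```

Under these assumptions the product is a factor of four per generation, with lithography alone contributing the factor of two in density.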
Development in ICs
Semiconductor Industry Association Roadmap Summary for high-end Processors
Specification / year          1997     1999     2001     2003     2006     2009     2012
Feature size (micron)         0.25     0.18     0.15     0.13     0.10     0.07     0.05
Supply voltage (V)            1.8-2.5  1.5-1.8  1.2-1.5  1.2-1.5  0.9-1.2  0.6-0.9  0.5-0.6
Transistors/chip (millions)   11       21       40       76       200      520      1400
DRAM bits/chip (mega)         167      1070     1700     4290     17200    68700    275000
Die size (mm2)                300      340      385      430      520      620      750
Global clock freq (MHz)       750      1200     1400     1600     2000     2500     3000
Local clock freq (MHz)        750      1250     1500     2100     3500     6000     10000
Maximum power/chip (W)        70       90       110      130      160      170      175
[Figure: total transistors per chip (millions, 0 to 1600) versus technology (micron) and year, 1997-2012]
Clock Frequency Versus Year for Various Representative Machines
Limits in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range
For higher frequency clocking (> 10 GHz) new ideas and new ways of designing digital systems are needed
[Figure: global and local clock frequency (MHz, 0 to 12000) versus technology (micron) and year, 1997-2012]
Intel's Microprocessors Clock Frequency
Technology Trends - Example
As an illustration of just how computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987 and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and fuel economy of 201 km/l
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in about 39 min and to make the round trip on less than 20 liters of gasoline
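The compound-growth arithmetic in this example can be checked with a short Python sketch. The starting values and rates are those of the example; the Earth-Moon distance of about 384,400 km is an assumption added here, and the helper name `improve` is illustrative. The slide rounds the growth factors to 20.1 and 194.6, so the unrounded results differ slightly in the last digits.

```python
def improve(value, yearly_rate, years):
    """Compound a yearly improvement rate over a number of years."""
    return value * (1.0 + yearly_rate) ** years


# Starting point and rates taken from the example.
speed_1977_kmh, economy_1977_kml = 150.0, 10.0
speed_1987 = improve(speed_1977_kmh, 0.35, 10)
economy_1987 = improve(economy_1977_kml, 0.35, 10)
speed_2000 = improve(speed_1987, 0.50, 13)
economy_2000 = improve(economy_1987, 0.50, 13)

# Assumption: mean Earth-Moon distance of about 384,400 km.
EARTH_MOON_KM = 384_400.0
minutes_to_moon = EARTH_MOON_KM / speed_2000 * 60.0
liters_round_trip = 2.0 * EARTH_MOON_KM / economy_2000

if __name__ == "__main__":
    print(f"1987: {speed_1987:.0f} km/h, {economy_1987:.0f} km/l")
    print(f"2000: {speed_2000:.0f} km/h, {economy_2000:.0f} km/l")
    print(f"Earth to Moon: {minutes_to_moon:.1f} min, "
          f"round trip: {liters_round_trip:.1f} l")
```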
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law
The National Technology Roadmap for Semiconductors (NTRS) and most recently the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm
Ideally processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7
With such scaling typical improvement figures are the following
• 1.4 - 1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and in integration (from ~10K to ~10M devices) in 25 years
Microarchitecture attempts to increase both IPC and frequency
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction and out-of-order execution can increase instructions per cycle (IPC)
Pipelining, as a microarchitecture idea, helps to increase frequency
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program
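The three levers named above (dynamic instruction count, IPC and clock frequency) combine in the standard processor performance equation: execution time = instruction count / (IPC × frequency). A small Python sketch with hypothetical numbers shows how each lever acts; none of the figures below come from the slides.

```python
def execution_time_s(instruction_count, ipc, frequency_hz):
    """Execution time from the standard processor performance equation."""
    return instruction_count / (ipc * frequency_hz)


if __name__ == "__main__":
    # Hypothetical baseline: 1e9 dynamic instructions, IPC of 1, 500 MHz clock.
    baseline = execution_time_s(1e9, 1.0, 500e6)
    higher_ipc = execution_time_s(1e9, 2.0, 500e6)        # microarchitecture
    higher_clock = execution_time_s(1e9, 1.0, 1e9)        # deeper pipelining
    fewer_instructions = execution_time_s(0.8e9, 1.0, 500e6)  # ISA and compiler
    for name, t in [("baseline", baseline), ("2x IPC", higher_ipc),
                    ("2x clock", higher_clock),
                    ("20% fewer instructions", fewer_instructions)]:
        print(f"{name}: {t:.2f} s")
```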
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages
With frequencies higher than 1 GHz more than 20 pipeline stages are used
[Figure: relative improvement in frequency, CPI, performance and power versus pipeline depth (0 to 20 stages)]
Performance of memory and CPU
Memory in a computer system is hierarchically organized
In 1980 microprocessors were often designed without caches
Nowadays microprocessors often come with two levels of caches
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987
Memory technology improvements aim primarily at increasing DRAM capacity not DRAM speed
Relative processor/memory speed
Types of Memories
MOS memories
• RAMs (DRAM, SRAM) - volatile: contents lost at power off
• ROMs (ROM, EPROM, EEPROM, FLASH) - non-volatile: contents kept at power off
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 - 4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
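The "three out of every four cycles" figure follows from the measured CPI if one assumes, as a simplification made here and not stated on the slide, that a cycle retires at most one instruction. A minimal Python sketch:

```python
def idle_cycle_fraction(cycles_per_instruction):
    """Fraction of cycles retiring nothing.

    Simplifying assumption: a cycle retires either one instruction or none,
    so 1/CPI of all cycles are useful and the rest are idle.
    """
    return 1.0 - 1.0 / cycles_per_instruction


if __name__ == "__main__":
    for cpi in (4.2, 4.5):   # range quoted in the anecdote
        print(f"CPI {cpi}: {idle_cycle_fraction(cpi):.0%} of cycles retire nothing")
```

Both ends of the measured range give an idle fraction a little above three quarters, which matches the wording of the anecdote.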
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up - 32 bytes every 2 ns
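The bandwidth figures in this anecdote can be reproduced with a few lines of Python. The sketch assumes 1 GByte = 10^9 bytes; the names are illustrative.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer, nanoseconds_per_transfer):
    """Pin bandwidth in GBytes/s (bytes per nanosecond equals 1e9 bytes/s)."""
    return bytes_per_transfer / nanoseconds_per_transfer


today = bandwidth_gbytes_per_s(16, 8)

# Twice the clock rate, twice the issue width, two instruction streams.
future = today * 2 * 2 * 2

if __name__ == "__main__":
    print(f"today: {today:.0f} GBytes/s, future chip: {future:.0f} GBytes/s")
    print(f"32 bytes every 2 ns: {bandwidth_gbytes_per_s(32, 2):.0f} GBytes/s")
```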
Memory system
In a 1 GHz microprocessor accessing main memory can take about 100 cycles. Such access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger but slower and cheaper cache may be located between the microprocessor and the system bus which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~32 - 128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
Memory Hierarchy Impact on Performance
Off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
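The hit rates and latencies quoted for the two cache levels can be combined into an average memory access time. The formula below is the usual textbook one, not something given on the slides, and the sketch takes the pessimistic end of each quoted range (3-cycle L1, 10-cycle L2, 50% L2 hit rate, 100-cycle main memory).

```python
def average_access_cycles(l1_cycles, l1_miss_rate,
                          l2_cycles, l2_miss_rate, memory_cycles):
    """Average memory access time for a two-level cache hierarchy."""
    l2_penalty = l2_cycles + l2_miss_rate * memory_cycles
    return l1_cycles + l1_miss_rate * l2_penalty


if __name__ == "__main__":
    with_caches = average_access_cycles(
        l1_cycles=3, l1_miss_rate=0.05,    # L1 catches about 95% of accesses
        l2_cycles=10, l2_miss_rate=0.50,   # L2 catches about 50% of L1 misses
        memory_cycles=100)                 # off-chip access
    print(f"average access: {with_caches:.1f} cycles (versus 100 without caches)")
```

With these numbers the average access costs about six cycles, which is why the structure of the hierarchy matters so much.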
As conclusion concerning memory - problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems - caches, busses, bandwidth and latency
As conclusion concerning memory - problems continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions in addition to technology-level solutions to achieve higher performance, the gap might begin to close
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System level parameters that most affect performance
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM) - a new type of memory
Magnetic RAM - MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm - just a few thousand atoms - in diameter on a silicon wafer
MRAM is nonvolatile and has the fast read-and-write performance of static RAM (SRAM)
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM)
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule (nm, from 100 nm down to 55 nm) and density (Gb, logarithmic scale from 0.1 to 100) of DRAM and FLASH versus year (2003, 2005, 2010), with FLASH shown as SLC, MLC (2 bits/cell) and MLC (3 bits/cell)]
Surpassing the Prediction from Moore's Law - DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates the doubling of NAND Flash memory density every year
[Figure: density (Gb, logarithmic scale from 0.1 to 100) versus year (2000 to 2010) for DRAM and for FLASH (SLC, MLC 2 bits/cell, MLC 3 bits/cell), comparing Moore's law with two-fold density growth per year]
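The difference between the two growth models compared on this slide compounds quickly. The sketch below assumes the usual reading of Moore's law as a doubling every 1.5 years and compares it with yearly doubling; the six-year horizon is an arbitrary choice for illustration.

```python
def growth_factor(years, doubling_period_years):
    """Density growth after `years` when density doubles every period."""
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    horizon_years = 6  # hypothetical horizon chosen for illustration
    moore = growth_factor(horizon_years, 1.5)
    yearly = growth_factor(horizon_years, 1.0)
    print(f"after {horizon_years} years: {moore:.0f}x (doubling every 1.5 years) "
          f"versus {yearly:.0f}x (doubling every year)")
```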
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap - cont
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
Mainframe
Supercomputer
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
• During 1995 the energy consumption of all PC machines installed in the USA was 60 × 10^6 MWh
• During 2000 the energy consumption of all PC machines installed in the USA was 10% of the total energy production
• For 2015 one expects that the energy consumption of all PC machines will be 15% greater than in 1995, or 69 × 10^6 MWh
Power consumption
• battery operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
Typical Low-Power Applications
"CMOS circuits dissipate little power by nature." So believed circuit designers (Kuroda-Sakurai 95)
"By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages even if the supply voltage can be feasibly reduced"
[Figure: power (W, logarithmic scale from 0.01 to 100) versus year (1980 to 1995), growing x4 every 3 years]
Power dissipation in time
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage (V, 0 to 2.5), power per chip (W, 0 to 200) and VDD current (A, 0 to 500) versus year, 1998-2014]
International Technology Roadmap for Semiconductors 1999 update sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA) Electronic Industries Association of Japan (EIAJ) Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
Your car starter
Power Consumption - New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$P_{avg} = P_1 + P_2 + P_3 + P_4 = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4$
where
P1 - capacitive switching power (dynamic - dominant)
P2 - short circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
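The power terms above can be evaluated directly. The following Python sketch computes each component and their sum; every numeric value in it is a hypothetical placeholder chosen only to show the relative size of the terms, not data from the slides.

```python
def average_power_w(p_t, c_load_f, v_dd, f_clk_hz,
                    i_sc_a=0.0, i_leakage_a=0.0, p_static_w=0.0):
    """Return (P1, P2, P3, P4, total) for a digital CMOS circuit."""
    p1 = p_t * c_load_f * v_dd ** 2 * f_clk_hz   # capacitive switching power
    p2 = i_sc_a * v_dd                           # short circuit power
    p3 = i_leakage_a * v_dd                      # leakage current power
    p4 = p_static_w                              # other static dissipation
    return p1, p2, p3, p4, p1 + p2 + p3 + p4


if __name__ == "__main__":
    # Hypothetical chip: 10 nF switched load, activity 0.2, 1.8 V, 500 MHz.
    p1, p2, p3, p4, total = average_power_w(
        p_t=0.2, c_load_f=10e-9, v_dd=1.8, f_clk_hz=500e6,
        i_sc_a=0.1, i_leakage_a=0.05, p_static_w=0.01)
    print(f"P1={p1:.2f} W  P2={p2:.2f} W  P3={p3:.2f} W  P4={p4:.2f} W  "
          f"total={total:.2f} W")
```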
Research Efforts in Low-Power Design
$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$
Reduce Switching Activity
• Conditional clock
• Conditional precharge
• Switching-off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flop
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
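The quadratic effect of the supply voltage, compared with the linear effect of load capacitance and switching activity, can be illustrated with a short Python sketch. The voltages chosen (5.0 V down to 3.0 V) are hypothetical, and all other parameters are assumed to stay unchanged.

```python
def switching_power_ratio(v_old, v_new, c_ratio=1.0, activity_ratio=1.0):
    """Ratio of new to old switching power when only these factors change.

    Voltage enters squared; capacitance and switching activity enter linearly.
    """
    return (v_new / v_old) ** 2 * c_ratio * activity_ratio


if __name__ == "__main__":
    print(f"5.0 V -> 3.0 V: {switching_power_ratio(5.0, 3.0):.0%} of original power")
    print(f"40% less capacitance: {switching_power_ratio(5.0, 5.0, c_ratio=0.6):.0%}")
```

A 40% cut in supply voltage leaves 36% of the switching power, while the same relative cut in load capacitance leaves 60%.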
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V), from 0.6 V to 5.0 V]
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Needs for Low-Power
Low-Power Design Techniques
The basic idea is
Decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low-activity conditions for various parts of the circuit and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
a) Reduction in fCLK is an option acceptable when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way for power reduction since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product ptCL is called the average switched capacitance per cycle, and it can be reduced at the system, architectural, RTL, circuit or technology level
General Approaches to Reduce Power
Low Power and Low Energy System Design
[Figure: design levels, with higher impact and more options toward the system level]
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
The design of low power circuits can be tackled at different levels from system to technology
Multiple Frequency on the Chip as Technique to Reduce Power
Multiple frequencies on the chip is a less aggressive approach which is attracting more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four local clock generators (PLL 1/DLL 1 to PLL 4/DLL 4), each supplying its own local frequencies]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation (phase detector, current pump, loop filter, VCO and divider by N, comparing CLKREF with CLKFB) and DLL-based clock generation (phase detector, current pump, loop filter and VCDL with control logic producing f1, 2f1 ... nf1) for a digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating, where a latch holds the enable signal and the gated clock drives the target flip-flops (activated/deactivated); clock distribution, where a PLL clock generator feeds blocks A to D through Enable_A, Enable_B and Enable_C]
- The use of a gated clock is the most common approach to reduce energy. Unused modules are turned off by suppressing the clock to the module
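A behavioural sketch of latch-based clock gating is given below in Python. It assumes a common arrangement that is not spelled out on the slide: the enable signal is captured by a latch that is transparent while the clock is low, and the gated clock is the AND of the clock and the latched enable. The sample waveforms are made up for illustration.

```python
def gated_clock(clock, enable):
    """Return the gated clock for per-half-cycle clock and enable samples.

    Assumption: the latch is transparent while the clock is low, so the
    enable can only change the gate between clock pulses (no glitches).
    """
    latched_enable = 0
    output = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched_enable = en
        output.append(clk & latched_enable)
    return output


def count_rising_edges(signal):
    """Count 0 -> 1 transitions, assuming the signal starts low."""
    previous, edges = 0, 0
    for level in signal:
        if previous == 0 and level == 1:
            edges += 1
        previous = level
    return edges


if __name__ == "__main__":
    clock = [0, 1, 0, 1, 0, 1, 0, 1]
    enable = [1, 1, 0, 0, 0, 0, 1, 1]   # module unused in the middle cycles
    gated = gated_clock(clock, enable)
    print("gated clock:", gated)
    print("clock edges reaching the module:",
          count_rising_edges(gated), "of", count_rising_edges(clock))
```

Every suppressed edge is a cycle in which the flip-flops of the unused module and their clock wiring do not switch, which is where the energy saving comes from.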
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management as another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
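[Figure: power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt) and SLEEP (P = 0.16 mW, wait for wake-up event); the transitions are labeled ~10 µs, ~90 µs and 160 ms. The power manager consists of an observer and a controller: it receives workload information and observations from the system and issues commands to it]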
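The benefit of the power state machine can be estimated by weighting the power of each state by the share of time spent in it. In the Python sketch below the three power figures are read from the figure (400 mW, 50 mW and 0.16 mW); the time shares are hypothetical, and transition costs are ignored for simplicity.

```python
# Power per state in mW, as labeled in the power state machine figure.
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(time_share):
    """Average power for given time shares per state (must sum to 1).

    Simplification: energy and latency of state transitions are ignored.
    """
    if abs(sum(time_share.values()) - 1.0) > 1e-9:
        raise ValueError("time shares must sum to 1")
    return sum(STATE_POWER_MW[state] * share
               for state, share in time_share.items())


if __name__ == "__main__":
    # Hypothetical workload: busy 10% of the time.
    managed = average_power_mw({"RUN": 0.1, "IDLE": 0.2, "SLEEP": 0.7})
    always_on = average_power_mw({"RUN": 1.0})
    print(f"managed: {managed:.1f} mW, always running: {always_on:.1f} mW")
```

A real power manager also has to account for the wake-up latency of the SLEEP state, which this estimate leaves out.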
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: pie chart of the power breakdown (clock, memory, control, I/O, datapath; segment values 12, 16, 25 and 43) and pie chart of the dynamic instruction statistics (arithmetic, control flow, data move, logical, compare and other operations)]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First generation 1971-78 - Behind the power curve
• Second generation 1979-85 - Becoming "real" computers
• Third generation 1985-89 - Challenging the "establishment"
• Fourth generation 1990- - Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today we generally mean the shaded area of the figure
The First Generation 1971-78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture 8-16 bits
- Narrow datapaths (= slow performance)
- Awkward architectures
- Assembly language + some BASIC
• Processors
- Intel 4004, 8008, 8080, 8086
- Zilog Z-80
- Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
- 3500 transistors
- First microprocessor-based computer (Micral)
- Targeted at laboratory instrumentation
- Mostly sold in Europe
Intel 8080
• Intel's first architecture with 16-bit addressing (8-bit data)
- Delivered in 1974
- 4800 transistors
- Performance < 0.2 MIPS
• Used in Altair 8800 system
- Kit form (advertised in Popular Electronics) in 1975
- $297, or $395 with case
- 256 bytes of memory, expandable to 64K
- Keyboard and floppy
- 100-line bus becomes S-100, the first microcomputer bus
- Gates & Allen write BASIC
- Wozniak builds one
• Homebrew Computer Club
Intel 8086
• Introduced in 1978
- Performance < 0.5 MIPS
• New 16-bit architecture
- "Assembly language" compatible with 8080
- 29,000 transistors
- Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
- Based on 8088, the 8-bit bus version of 8086
Second Generation 1979-85
Becoming "real" computers
- First 32-bit architecture (68000)
- First virtual memory support
- Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
- Motorola 68000, 68020
- Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
- First 32-bit architecture, initial 16-bit implementation
- First flat 32-bit address, support for paging
- General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
- 68,000 transistors
- < 1 MIPS
• Used in
- Apple Mac
- Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the "establishment"
- Microprocessors surpass minicomputers in performance, rival mainframes
- Implementation technology of choice: all new architectures are microprocessors
- RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
- MIPS R2000, R3000
- Sun SPARC
- HP PA-RISC
MIPS R2000
• Several firsts
- First RISC microprocessor
- First microprocessor to provide integrated support for instruction & data cache
- First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
- 125,000 transistors
- 5-8 MIPS
Fourth Generation 1990-
Architectural and performance leadership
- First 64-bit architecture
- First multiple-issue machine
- First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
- Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5 - same basic approach but faster clock rates & wider issue
- Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
- True from 1985 to present
• Combination of technology and architectural enhancements
- Technology provides faster transistors and more of them
- Faster transistors lead to high clock rates
- More transistors: architectural ideas turn transistors into performance
- Responsible for about half the yearly performance growth
• Two key architectural directions
- Sophisticated memory hierarchies
- Exploiting instruction level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
- CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
- On-chip: from 128 bytes (1984) to 100K+ bytes
- Multilevel caches add another level of caching
- First multilevel cache: 1986
- Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
- Reduce or hide cache miss latencies
- early restart after cache miss (1992)
- nonblocking caches continue during a cache miss (1994)
- Cache-aware combos: computers, compilers, code writers
- prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
- Overlapping execution in a pipeline
- Issuing multiple instructions per clock
- superscalar uses dynamic issue decision (HW driven)
- VLIW uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple issue microprocessors
• 1995: sophisticated dynamic schemes
- determine parallelism dynamically
- execute instructions out-of-order
- speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
- On-chip
- Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
- Deep pipeline
- 1.4M transistors
- Initially 100 MHz
- > 50 MIPS
Intel i860
• First multiple issue microprocessor
- 2 instructions/clock
- Dual issue mode
- Novel push pipeline
- Novel cache bypass
• Implemented in 1991
- 1.3M transistors
- 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
- Instructions scheduled and executed out-of-order
- Up to 4 instructions can complete per clock
- Window of 32 instructions (up to 32 in-flight)
- Maintains precise state by completing instructions in order
• Implemented in 1996
- 6.8M transistors
- 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
- Use compiler-centric approach while avoiding disadvantages
- Parallelism demarcated by the compiler
- Many special instructions & features for exploiting ILP in the compiler
• Itanium
- First implementation (2001)
- 25M transistors
- 800 MHz
- 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software Techniques
- Static scheduling
- Static issue (i.e. VLIW)
- Static branch prediction
- Alias/pointer analysis
- Static speculation
Hardware Techniques
- Dynamic scheduling
- Dynamic issue (i.e. superscalar)
- Dynamic branch prediction
- Dynamic disambiguation
- Dynamic speculation
Wide variety of approaches, both hardware and compiler intensive
Software techniques: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware techniques: more stable performance, higher complexity, potential clock rate impact
No clear cut winners at present
Big Picture - ILP and Memory Systems
[Figure: "ILP Mountain" with the stages simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling and speculation, and "Cache Mountain" with the stages simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching and multipath prefetching]
My view
• No performance wall but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (logarithmic scale from 0.1 to 100) of supercomputers, mainframes, minicomputers and microprocessors versus year, 1965-1995]
Microprocessors today - where they are and what they can do
[Figure: transistors per chip (logarithmic scale from 1,000 to 100,000,000) versus year, 1970-2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000, with eras of bit-level, instruction-level and thread-level parallelism]
Microprocessors - where they go
Intel more Transistor
Intel Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The more pronounced are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
They support 2-4 threads, but expect only about a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa-2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute:
– independent programs
– explicitly parallel programs (when possible)
– speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view on the question: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now: some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Performance and new technology generation
According to Moore's law, each new generation has approximately doubled logic circuit density and increased performance by about 40%, while quadrupling memory capacity.
The increase in components per chip comes from the following key factors:
• The factor of two in component density comes from a 2^0.5 shrink in each lithography dimension (2^0.5 per x and 2^0.5 per y).
• An additional factor of 2^0.5 comes from an increase in chip area.
• A final factor of 2^0.5 comes from device and circuit cleverness.
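The decomposition of the per-generation gain can be checked numerically. The following is a minimal sketch in Python, assuming each of the four factors is 2^0.5 (the square root of two); the constant and function names are illustrative and do not come from the slides.

```python
import math

# Assumed per-generation factors, each equal to sqrt(2).
SHRINK_X = math.sqrt(2)      # lithography shrink in the x dimension
SHRINK_Y = math.sqrt(2)      # lithography shrink in the y dimension
AREA_GROWTH = math.sqrt(2)   # increase in chip area
CLEVERNESS = math.sqrt(2)    # device and circuit cleverness


def density_gain():
    """Component density gain from the lithography shrink alone."""
    return SHRINK_X * SHRINK_Y


def components_per_chip_gain():
    """Overall gain in components per chip for one generation."""
    return density_gain() * AREA_GROWTH * CLEVERNESS


print(f"density gain:             {density_gain():.2f}x")
print(f"components per chip gain: {components_per_chip_gain():.2f}x")
```

Under these assumptions the shrink alone doubles density, and the two remaining factors bring the total to the quadrupling quoted for memory capacity.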
Development in ICs
Semiconductor Industry Association Roadmap Summary for high-end Processors
Specification / year | 1997 | 1999 | 2001 | 2003 | 2006 | 2009 | 2012
Feature size (micron) | 0.25 | 0.18 | 0.15 | 0.13 | 0.10 | 0.07 | 0.05
Supply voltage (V) | 1.8-2.5 | 1.5-1.8 | 1.2-1.5 | 1.2-1.5 | 0.9-1.2 | 0.6-0.9 | 0.5-0.6
Transistors/chip (millions) | 11 | 21 | 40 | 76 | 200 | 520 | 1400
DRAM bits/chip (mega) | 167 | 1070 | 1700 | 4290 | 17200 | 68700 | 275000
Die size (mm^2) | 300 | 340 | 385 | 430 | 520 | 620 | 750
Global clock freq. (MHz) | 750 | 1200 | 1400 | 1600 | 2000 | 2500 | 3000
Local clock freq. (MHz) | 750 | 1250 | 1500 | 2100 | 3500 | 6000 | 10000
Maximum power/chip (W) | 70 | 90 | 110 | 130 | 160 | 170 | 175
[Figure: total transistors per chip (millions, 0 to 1600) versus technology (micron) and year, from 0.25 micron in 1997 to 0.05 micron in 2012]
Clock Frequency Versus Year for Various Representative Machines
Limits in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range.
For higher-frequency clocking (>10 GHz), new ideas and new ways of designing digital systems are needed.
[Figure: global and local clock frequency (MHz, 0 to 12000) versus technology (micron) and year, from 0.25 micron in 1997 to 0.05 micron in 2012]
Intel's Microprocessors Clock Frequency
Technology Trends - Example
As an illustration of just how quickly computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly.
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987 and at 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span from 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 ≈ 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l.
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 ≈ 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in about 39 minutes, and to make the round trip on less than 20 liters of gasoline.
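The arithmetic of the car example can be reproduced with a few lines of Python. This is only a sketch: the improvement rates and 1977 starting values are taken from the example, while the mean earth-moon distance of 384,400 km and all names are assumptions added here. Because the code does not round the intermediate factors, its results differ slightly from the rounded figures quoted in the solution.

```python
def improve(value, rate, years):
    """Compound `value` at `rate` (e.g. 0.35 for 35%) for `years` years."""
    return value * (1.0 + rate) ** years


# 1977 starting point from the example: km/h and km/l.
speed_1977, economy_1977 = 150.0, 10.0

speed_1987 = improve(speed_1977, 0.35, 10)
economy_1987 = improve(economy_1977, 0.35, 10)
speed_2000 = improve(speed_1987, 0.50, 13)
economy_2000 = improve(economy_1987, 0.50, 13)

# Assumption (not from the slides): mean earth-moon distance in km.
EARTH_MOON_KM = 384_400

minutes_to_moon = EARTH_MOON_KM / speed_2000 * 60
liters_round_trip = 2 * EARTH_MOON_KM / economy_2000

print(f"1987: {speed_1987:,.0f} km/h, {economy_1987:,.0f} km/l")
print(f"2000: {speed_2000:,.0f} km/h, {economy_2000:,.0f} km/l")
print(f"one-way trip to the moon: {minutes_to_moon:.1f} min")
print(f"round trip fuel: {liters_round_trip:.1f} l")
```

With the assumed distance, the one-way trip comes out at roughly 39 minutes and the round trip at a little under 20 liters.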
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law.
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available.
[Figure: trends in feature size over time]
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm.
Ideally, processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7.
With such scaling, typical improvement figures are the following:
• 1.4-1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
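Two of the improvement figures can be related to the scaling factor with simple first-order arithmetic. The sketch below assumes a linear scaling factor of 0.7, a voltage reduction of 1.35 times, and that load capacitance scales linearly with the dimensions; the last assumption is a simplification introduced here, not a statement from the slides, and the names are illustrative.

```python
S = 0.7               # assumed linear scaling factor per generation
VOLTAGE_DROP = 1.35   # assumed reduction factor of the operating voltage


def area_reduction(s=S):
    """Factor by which transistor area shrinks when both dimensions scale by s."""
    return 1.0 / (s * s)


def switching_energy_reduction(s=S, v=VOLTAGE_DROP):
    """Reduction in C*V^2 per switching event.

    Assumes capacitance scales linearly with s (a first-order simplification).
    """
    return (v * v) / s


print(f"area reduction:             {area_reduction():.2f}x")
print(f"switching energy reduction: {switching_energy_reduction():.2f}x")
```

Under these assumptions the area shrinks by about a factor of two and the energy per switching event by about 2.6, which is in the neighborhood of the "three times" figure; the exact value depends on how capacitance and frequency are assumed to scale.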
Processor Technology and Microprocessors
Process technology is the most important technology driving the microprocessor industry.
It has grown 1000 times in frequency (from 1 MHz to 1 GHz) and in integration (from ~10K to ~10M devices) in 25 years.
Microarchitecture attempts to increase both IPC and frequency.
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction, and out-of-order execution can increase instructions per cycle (IPC).
Pipelining, as a microarchitecture idea, helps to increase frequency.
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program.
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages.
With frequencies higher than 1 GHz, more than 20 pipeline stages are used.
[Figure: relative improvement in frequency, CPI, performance, and power versus pipeline depth]
Performance of memory and CPU
Memory in a computer system is hierarchically organized.
In 1980, microprocessors were often designed without caches.
Nowadays microprocessors often come with two levels of caches.
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987.
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed.
Relative processor/memory speed
Types of Memories
[Figure: classification of MOS memories. RAMs (DRAM, SRAM) are volatile: contents are lost at power-off. ROMs (ROM, EPROM, EEPROM, FLASH) are non-volatile: contents are kept at power-off.]
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2-4.5 CPU cycles per instruction retired.
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed.
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse.
An anecdote (continued)
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams.
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy.
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns.
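The pin-bandwidth arithmetic in this anecdote is easy to verify. The following sketch uses only the figures quoted in the text (16 bytes every 8 ns, and three factors of two); the function name is illustrative.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer, ns_per_transfer):
    """Bytes per nanosecond is numerically equal to GBytes/s (10^9 bytes/s)."""
    return bytes_per_transfer / ns_per_transfer


today = bandwidth_gbytes_per_s(16, 8)

# Twice the clock rate, twice the issue width, two instruction streams.
future = today * 2 * 2 * 2

print(f"today:  {today:.0f} GBytes/s")
print(f"future: {future:.0f} GBytes/s")
print(f"32 bytes every 2 ns: {bandwidth_gbytes_per_s(32, 2):.0f} GBytes/s")
```

The three doublings multiply rather than add, which is why the requirement grows eightfold rather than by a factor of six.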
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance.
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components.
Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data.
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory.
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip.
The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses.
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.
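The effect of a two-level cache can be illustrated with the standard average-memory-access-time formula. The sketch below is hedged: it picks single values inside the quoted ranges (3 cycles and a 95% hit rate for the first level, 8 cycles and a 50% hit rate for the second) and assumes a main-memory access of about 100 cycles; the function and parameter names are illustrative.

```python
def average_access_cycles(l1_cycles, l1_hit_rate, l2_cycles, l2_hit_rate, mem_cycles):
    """Average cycles per access for a two-level cache hierarchy.

    Every access pays the L1 time; L1 misses additionally pay the L2 time;
    L2 misses additionally pay the main-memory time.
    """
    l1_miss = 1.0 - l1_hit_rate
    l2_miss = 1.0 - l2_hit_rate
    return l1_cycles + l1_miss * (l2_cycles + l2_miss * mem_cycles)


# Assumed mid-range values; the memory latency of 100 cycles is also an assumption.
amat = average_access_cycles(3, 0.95, 8, 0.50, 100)
print(f"average access time: {amat:.1f} cycles (versus about 100 with no caches)")
```

With these assumed values the average access costs about six cycles, which shows how strongly the first-level hit rate dominates the result.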
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles.
A cache miss that eventually has to go to main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
Conclusions concerning memory: problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems: caches, buses, bandwidth, and latency.
Conclusions concerning memory: problems (continued)
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change.
b) Burst width, which refers to data access granularity, can effect a 15% performance change.
c) Magnetic RAM (MRAM): a new type of memory
Magnetic RAM (MRAM) or Nanotech RAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes.
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
MRAM is nonvolatile and has the fast read-and-write performance of static RAM (SRAM).
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter on a silicon wafer.
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule (nm, 0 to 120; marked points at 100 nm, 90 nm, 70 nm, and 55 nm) and density (Gb, log scale 0.1 to 100) versus year, 2003-2010, for FLASH and DRAM; labeled densities reach 16 Gb, with SLC, MLC (2 bits/cell), and MLC (3 bits/cell) flash]
Surpassing the Prediction from Moore's Law: DRAM vs MRAM
The famous Moore's law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
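The difference between the two doubling periods compounds quickly. A minimal sketch, assuming the usual 1.5-year (18-month) doubling period for Moore's law and a one-year period for the flash growth model; the six-year horizon and the function name are chosen here only for illustration.

```python
def growth(years, doubling_period_years):
    """Density multiple after `years` when density doubles every period."""
    return 2.0 ** (years / doubling_period_years)


YEARS = 6  # illustrative horizon, not taken from the slides

moore = growth(YEARS, 1.5)   # doubling every 1.5 years
flash = growth(YEARS, 1.0)   # doubling every year

print(f"after {YEARS} years: Moore's law {moore:.0f}x, yearly doubling {flash:.0f}x")
```

Over the assumed six years the yearly-doubling model ends up four times ahead of the 1.5-year model, and the ratio keeps widening with time.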
[Figure: density (Gb, log scale 0.1 to 100) versus year, 2000-2010, for FLASH (two-fold density increase per year, with MLC at 2 bits/cell and 3 bits/cell) and DRAM, compared with the Moore's law trend]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will keep leading overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore's law.
Overall memory prediction roadmap (continued)
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline: Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point?
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh.
• During 2000, the energy consumption of all PCs installed in the USA was 10% of total energy production.
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh.
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai, 95)
"By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: power (W, log scale 0.01 to 100) versus year, 1980-1995; the trend is x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage (V, 0 to 2.5), power per chip (W, 0 to 200), and VDD current (A, 0 to 500) versus year, 1998-2014]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA), and the Taiwan Semiconductor Industry Association (TSIA). (Taken from Sakurai's ISSCC 2001 presentation.)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits, plus a minor static term, are
$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$
where
P1 – capacitive switching power (dynamic, dominant)
P2 – short-circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
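The average-power expression can be evaluated term by term. The sketch below follows the decomposition P1 + P2 + P3 + P4 given above; all numeric values in the usage example are hypothetical and chosen only to show that the switching term dominates, and the function and parameter names are illustrative.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p_static=0.0):
    """Return (P1, P2, P3, P4, total) in watts.

    p_t    : switching activity factor (0..1)
    c_load : switched load capacitance in farads
    v_dd   : supply voltage in volts
    f_clk  : clock frequency in hertz
    i_sc   : average short-circuit current in amperes
    i_leak : leakage current in amperes
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                         # short-circuit power
    p3 = i_leak * v_dd                       # leakage power
    p4 = p_static                            # other static dissipation
    return p1, p2, p3, p4, p1 + p2 + p3 + p4


# Hypothetical operating point: 1 nF switched capacitance, 1.5 V, 1 GHz.
p1, p2, p3, p4, total = average_power(
    p_t=0.1, c_load=1e-9, v_dd=1.5, f_clk=1e9, i_sc=0.01, i_leak=0.005
)
print(f"P1={p1:.4f} W  P2={p2:.4f} W  P3={p3:.4f} W  P4={p4:.4f} W  total={total:.4f} W")
```

Because P1 depends on the square of the supply voltage, halving V_dd in this sketch would cut the switching term to a quarter while only halving the two current-driven terms.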
Research Efforts in Low-Power Design
$P_{sw} = p_t C_L V_{dd}^2 f_{clk}$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement.
– Reducing the load capacitance improves both power dissipation and circuit speed.
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V, 0.6 to 5.0)]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is:
Decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
General Approaches to Reduce Power
a) Reduction in f_CLK is an option that is acceptable when some components may be idle or low-active during operation.
b) Reduction in V_dd is the most effective way to reduce power, since power is proportional to the square of V_dd. The problem with reducing V_dd is that it leads to an increase in circuit delay.
c) The product p_t C_L is called the average switched capacitance per cycle; this capacitance can be reduced at the system, architectural, RTL, circuit, or technology level.
Low Power and Low Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology (higher levels offer higher impact and more options):
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
Multiple Frequencies on the Chip as a Technique to Reduce Power
This is a less aggressive approach that is attracting more attention.
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL distributes fCLK to four local PLL/DLL pairs (PLL 1/DLL 1 through PLL 4/DLL 4), each generating its own local frequencies (f11, f12, and so on)]
Energy Minimization Using Multiple Frequencies
[Figure: PLL-based clock generation (phase detector, current pump, loop filter, VCO, divider by N, CLKREF and CLKFB inputs, regulated voltage, clock distribution & frequency multiplier) and DLL-based clock generation (phase detector, current pump, loop filter, control, VCDL producing f1, 2f1, ..., nf1), each driving a digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating, in which a latch and the enable signal produce a gated clock for the target flip-flops (activated or deactivated); clock distribution, in which a PLL (clock generator) feeds blocks A, B, C, and D through Enable_A, Enable_B, and Enable_C]
– The use of gated clocks is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
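The effect of suppressing the clock to an unused module can be illustrated with a toy simulation. This is a sketch under simplifying assumptions: the clock and enable signals are sampled lists of 0/1 values, gating is modeled as a plain AND, and the glitch-avoiding latch used in practical gating cells is not modeled. All names and the waveform are hypothetical.

```python
def gated_clock(clock, enable):
    """AND-style gating: the module sees a clock pulse only while enabled."""
    return [c & e for c, e in zip(clock, enable)]


def rising_edges(signal):
    """Count 0 -> 1 transitions, a rough proxy for clocked switching activity."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)


# Hypothetical waveform: 8 clock cycles, module idle during the middle 4 cycles.
clock = [0, 1] * 8
enable = [1] * 4 + [0] * 8 + [1] * 4

gated = gated_clock(clock, enable)
print(f"clock edges without gating: {rising_edges(clock)}")
print(f"clock edges with gating:    {rising_edges(gated)}")
```

In this toy waveform the module is idle for half of the cycles, so gating removes half of the clock edges it would otherwise receive, and with them the switching they would cause in its flip-flops.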
Energy Minimization Using Multiple Supply Voltages
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
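[Figure: power manager, in which an observer gathers workload information and observations from the system and a controller issues commands. Power state machine: RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt), and SLEEP (P = 0.16 mW, wait for wake-up event), with transition times of about 10 µs and 90 µs between states and 160 ms to wake from SLEEP to RUN]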
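A power manager has to decide whether an idle period is long enough to justify entering the deepest state. The sketch below compares the energy of staying in IDLE with that of going to SLEEP, assuming the state powers read RUN = 400 mW, IDLE = 50 mW, SLEEP = 0.16 mW and a 160 ms wake-up time. It further assumes, as a guess not stated in the slides, that the wake-up transition consumes RUN power, and it ignores the much shorter time needed to enter SLEEP. The names are hypothetical.

```python
# Assumed state powers in milliwatts and wake-up latency in seconds.
P_RUN_MW, P_IDLE_MW, P_SLEEP_MW = 400.0, 50.0, 0.16
WAKEUP_S = 0.160


def energy_idle_mj(t_s):
    """Energy (mJ) spent waiting `t_s` seconds in the IDLE state."""
    return P_IDLE_MW * t_s


def energy_sleep_mj(t_s):
    """Energy (mJ) for sleeping through `t_s` seconds, wake-up included.

    Assumes the wake-up transition burns RUN power; this is a guess.
    """
    return P_SLEEP_MW * max(t_s - WAKEUP_S, 0.0) + P_RUN_MW * WAKEUP_S


def should_sleep(t_s):
    """True if sleeping is cheaper than idling for an idle period of t_s seconds."""
    return energy_sleep_mj(t_s) < energy_idle_mj(t_s)


for t in (0.5, 1.0, 2.0, 5.0):
    print(f"idle period {t:>4.1f} s -> sleep pays off: {should_sleep(t)}")
```

Under these assumptions the break-even idle period is a little over one second, which is why such policies depend on predicting how long the system will remain idle.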
Power Breakdown in a High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown across clock, memory, control, I/O, and datapath. Dynamic instruction statistics across compare, logical, data move, control flow, arithmetic, and other operations.]
Architecture Trade-offs: Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline: Microprocessors' Generations
• First generation, 1971-78: behind the power curve
• Second generation, 1979-85: becoming "real" computers
• Third generation, 1985-89: challenging the "establishment"
• Fourth generation, 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation: 1971-78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  – Narrow datapaths (= slow performance)
  – Awkward architectures
  – Assembly language + some BASIC
• Processors
  – Intel 4004, 8008, 8080, 8086
  – Zilog Z-80
  – Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose, single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  – 3,500 transistors
  – First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• 8-bit architecture with 16-bit addressing
  – Delivered in 1974
  – 4,800 transistors
  – Performance < 0.2 MIPS
• Used in Altair 8800 system
  – Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, the first microcomputer bus
  – Gates & Allen write BASIC
  – Wozniak builds one
    Homebrew Computer Club
Intel 8086
• Introduced in 1978
  – Performance < 0.5 MIPS
• New 16-bit architecture
  – "Assembly language" compatible with 8080
  – 29,000 transistors
  – Includes memory protection, support for FP coprocessor
• In 1981, IBM introduces the PC
  – Based on the 8088, an 8-bit bus version of the 8086
Second Generation: 1979-85
Becoming "real" computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  – Motorola 68000, 68020
  – Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  – First 32-bit architecture (initial 16-bit implementation)
  – First flat 32-bit address; support for paging
  – General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
  – 68,000 transistors
  – < 1 MIPS
• Used in
  – Apple Mac
  – Sun, Silicon Graphics & Apollo workstations
Third Generation, 1985-89
Challenging the “establishment”
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
Transistors < 500K; performance > 5 MIPS
Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation, 1990-
Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
Transistors > 1M; clock rates > 100 MHz; performance > 50 MIPS
Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
Generation 4.5: same basic approach, but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
Increase performance at 1.6x per year
– True from 1985 to the present
Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to high clock rates
– More transistors: architectural ideas turn transistors into performance
– Architectural ideas are responsible for about half the yearly performance growth
Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism
Memory Hierarchies
Caches hide latency of DRAM and increase bandwidth
– CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching
First multilevel cache: 1986
Secondary cache sizes today: 128 KB to 4-16 MB
Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
early restart after cache miss (1992)
nonblocking caches continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
prefetching: an instruction to bring data into the cache early
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
superscalar: uses dynamic issue decision (HW driven)
VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out-of-order
– speculative execution depending on branch prediction
“Off-the-shelf” ILP techniques yielded a 20-year path
MIPS R4000
First 64-bit architecture
Integrated caches
– On-chip
– Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
Implemented in 1991
– 1.3M transistors
– 50 MIPS
Used primarily as an attached processor (e.g. graphics)
MIPS R10000
First speculative processor
– Instructions scheduled and executed out-of-order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in-flight)
– Maintains precise state by completing instructions in order
Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
EPIC architecture
– Uses a compiler-centric approach while avoiding its disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today’s Uniprocessor ILP Menu
Software Techniques
– Static scheduling
– Static issue (i.e. VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
Properties: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e. superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
Properties: more stable performance, higher complexity, potential clock rate impact
Wide variety of approaches, both hardware and compiler intensive
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two “mountains” of techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers, and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistor count (log scale, 1,000 to 100,000,000) versus year, 1970-2005, marking the eras of bit-level, instruction-level, and thread-level (?) parallelism. Processors plotted: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000.]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of Transistors in Intel’s processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP).
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyper-Threading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today’s processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4).
They support 2-4 threads, but one should expect only about a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa-2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say “microprocessor” tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that “Where you stand depends on where you sit.”
In this context, starting from our positions and experiences, this is our view on the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today’s graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now: some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. They too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books, and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Development in ICs
Semiconductor Industry Association Roadmap Summary for high-end Processors
Specification / year | 1997 | 1999 | 2001 | 2003 | 2006 | 2009 | 2012
Feature size (micron) | 0.25 | 0.18 | 0.15 | 0.13 | 0.1 | 0.07 | 0.05
Supply voltage (V) | 1.8-2.5 | 1.5-1.8 | 1.2-1.5 | 1.2-1.5 | 0.9-1.2 | 0.6-0.9 | 0.5-0.6
Transistors/chip (millions) | 11 | 21 | 40 | 76 | 200 | 520 | 1400
DRAM bits/chip (mega) | 167 | 1070 | 1700 | 4290 | 17200 | 68700 | 275000
Die size (mm2) | 300 | 340 | 385 | 430 | 520 | 620 | 750
Global clock freq (MHz) | 750 | 1200 | 1400 | 1600 | 2000 | 2500 | 3000
Local clock freq (MHz) | 750 | 1250 | 1500 | 2100 | 3500 | 6000 | 10000
Maximum power/chip (W) | 70 | 90 | 110 | 130 | 160 | 170 | 175
Total transistors/chip
[Figure: number of transistors per chip (millions, 0 to 1,600) versus technology (0.25 to 0.05 micron) and year (1997 to 2012).]
Clock Frequency Versus Year for Various Representative Machines
Limits in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range.
For higher-frequency clocking (> 10 GHz), new ideas and new ways of designing digital systems are needed.
[Figure: global and local clock frequency (MHz, 0 to 12,000) versus technology (0.25 to 0.05 micron) and year (1997 to 2012).]
Intel’s Microprocessors Clock Frequency
Technology Trends - Example
As an illustration of just how fast computer technology is improving, let’s consider what would have happened if automobiles had improved equally quickly.
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987, and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3,015 km/h and a fuel economy of 201 km/l.
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. Taking the mean earth-moon distance as about 384,400 km, this is fast enough to cover the distance from the earth to the moon in about 39 min and to make the round trip on less than 20 liters of gasoline.
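The compound-growth arithmetic of this example can be checked with a short Python sketch. The function name `compound` is an illustrative choice, not part of the original slides; the rates and starting values are the ones stated in the example. The script carries full precision, so its results differ slightly from the slide figures, which use the rounded factors 20.1 and 194.6.

```python
def compound(value, rate, years):
    """Grow `value` by a fixed fractional `rate` per year for `years` years."""
    return value * (1.0 + rate) ** years

# Starting point from the example: 150 km/h top speed, 10 km/l fuel economy.
speed_1977, economy_1977 = 150.0, 10.0

# 1977-1987: 35% per year for 10 years.
speed_1987 = compound(speed_1977, 0.35, 10)
economy_1987 = compound(economy_1977, 0.35, 10)

# 1987-2000: 50% per year for 13 years, starting from the 1987 values.
speed_2000 = compound(speed_1987, 0.50, 13)
economy_2000 = compound(economy_1987, 0.50, 13)

print(f"1987: {speed_1987:.0f} km/h, {economy_1987:.0f} km/l")
print(f"2000: {speed_2000:.0f} km/h, {economy_2000:.0f} km/l")
```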
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a “roadmap” based on the idea of Moore’s law.
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available.
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 μm = 100 nm.
Ideally, processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7.
With such scaling, typical improvement figures are the following:
• 1.4-1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
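The improvement figures quoted for a ~0.7 linear shrink can be related to each other with a few lines of Python. This is only a first-order sketch: the function name is hypothetical, and the assumption that switched capacitance scales with the linear dimension (so that energy per switching event goes as C·V²) is a textbook simplification, not something stated on the slide.

```python
def scaling_figures(s=0.7, voltage_reduction=1.35):
    """First-order consequences of shrinking all dimensions by factor `s`.

    Assumption (not from the slides): gate delay and switched capacitance
    both scale linearly with `s`.
    """
    speedup = 1.0 / s                  # faster transistors
    area_factor = s * s                # transistor area relative to before
    v_factor = 1.0 / voltage_reduction # supply voltage relative to before
    energy_factor = s * v_factor ** 2  # C * V^2 per switching event
    return speedup, area_factor, energy_factor

speedup, area_factor, energy_factor = scaling_figures()
print(f"speedup            ~ {speedup:.2f}x")
print(f"area               ~ {area_factor:.2f} of original")
print(f"switching energy   ~ 1/{1 / energy_factor:.1f} of original")
```

Under these assumptions the model gives roughly 1.4x speed, half the area, and a switching-energy reduction in the neighbourhood of the "three times" figure in the list above.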
Process Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry.
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and in integration (from ~10K to ~10M devices) in 25 years.
Microarchitecture attempts to increase both IPC and frequency.
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction, and out-of-order execution can increase instructions per cycle (IPC).
Pipelining, as a microarchitecture idea, helps to increase frequency.
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program.
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages.
With frequencies higher than 1 GHz, more than 20 pipeline stages are used.
[Figure: relative improvement of frequency, CPI, performance, and power versus pipeline depth.]
Performance of memory and CPU
Memory in a computer system is hierarchically organized.
In 1980, microprocessors were often designed without caches.
Nowadays, microprocessors often come with two levels of caches.
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance has improved 55% per year since 1987, and improved 35% per year until 1986.
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed.
Relative processor/memory speed
Types of Memories
MOS memories
– RAMs (volatile: contents lost at power-off): DRAM, SRAM
– ROMs (nonvolatile: contents kept at power-off): ROM, EPROM, EEPROM, FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2-4.5 CPU cycles per instruction retired.
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed.
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse.
An anecdote (continued)
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams.
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy.
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns.
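The pin-bandwidth figures in this anecdote follow from simple multiplication, as the Python sketch below shows. The helper name `bandwidth_gb_per_s` is illustrative only; the byte counts, intervals, and the three factors of two are those given in the text.

```python
def bandwidth_gb_per_s(bytes_per_transfer, interval_ns):
    """Bytes per nanosecond is numerically equal to gigabytes per second."""
    return bytes_per_transfer / interval_ns

# Today: 16 bytes every 8 ns.
today = bandwidth_gb_per_s(16, 8)

# Future chip: twice the clock rate, twice the issue width, two streams.
future = today * 2 * 2 * 2

# The same requirement expressed as 32 bytes every 2 ns.
future_as_transfers = bandwidth_gb_per_s(32, 2)

print(today, future, future_as_transfers)
```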
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance.
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components.
Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data.
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory.
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip.
The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses.
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles.
A cache miss that eventually has to go to main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
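To see how strongly the hierarchy affects performance, the cache figures quoted in this part of the talk (a first level catching about 95% of accesses, a second level catching about 50% of first-level misses, and about 100 cycles to main memory) can be plugged into the standard average-memory-access-time formula. The sketch below is illustrative only: the function name is hypothetical, and the specific latencies of 2 and 8 cycles are assumed values picked from within the quoted ranges.

```python
def average_access_time(t_l1, miss_l1, t_l2, miss_l2, t_mem):
    """Average memory access time in cycles for a two-level cache.

    Each miss rate is local: the fraction of accesses reaching that
    level which have to go on to the next level.
    """
    return t_l1 + miss_l1 * (t_l2 + miss_l2 * t_mem)

# Assumed values within the ranges quoted in the slides.
with_l2 = average_access_time(t_l1=2, miss_l1=0.05, t_l2=8, miss_l2=0.5, t_mem=100)

# Same first-level cache, but every miss goes straight to main memory.
without_l2 = 2 + 0.05 * 100

print(f"with L2:    {with_l2:.1f} cycles per access")
print(f"without L2: {without_l2:.1f} cycles per access")
```

With these assumed numbers the second-level cache cuts the average access time from about 7 cycles to about 5, even though it catches only half of the first-level misses.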
Conclusions concerning memory: problems
Today’s chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems: caches, buses, bandwidth, and latency.
Conclusions concerning memory: problems (continued)
If the memory research community would follow the microprocessor community’s lead by leaning more heavily on architecture-level and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change.
b) Burst width, which refers to data access granularity, can effect a 15% performance change.
c) Magnetic RAM (MRAM): a new type of memory
Magnetic RAM (MRAM) or Nanotech RAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes.
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter on a silicon wafer.
MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM).
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule (nm, 0 to 120) and density (Gb, log scale 0.1 to 100) versus year, 2003-2010, for FLASH and DRAM. Design rules shown: 100 nm, 90 nm, 70 nm, 55 nm. Densities range from 512 Mb to 16 Gb, with SLC, MLC (2 bits/cell), and MLC (3 bits/cell) flash variants.]
Surpassing the Prediction from Moore’s Law: DRAM vs MRAM
The famous Moore’s Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
[Figure: density (Gb, log scale 0.1 to 100) versus year, 2000-2010, comparing FLASH (twofold density per year, including MLC with 2 and 3 bits/cell) and DRAM against the Moore’s Law trend line.]
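The difference between doubling every 1.5 years and doubling every year compounds quickly. The Python sketch below compares the two growth models over a six-year span; the function name and the six-year horizon are arbitrary illustrative choices, not values from the slides.

```python
def density_growth(years, doubling_period_years):
    """Growth factor after `years` if density doubles every `doubling_period_years`."""
    return 2 ** (years / doubling_period_years)

years = 6  # arbitrary horizon chosen for illustration

moore = density_growth(years, 1.5)  # doubling every 1.5 years
flash = density_growth(years, 1.0)  # doubling every year

print(f"After {years} years: {moore:.0f}x at 1.5-year doubling, {flash:.0f}x at 1-year doubling")
```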
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep leading overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore’s Law.
Overall memory prediction roadmap (continued)
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point?
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh.
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production.
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh.
Power consumption
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
Typical Low-Power Applications
“CMOS circuits dissipate little power by nature. So believed circuit designers.” (Kuroda-Sakurai, 95)
“By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced.”
[Figure: power (W, log scale 0.01 to 100) versus year, 1980-1995, showing a trend of x4 every 3 years.]
Power dissipation in time
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage (V, 0 to 2.5), power per chip (W), and VDD current (A) versus year, 1998-2014.]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA), and the Taiwan Semiconductor Industry Association (TSIA). (Taken from Sakurai’s ISSCC 2001 presentation.)
Power Delivery Problem (not just California)
Your car starter
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are captured by
$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 – capacitive switching power (dynamic, dominant)
P2 – short-circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
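The power breakdown above, with P1 = pt·CL·Vdd²·fclk (switching), P2 = Isc·Vdd (short circuit), P3 = Ileakage·Vdd (leakage), and the minor static term P4, can be evaluated numerically. The Python sketch below is a minimal illustration: the function and parameter names are hypothetical, and all component values are made-up round numbers, not measurements from the slides.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p_static=0.0):
    """Return (P1, P2, P3, P4, total) in watts for a CMOS circuit.

    p_t    : switching activity factor (transitions per clock cycle)
    c_load : switched load capacitance in farads
    v_dd   : supply voltage in volts
    f_clk  : clock frequency in hertz
    i_sc   : average short-circuit current in amperes
    i_leak : leakage current in amperes
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk  # capacitive switching power
    p2 = i_sc * v_dd                       # short-circuit power
    p3 = i_leak * v_dd                     # leakage power
    p4 = p_static                          # other static dissipation
    return p1, p2, p3, p4, p1 + p2 + p3 + p4

# Hypothetical operating point chosen only for illustration.
p1, p2, p3, p4, total = average_power(
    p_t=0.2, c_load=1e-9, v_dd=1.8, f_clk=1e9, i_sc=0.02, i_leak=0.01
)
print(f"P1={p1:.3f} W  P2={p2:.3f} W  P3={p3:.3f} W  total={total:.3f} W")

# Halving Vdd cuts the switching term to one quarter.
p1_half = average_power(0.2, 1e-9, 0.9, 1e9, 0.02, 0.01)[0]
```

In this example the switching term dominates the total, which matches the "dominant" label attached to P1.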
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reduction in Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V).]
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Needs for Low-Power
Low-Power Design Techniques
The basic idea is:
decreasing the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product pt·CL is called the average switched capacitance per cycle. This capacitance can be reduced at the system, architectural, RTL, circuit, or technology level.
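The trade-off in item b), where lowering Vdd saves power quadratically but slows the circuit, can be sketched numerically. The delay model used below is the alpha-power law (delay proportional to Vdd / (Vdd − Vt)^α), which is a common textbook approximation and is an assumption here, not something given in the slides; the threshold voltage, the exponent, and the reference voltage are likewise hypothetical values.

```python
def relative_switching_power(v_dd, v_ref=2.5):
    """Switching power relative to the reference supply, at fixed frequency."""
    return (v_dd / v_ref) ** 2

def relative_delay(v_dd, v_ref=2.5, v_t=0.5, alpha=1.3):
    """Gate delay relative to the reference supply (alpha-power law, assumed)."""
    def delay(v):
        return v / (v - v_t) ** alpha
    return delay(v_dd) / delay(v_ref)

for v in (2.5, 2.0, 1.5, 1.25):
    print(f"Vdd={v:4.2f} V  power x{relative_switching_power(v):.2f}  "
          f"delay x{relative_delay(v):.2f}")
```

Under these assumptions, halving the supply from 2.5 V to 1.25 V cuts switching power to a quarter while the gate delay grows by well under a factor of two, which is why voltage reduction is usually paired with parallelism or other ways of recovering throughput.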
General Approaches to Reduce Power
Low Power and Low Energy System Design
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
Higher levels offer higher impact and more options.
The design of low-power circuits can be tackled at different levels, from system to technology.
Multiple Frequency on the Chip as Technique to Reduce Power
A less aggressive approach, which is attracting more attention, is the use of multiple clock frequencies on the chip.
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), each generating its own local clock frequencies (f11, f12, f21, f31, f32, f41)]
Energy Minimization Using Multiple Frequency
[Figure: two clock-generation schemes. PLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO and divider by N, with regulated voltage, clock distribution and frequency multiplier logic feeding the digital system. DLL based: phase detector, current pump, loop filter and a voltage-controlled delay line (VCDL, delay TVCDL) built from delay cells DC1, DC2 ... DCn under a control signal, delivering f1, 2f1 ... nf1 to the digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating, in which an enable signal held in a latch is combined with the clock to form a gated clock that activates or deactivates the target D flip-flops; clock distribution, in which a PLL clock generator supplies Clk to modules A, B, C and D, with Enable_A, Enable_B and Enable_C controlling individual modules]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
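A small behavioural model can make the effect of clock gating concrete. The sketch below assumes the common latch-based arrangement (the enable is sampled while the clock is low and then ANDed with the clock); the function names and the sample waveforms are illustrative assumptions, not taken from the slides.

```python
def gated_clock_trace(clock, enable):
    """Return the gated clock for equal-length 0/1 sequences."""
    latched = 0
    gated = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en          # latch is transparent while the clock is low
        gated.append(clk & latched)  # AND gate: clock passes only when enabled
    return gated

def rising_edges(signal):
    """Count 0 -> 1 transitions, assuming the signal starts from 0."""
    previous = [0] + signal[:-1]
    return sum(1 for p, c in zip(previous, signal) if p == 0 and c == 1)

clock = [0, 1, 0, 1, 0, 1, 0, 1]
enable = [1, 1, 0, 0, 0, 0, 1, 1]
gated = gated_clock_trace(clock, enable)

print(gated)
print(rising_edges(clock), "edges without gating,", rising_edges(gated), "with gating")
```

Every suppressed clock edge is an edge on which the flip-flops of the idle module, and the clock wiring that feeds them, do not switch.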
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip, as a less aggressive approach, is attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
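The benefit of a second supply level can be estimated with the C·Vdd² dependence of switching energy. In the sketch below the two voltage levels, the module list and the capacitance units are all hypothetical values picked for illustration; level-shifter overhead is ignored.

```python
V_HIGH, V_LOW = 2.5, 1.8  # assumed supply levels in volts

# (name, switched capacitance in arbitrary units, on the critical path?)
modules = [
    ("alu", 4.0, True),
    ("decoder", 2.0, False),
    ("io", 2.0, False),
]

def switching_energy(modules, use_low_voltage):
    """Sum C * Vdd^2 over all modules; non-critical ones may use V_LOW."""
    total = 0.0
    for _name, cap, critical in modules:
        vdd = V_LOW if (use_low_voltage and not critical) else V_HIGH
        total += cap * vdd ** 2
    return total

single_supply = switching_energy(modules, use_low_voltage=False)
dual_supply = switching_energy(modules, use_low_voltage=True)
saving = 1.0 - dual_supply / single_supply

print(f"single: {single_supply:.2f}, dual: {dual_supply:.2f}, saving: {saving:.1%}")
```

Only the non-critical modules move to the lower voltage, so timing on the critical path is untouched while roughly a quarter of the energy is saved in this example.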
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure: power manager, made up of an observer and a controller; it receives workload information and observations from the system and issues commands to it]
[Figure: power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt) and SLEEP (P = 0.16 mW, wait for wake-up event); the transitions are labeled ~10 µs, ~90 µs and 160 ms]
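The power state machine lends itself to a quick energy estimate. The sketch below uses the three state powers shown in the figure, reading them as 400 mW, 50 mW and 0.16 mW; the ten-second schedule is a made-up workload, and transition times and transition energy are deliberately ignored.

```python
# State powers in milliwatts, read from the power state machine figure.
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}

def energy_mj(schedule):
    """Energy in millijoules for a list of (state, seconds) intervals."""
    return sum(POWER_MW[state] * seconds for state, seconds in schedule)

# Hypothetical 10 s window: the work itself needs only 2 s of RUN time.
always_run = energy_mj([("RUN", 10.0)])
managed = energy_mj([("RUN", 2.0), ("IDLE", 3.0), ("SLEEP", 5.0)])

print(f"always RUN: {always_run:.1f} mJ, managed: {managed:.1f} mJ")
```

A real power manager also has to weigh the wake-up delay of the SLEEP state against the energy saved, which is the observer/controller problem the slide refers to.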
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown by unit: clock, memory, control, I/O, datapath (segment labels 12, 16, 25, 43). Dynamic instruction statistics by class: arithmetic op, data move, control flow, compare op, logical op, others]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline – Microprocessors' Generations
• First generation, 1971-78: behind the power curve
• Second generation, 1979-85: becoming "real" computers
• Third generation, 1985-89: challenging the "establishment"
• Fourth generation, 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation 1971-78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
• Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
– 3,500 transistors
– First microprocessor-based computer (Micral)
  Targeted at laboratory instrumentation
  Mostly sold in Europe
Intel 8080
• 8-bit architecture with 16-bit addressing
– Delivered in 1974
– 4,800 transistors
– Performance < 0.2 MIPS
• Used in the Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
  $297, or $395 with case
  256 bytes of memory, expandable to 64K
  Keyboard and floppy
  100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
  Homebrew Computer Club
Intel 8086
• Introduced in 1978
– Performance < 0.5 MIPS
• New 16-bit architecture
– "Assembly language" compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
– Based on the 8088, an 8-bit bus version of the 8086
Second Generation 1979-85
Becoming "real" computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
– First 32-bit architecture (initial 16-bit implementation)
– First flat 32-bit address; support for paging
– General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
– 68,000 transistors
– < 1 MIPS
• Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the "establishment"
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
• Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation 1990-
Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
– True from 1985 to the present
• Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to high clock rates
– More transistors: architectural ideas turn transistors into performance
  Responsible for about half the yearly performance growth
• Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide the latency of DRAM and increase bandwidth
– CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching
  First multilevel cache: 1986
  Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: advances in caching techniques
– Reduce or hide cache miss latencies
  Early restart after cache miss (1992)
  Nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
  Prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
  Superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
– Determine parallelism dynamically
– Execute instructions out of order
– Speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
– On-chip
– Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
• First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
• Implemented in 1991
– 1.3M transistors
– 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
– Instructions scheduled and executed out of order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in flight)
– Maintains precise state by completing instructions in order
• Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
• Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques
– Static scheduling
– Static issue (i.e. VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
  Lower hardware complexity; more longer-range analysis; more machine dependence
• Hardware techniques
– Dynamic scheduling
– Dynamic issue (i.e. superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
  More stable performance; higher complexity; potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture--ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970 to 2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000, with eras marked as bit-level, instruction-level and thread-level parallelism]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The more pronounced are:
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
• A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
• It is hard to achieve this level of parallelism from a single program
• Can we run multiple programs (threads) on a single processor without much effort?
• Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
• Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
• Support for 2-4 threads, but expect to get only 1.3X improvement in throughput
Chip Multiprocessor
• Several processor cores on one die
• Shared L2 caches
• Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
• 4 – 8 general-purpose processing engines on chip, used to execute independent programs
• Explicitly parallel programs (when possible); speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit".
In this context, starting from our positions and experiences, this is our view concerning the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education – not just the undergraduate curriculum – organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change – so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that Web-based teaching, distance learning, electronic books and interactive learning environments will play increasingly significant roles in shaping what we teach, how we teach and how students learn.
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Semiconductor Industry Association Roadmap Summary for high-end Processors
Specification / year | 1997 | 1999 | 2001 | 2003 | 2006 | 2009 | 2012
Feature size (micron) | 0.25 | 0.18 | 0.15 | 0.13 | 0.10 | 0.07 | 0.05
Supply voltage (V) | 1.8-2.5 | 1.5-1.8 | 1.2-1.5 | 1.2-1.5 | 0.9-1.2 | 0.6-0.9 | 0.5-0.6
Transistors/chip (millions) | 11 | 21 | 40 | 76 | 200 | 520 | 1400
DRAM bits/chip (mega) | 167 | 1070 | 1700 | 4290 | 17200 | 68700 | 275000
Die size (mm2) | 300 | 340 | 385 | 430 | 520 | 620 | 750
Global clock freq (MHz) | 750 | 1200 | 1400 | 1600 | 2000 | 2500 | 3000
Local clock freq (MHz) | 750 | 1250 | 1500 | 2100 | 3500 | 6000 | 10000
Maximum power/chip (W) | 70 | 90 | 110 | 130 | 160 | 170 | 175
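The roadmap rows can be turned into implied doubling times, which makes them easier to compare with Moore's law. The sketch below uses only the 1997 and 2012 end points of two rows of the table; the helper name is an assumption introduced for the example.

```python
import math

def doubling_time(start_value, end_value, years):
    """Years needed to double, assuming steady exponential growth."""
    return years * math.log(2) / math.log(end_value / start_value)

span = 2012 - 1997

# Transistors per chip: 11 million (1997) to 1400 million (2012).
transistor_doubling = doubling_time(11, 1400, span)

# Local clock frequency: 750 MHz (1997) to 10000 MHz (2012).
local_clock_doubling = doubling_time(750, 10000, span)

print(f"transistors double every {transistor_doubling:.2f} years")
print(f"local clock doubles every {local_clock_doubling:.2f} years")
```

Under these end points the transistor budget doubles roughly every two years, while the local clock frequency needs about four.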
Total transistors/chip
[Figure: bar chart of transistors per chip (millions, axis 0 to 1600) versus technology (micron) / year, from 0.25 micron in 1997 to 0.05 micron in 2012]
Clock Frequency Versus Year for Various Representative Machines
Limits in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range.
For higher-frequency clocking (> 10 GHz), new ideas and new ways of designing digital systems are needed.
[Figure: global and local clock frequency (MHz, axis 0 to 12,000) versus technology (micron) / year, from 0.25 micron in 1997 to 0.05 micron in 2012]
Intel's Microprocessors Clock Frequency
Technology Trends - Example
As an illustration of just how fast computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly.
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987, and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l.
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in about 39 min and to make the one-way trip on less than 10 liters of gasoline (about 20 liters for the round trip).
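The arithmetic of this example is easy to re-check. The sketch below recomputes it without rounding the intermediate factors, so the results differ slightly from the rounded slide values; the mean earth-moon distance of 384,400 km is an assumption added here, not a figure from the slides.

```python
EARTH_MOON_KM = 384_400  # assumed mean earth-moon distance

factor_1987 = 1.35 ** 10   # ten years at 35% per year
factor_2000 = 1.5 ** 13    # thirteen more years at 50% per year

speed_1987 = 150 * factor_1987            # km/h
economy_1987 = 10 * factor_1987           # km/l
speed_2000 = speed_1987 * factor_2000     # km/h
economy_2000 = economy_1987 * factor_2000 # km/l

minutes_to_moon = EARTH_MOON_KM / speed_2000 * 60
liters_one_way = EARTH_MOON_KM / economy_2000
liters_round_trip = 2 * liters_one_way

print(f"1987: {speed_1987:.0f} km/h, {economy_1987:.0f} km/l")
print(f"2000: {speed_2000:.0f} km/h, {economy_2000:.0f} km/l")
print(f"moon: {minutes_to_moon:.1f} min, {liters_one_way:.1f} l one way, "
      f"{liters_round_trip:.1f} l round trip")
```

With this distance the trip takes a little over 39 minutes, and it is the one-way leg, not the round trip, that stays under 10 liters.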
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law.
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available.
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm.
Ideally, processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7.
With such scaling, typical improvement figures are the following:
• 1.4 – 1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
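Two of the listed figures follow directly from the ~0.7 linear scaling factor. The sketch below assumes that transistor area scales with the square of the linear dimension and that switched capacitance scales linearly with it; both are textbook first-order assumptions, not statements from the slides.

```python
LINEAR_SCALE = 0.7            # shrink of every physical dimension
VOLTAGE_SCALE = 1 / 1.35      # 1.35 times lower operating voltage

# Area goes with the square of the linear dimension.
area_scale = LINEAR_SCALE ** 2
area_reduction = 1 / area_scale

# First-order assumption: capacitance scales with the linear dimension,
# and switching energy per transition is C * V^2.
capacitance_scale = LINEAR_SCALE
switching_energy_scale = capacitance_scale * VOLTAGE_SCALE ** 2
switching_energy_reduction = 1 / switching_energy_scale

print(f"transistor area shrinks {area_reduction:.2f}x")
print(f"switching energy per transition drops {switching_energy_reduction:.2f}x")
```

The area result reproduces the "two times smaller transistors" bullet, and the energy result lands in the neighbourhood of the quoted threefold reduction in switching power.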
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry.
It is characterized by 1000-fold growth in frequency (from 1 MHz to 1 GHz) and in integration (from ~10K to ~10M devices) in 25 years.
Microarchitecture attempts to increase both IPC and frequency.
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction and out-of-order execution can increase instructions per cycle (IPC).
Pipelining, as a microarchitecture idea, helps to increase frequency.
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program.
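These three levers (IPC, frequency and dynamic instruction count) combine in the standard relation: execution time = instructions / (IPC × frequency). The sketch below illustrates it with invented numbers; none of the values come from the slides.

```python
def execution_time(instructions, ipc, frequency_hz):
    """Seconds to run a program: instructions / (IPC * clock frequency)."""
    return instructions / (ipc * frequency_hz)

# Hypothetical baseline: 1e9 instructions, IPC of 1.0, 1 GHz clock.
baseline = execution_time(1e9, ipc=1.0, frequency_hz=1e9)

# Hypothetical improved design: the compiler removes 20% of the instructions,
# the microarchitecture raises IPC to 1.5, and deeper pipelining adds 20% clock.
improved = execution_time(0.8e9, ipc=1.5, frequency_hz=1.2e9)

speedup = baseline / improved
print(f"baseline {baseline:.3f} s, improved {improved:.3f} s, speedup {speedup:.2f}x")
```

Because the factors multiply, modest gains on each lever compound into a speedup larger than any single one of them.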
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages.
With frequencies higher than 1 GHz, more than 20 pipeline stages are used.
[Figure: relative improvement (axis 0 to 20) of frequency, CPI, performance and power versus pipeline depth]
Performance of memory and CPU
Memory in a computer system is hierarchically organized.
In 1980 microprocessors were often designed without caches.
Nowadays microprocessors often come with two levels of caches.
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987.
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed.
Relative processor/memory speed
Types of Memories
MOS memories
• RAMs (volatile: contents lost at power off): DRAM, SRAM
• ROMs (non-volatile: contents kept at power off): ROM, EPROM, EEPROM, FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 – 4.5 CPU cycles per instruction retired.
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed.
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse.
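The "three out of every four cycles" statement follows from the measured cycles per instruction, read here as 4.2 to 4.5. The sketch below assumes, for simplicity, that at most one instruction retires in any cycle; since both processors can retire several per cycle, the true idle fraction would be at least this large.

```python
def idle_cycle_fraction(cycles_per_instruction):
    """Fraction of cycles that retire nothing, if at most one retires per cycle."""
    return 1.0 - 1.0 / cycles_per_instruction

low = idle_cycle_fraction(4.2)
high = idle_cycle_fraction(4.5)

print(f"idle cycles: {low:.1%} to {high:.1%}")
```

Both values sit just above 75%, which is the three-in-four figure quoted in the anecdote.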
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width and two instruction streams.
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy.
It is not clear whether pin bandwidth can keep up – 32 bytes every 2 ns.
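The bandwidth figures in this anecdote can be checked in a few lines. The sketch below uses the fact that one byte per nanosecond equals 10^9 bytes per second; the function name is an illustrative assumption.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer, period_ns):
    """Bytes per nanosecond is numerically equal to GBytes/s (10^9 bytes/s)."""
    return bytes_per_transfer / period_ns

today = bandwidth_gbytes_per_s(16, 8)

# Twice the clock rate, twice the issue width, two instruction streams.
future_required = today * 2 * 2 * 2

# The same requirement expressed as a pin transfer rate.
future_as_transfers = bandwidth_gbytes_per_s(32, 2)

print(today, future_required, future_as_transfers)
```

The three doublings multiply to a factor of eight, which is exactly the step from 16 bytes every 8 ns to 32 bytes every 2 ns.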
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance.
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components.
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data.
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory.
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip.
The first level is ~32 – 128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses.
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.
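The cache figures above can be combined into an average memory access time. The sketch below picks one point inside each quoted range (2 cycles and a 95% hit rate for the first level, 8 cycles and a 50% hit rate for the second) and assumes a main memory access of about 100 cycles; the formula is the standard multi-level average, and the chosen values are illustrative.

```python
def average_access_cycles(l1_hit, l1_cycles, l2_hit, l2_cycles, memory_cycles):
    """Average access time for a two-level cache in front of main memory."""
    l1_miss = 1.0 - l1_hit
    l2_miss = 1.0 - l2_hit
    return l1_cycles + l1_miss * (l2_cycles + l2_miss * memory_cycles)

with_two_levels = average_access_cycles(0.95, 2, 0.50, 8, 100)

# Same first-level cache, but every first-level miss goes straight to memory.
with_one_level = 2 + (1.0 - 0.95) * 100

print(f"two levels: {with_two_levels:.1f} cycles, one level: {with_one_level:.1f} cycles")
```

Even with a 95% first-level hit rate, the rare trips to main memory account for a large share of the average, which is why the second level pays off.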
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles.
A cache miss that eventually has to go to main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
As conclusion concerning memory - problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems – caches, buses, bandwidth and latency.
As conclusion concerning memory – problems, continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change.
b) Burst width, which refers to data access granularity, can effect a 15% performance change.
c) Magnetic RAM (MRAM) – a new type of memory.
Magnetic RAM – MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology.
A nanotechnology RAM device consists of tiny carbon nanotubes.
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter on a silicon wafer.
MRAM is nonvolatile and has the fast read-and-write performance of static RAM (SRAM).
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM).
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] (axis 0 to 120, with points at 100 nm, 90 nm, 70 nm and 55 nm) and density [Gb] (log scale, 0.1 to 100) for DRAM and FLASH in 2003, 2005 and 2010; density labels run from 512 Mb through 1 Gb, 2 Gb, 4 Gb and 8 Gb to 16 Gb, with FLASH shown as SLC, MLC (2 bits/cell) and MLC (3 bits/cell)]
Surpassing the Prediction from Moore's Law – DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
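The difference between the two growth models compounds quickly. The sketch below compares a doubling period of 1.5 years (the Moore's law reading used here) with a doubling every year (the NAND Flash growth model) over a ten-year span; the span is an arbitrary choice for illustration.

```python
def density_growth(years, doubling_period_years):
    """Multiplicative growth after `years` with a fixed doubling period."""
    return 2 ** (years / doubling_period_years)

YEARS = 10

moore_growth = density_growth(YEARS, 1.5)
flash_growth = density_growth(YEARS, 1.0)

print(f"Moore's law: x{moore_growth:.0f}, yearly doubling: x{flash_growth:.0f}, "
      f"ratio: {flash_growth / moore_growth:.1f}")
```

Over one decade the yearly-doubling model ends up about ten times ahead of the 1.5-year model.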
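[Figure: density [Gb] (log scale, 0.1 to 100) versus year, 2000 to 2010, comparing the Moore's Law trend with DRAM and with FLASH growing two-fold in density per year; labeled densities run from 512 Mb through 1 Gb, 2 Gb and 4 Gb to 12 Gb, with MLC (2 bits/cell) and MLC (3 bits/cell) FLASH points]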
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore's Law.
Overall memory prediction roadmap - cont.
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
Mainframe
Supercomputer
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
• During 1995, the energy consumption of all PCs installed in the USA was 60·10^6 MWh.
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production.
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69·10^6 MWh.
Power consumption
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
“CMOS circuits dissipate little power by nature. So believed circuit designers.” (Kuroda-Sakurai, '95)
“By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced.”
Power dissipation in time
[Figure: power (W), logarithmic axis from 0.01 to 100, versus year, 1980 to 1995; annotated trend: x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage [V] (0 to 2.5), power per chip [W] (0 to 200) and VDD current [A] (0 to 500) versus year, 1998 to 2014; curves: Current, Power, Voltage]
International Technology Roadmap for Semiconductors 1999 update sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA) Electronic Industries Association of Japan (EIAJ) Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
Your car starter
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are summarized by
$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
$P_1$ - capacitive switching power (dynamic, dominant)
$P_2$ - short-circuit power (dynamic)
$P_3$ - leakage current power (static)
$P_4$ - static power dissipation (minor)
Research Efforts in Low-Power Design
$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement.
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed.
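The quadratic effect of the supply voltage can be checked numerically. The sketch below evaluates the switching-power term from the slides, P = p_t · C_L · V_dd² · f_clk. The operating point (activity 0.2, 1 nF switched capacitance, 2.5 V, 500 MHz) is hypothetical and chosen only for illustration.

```python
def switching_power(p_t, c_load, v_dd, f_clk):
    """Capacitive switching power P = p_t * C_L * V_dd^2 * f_clk, in watts."""
    return p_t * c_load * v_dd ** 2 * f_clk

# Hypothetical operating point, not taken from the slides.
baseline = switching_power(p_t=0.2, c_load=1e-9, v_dd=2.5, f_clk=500e6)
lower_vdd = switching_power(p_t=0.2, c_load=1e-9, v_dd=1.25, f_clk=500e6)
lower_cap = switching_power(p_t=0.2, c_load=0.5e-9, v_dd=2.5, f_clk=500e6)

print(f"baseline power: {baseline:.3f} W")
print(f"half V_dd:      {lower_vdd / baseline:.2f} x baseline")  # quadratic gain
print(f"half C_L:       {lower_cap / baseline:.2f} x baseline")  # linear gain
```

Halving V_dd cuts the switching power to a quarter, while halving C_L only halves it, which is why voltage reduction is usually considered first.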
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], 0.6 V to 5.0 V]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented toward lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is to decrease the activity of some parts within the VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product ptCL is called the average switched capacitance per cycle. It can be reduced at the system, architectural, RTL, circuit, or technology level.
Low Power and Low Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology (arrow annotation in the figure: higher impact, more options):
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
Multiple frequencies on the chip is a less aggressive approach that is attracting more attention.
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four clock domains, each with its own PLL and DLL (PLL 1/DLL 1 to PLL 4/DLL 4) producing frequencies f11/f12, f21, f31/f32 and f41]
Energy Minimization Using Multiple Frequency
[Figure: two clock generators for a digital system. PLL based: phase detector (CLKREF, CLKFB, up/down), current pump, loop filter, VCO with regulated voltage, divider by N, clock distribution and frequency multiplier. DLL based: phase detector (CLKREF, CLKFB, up/down), current pump, loop filter, control, and a VCDL of delay cells DC1 to DCn (total delay TVCDL) producing f1, 2f1, ..., nf1]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating, in which a latch and the enable signal produce a gated clock for the target flip-flops (DFF with inputs D, C and output Q), shown in activated and deactivated states; clock distribution, in which a PLL clock generator drives Clk to blocks A, B, C and D through Enable_A, Enable_B and Enable_C]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
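A small behavioral model makes the effect of clock gating concrete. The sketch below assumes a latch-based gate (the latch is transparent while the clock is low, and the output is the clock ANDed with the latched enable), which is one common way to build the latch/enable circuit named in the figure; the waveform values are hypothetical.

```python
def gated_clock(clock, enable):
    """Latch-based clock gate: the latch passes `enable` while the clock is low
    and holds it while the clock is high; output = clock AND latched enable."""
    latched = 0
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:              # latch is transparent only in the low phase
            latched = en
        out.append(clk & latched)
    return out

def rising_edges(signal):
    """Count 0 -> 1 transitions, i.e. the clock edges seen by the flip-flops."""
    return sum(1 for a, b in zip(signal, signal[1:]) if a == 0 and b == 1)

# Eight clock cycles, two samples per cycle (low phase, then high phase).
clock = [0, 1] * 8
# Hypothetical enable: the module is idle during cycles 3 to 5.
enable = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

gclk = gated_clock(clock, enable)
print("edges without gating:", rising_edges(clock))
print("edges with gating:   ", rising_edges(gclk))
```

Every suppressed edge is a cycle in which the flip-flops of the idle module and their clock wiring do not switch, which is where the energy saving comes from.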
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure: power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW) and SLEEP (P = 0.16 mW); transitions are labeled ~10 µs, ~90 µs and 160 ms, with "wait for interrupt" and "wait for wake-up event" conditions]
[Figure: power manager, in which an OBSERVER collects observations and workload information from the SYSTEM and a CONTROLLER issues commands to it]
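A power manager has to decide when entering SLEEP is worth its transition cost. The sketch below uses the state powers from the diagram (400 mW, 50 mW, 0.16 mW). The transition times (about 90 µs into SLEEP, 160 ms to wake up) are an assumed reading of the figure labels, and charging transitions at RUN power is an assumption of this sketch, not something the slides state.

```python
P_RUN, P_IDLE, P_SLEEP = 0.400, 0.050, 0.00016   # watts, from the state diagram
T_TO_SLEEP, T_WAKE = 90e-6, 160e-3               # seconds (assumed reading of the labels)
P_TRANSITION = P_RUN                             # assumption: transitions cost RUN power

def energy_idle(t):
    """Energy (J) if the system simply stays in IDLE for t seconds."""
    return P_IDLE * t

def energy_sleep(t):
    """Energy (J) if the system enters SLEEP and wakes up again within t seconds."""
    t_trans = T_TO_SLEEP + T_WAKE
    if t < t_trans:
        raise ValueError("period too short to enter and leave SLEEP")
    return P_TRANSITION * t_trans + P_SLEEP * (t - t_trans)

def break_even():
    """Inactivity period at which SLEEP and IDLE consume the same energy."""
    t_trans = T_TO_SLEEP + T_WAKE
    return t_trans * (P_TRANSITION - P_SLEEP) / (P_IDLE - P_SLEEP)

t_be = break_even()
print(f"break-even inactivity period: {t_be:.2f} s")
print(f"10 s in IDLE:  {energy_idle(10):.4f} J")
print(f"10 s in SLEEP: {energy_sleep(10):.4f} J")
```

Under these assumptions SLEEP only pays off for inactivity periods longer than about a second, so the controller needs workload information from the observer to predict how long the system will stay inactive.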
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown: Clock, Memory, Control, I/O, Datapath. Dynamic instruction statistics: Compare op, Logical op, Data Move, Control Flow, Arithmetic op, Others]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First generation, 1971-78: behind the power curve
• Second generation, 1979-85: becoming "real" computers
• Third generation, 1985-89: challenging the "establishment"
• Fourth generation, 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation 1971-78
Getting enough bits and transistors
Transistor counts < 50,000
Performance < 0.5 MIPS
Architecture: 8-16 bits
- Narrow datapaths (= slow performance)
- Awkward architectures
- Assembly language + some BASIC
Processors
- Intel 4004, 8008, 8080, 8086
- Zilog Z-80
- Motorola 6800, MOS Technology 6502
Intel 4004
First general-purpose single-chip microprocessor
Shipped in 1971
8-bit architecture, 4-bit implementation
2300 transistors
Performance < 0.1 MIPS
8008: 8-bit implementation in 1972
- 3500 transistors
- First microprocessor-based computer (Micral)
  Targeted at laboratory instrumentation
  Mostly sold in Europe
Intel 8080
8-bit architecture with 16-bit addresses
- Delivered in 1974
- 4800 transistors
- Performance < 0.2 MIPS
Used in Altair 8800 system
- Kit form (advertised in Popular Electronics) in 1975
  $297, or $395 with case
  256 bytes of memory, expandable to 64K
  Keyboard and floppy
  100-line bus becomes S-100, the first microcomputer bus
- Gates & Allen write BASIC
- Wozniak builds one
  Homebrew Computer Club
Intel 8086
Introduced in 1978
- Performance < 0.5 MIPS
New 16-bit architecture
- "Assembly language" compatible with 8080
- 29,000 transistors
- Includes memory protection, support for FP coprocessor
In 1981 IBM introduces the PC
- Based on 8088, the 8-bit bus version of 8086
Second Generation 1979-85
Becoming "real" computers
- First 32-bit architecture (68000)
- First virtual memory support
- Workstations, Macs and PCs based on microprocessors
Transistors > 50,000
Performance <= 1 MIPS
Processors
- Motorola 68000, 68020
- Intel 80286, 80386
Motorola 68000
Major architectural step in microprocessors
- First 32-bit architecture
  initial 16-bit implementation
- First flat 32-bit address
  Support for paging
- General-purpose register architecture
  Loosely based on PDP-11
First implementation in 1979
- 68,000 transistors
- < 1 MIPS
Used in
- Apple Mac
- Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the "establishment"
- Microprocessors surpass minicomputers in performance, rival mainframes
- Implementation technology of choice: all new architectures are microprocessors
- RISC architecture techniques take hold
Transistors < 500K
Performance > 5 MIPS
Processors
- MIPS R2000, R3000
- Sun SPARC
- HP PA-RISC
MIPS R2000
Several firsts
- First RISC microprocessor
- First microprocessor to provide integrated support for instruction & data cache
- First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
- 125,000 transistors
- 5-8 MIPS
Fourth Generation 1990-
Architectural and performance leadership
- First 64-bit architecture
- First multiple-issue machine
- First multilevel caches
Transistors > 1M
Clock rates > 100 MHz
Performance > 50 MIPS
Processors
- Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
Generation 4.5: same basic approach, but faster clock rates & wider issue
- Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
Increase performance at 1.6x per year
- True from 1985 to present
Combination of technology and architectural enhancements
- Technology provides faster transistors and more of them
- Faster transistors lead to high clock rates
- More transistors
Architectural ideas turn transistors into performance
- Responsible for about half the yearly performance growth
Two key architectural directions
- Sophisticated memory hierarchies
- Exploiting instruction level parallelism
Memory Hierarchies
Caches hide latency of DRAM and increase BW
- CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
- On-chip: from 128 bytes (1984) to 100K+ bytes
- Multilevel caches add another level of caching
  First multilevel cache: 1986
  Secondary cache sizes today: 128 KB to 4-16 MB
Trend 2: Advances in caching techniques
- Reduce or hide cache miss latencies
  early restart after cache miss (1992)
  nonblocking caches: continue during a cache miss (1994)
- Cache-aware combos: computers, compilers, code writers
  prefetching: instruction to bring data into cache early
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
- Overlapping execution in a pipeline
- Issuing multiple instructions per clock
  superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
- determine parallelism dynamically
- execute instructions out-of-order
- speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
First 64-bit architecture
Integrated caches
- On-chip
- Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
- Deep pipeline
- 1.4M transistors
- Initially 100 MHz
- > 50 MIPS
Intel i860
First multiple-issue microprocessor
- 2 instructions/clock
- Dual issue mode
- Novel push pipeline
- Novel cache bypass
Implemented in 1991
- 1.3M transistors
- 50 MIPS
Used primarily as attached processor (e.g. graphics)
MIPS R10000
First speculative processor
- Instructions scheduled and executed out-of-order
- Up to 4 instructions can complete per clock
- Window of 32 instructions (up to 32 in-flight)
- Maintain precise state by completing instructions in order
Implemented in 1996
- 6.8M transistors
- 200 MHz
Intel IA-64 and Itanium
EPIC architecture
- Use compiler-centric approach while avoiding disadvantages
- Parallelism demarcated by the compiler
- Many special instructions & features for exploiting ILP in the compiler
Itanium
- First implementation (2001)
- 25M transistors
- 800 MHz
- 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software Techniques
- Static scheduling
- Static issue (i.e. VLIW)
- Static branch prediction
- Alias/pointer analysis
- Static speculation
Lower hardware complexity; more longer-range analysis; more machine dependence
Hardware Techniques
- Dynamic scheduling
- Dynamic issue (i.e. superscalar)
- Dynamic branch prediction
- Dynamic disambiguation
- Dynamic speculation
More stable performance; higher complexity; potential clock rate impact
Wide variety of approaches, both hardware and compiler intensive
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two "mountains". ILP mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (logarithmic, 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (logarithmic, 1000 to 100,000,000) versus year, 1970 to 2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000; eras labeled bit-level parallelism, instruction-level parallelism and thread-level parallelism]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a (single) processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4).
They support 2-4 threads, but one can expect only a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique to obtain more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "where you stand depends on where you sit".
In this context, starting from our positions and experiences, this is our view concerning the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global cultural and business contexts, and so must understand them. These, too, are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Total transistors/chip
[Figure: number of transistors per chip (millions, 0 to 1600) versus technology and year: 0.25 µm (1997), 0.18 µm (1999), 0.15 µm (2001), 0.13 µm (2003), 0.10 µm (2006), 0.07 µm (2009), 0.05 µm (2012)]
Clock Frequency Versus Year for Various Representative Machines
Limits in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range.
For higher-frequency clocking (> 10 GHz), new ideas and new ways of designing digital systems are needed.
[Figure: global and local clock frequency (MHz, 0 to 12,000) versus technology and year: 0.25 µm (1997), 0.18 µm (1999), 0.15 µm (2001), 0.13 µm (2003), 0.10 µm (2006), 0.07 µm (2009), 0.05 µm (2012)]
Intel's Microprocessors Clock Frequency
Technology Trends - Example
As an illustration of just how fast computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly.
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987 and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l.
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in under 39 min and to make the round trip on less than 10 liters of gasoline.
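The compound-growth arithmetic of this example is easy to reproduce. The sketch below reads the rates as 35% and 50% per year; the helper name `improve` is only illustrative. It works with unrounded factors, so its results differ slightly from the slide, which rounds the 1987 factor to 20.1 before continuing.

```python
def improve(value, rate, years):
    """Value after compounding a yearly improvement `rate` for `years` years."""
    return value * (1 + rate) ** years

# 1977 starting point from the example: 150 km/h and 10 km/l.
speed_1987 = improve(150, 0.35, 10)
economy_1987 = improve(10, 0.35, 10)
speed_2000 = improve(speed_1987, 0.50, 13)
economy_2000 = improve(economy_1987, 0.50, 13)

print(f"factor 1977-1987: {1.35 ** 10:.1f}")
print(f"factor 1987-2000: {1.5 ** 13:.1f}")
print(f"1987: {speed_1987:,.0f} km/h, {economy_1987:,.0f} km/l")
print(f"2000: {speed_2000:,.0f} km/h, {economy_2000:,.0f} km/l")
```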
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law.
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available.
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm.
Ideally, processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7.
With such scaling, typical improvement figures are the following:
• 1.4-1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
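Two of these figures follow directly from the scaling factor. The sketch below assumes a linear factor of 0.7 and a 1.35x lower operating voltage, and further assumes that switched capacitance scales with the linear dimension and that switching energy goes as C · V². Those modeling assumptions are not stated on the slide.

```python
S = 0.7                # linear scaling factor per generation (assumed reading of "~07")
VOLTAGE_DROP = 1.35    # operating voltage is lower by this factor (assumed reading of "135")

area = S ** 2                                   # relative transistor area
capacitance = S                                 # assumption: C scales with linear size
voltage = 1 / VOLTAGE_DROP                      # relative operating voltage
switching_energy = capacitance * voltage ** 2   # E ~ C * V^2, relative to previous generation

print(f"relative area:             {area:.2f} (about {1 / area:.1f}x smaller)")
print(f"relative switching energy: {switching_energy:.2f} "
      f"(about {1 / switching_energy:.1f}x lower)")
```

Under these assumptions the area halves, matching "two times smaller transistors", and the energy per switching event drops by roughly 2.6x, which is in line with the rounded "three times" on the slide.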
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry.
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and in integration (from ~10K to ~10M devices) in 25 years.
Microarchitecture attempts to increase both IPC and frequency.
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction, and out-of-order execution can increase instructions per cycle (IPC).
Pipelining, as a microarchitecture idea, helps to increase frequency.
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program.
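The three levers named above (dynamic instruction count, IPC, and frequency) combine into the usual performance equation: execution time = instructions / (IPC × frequency). The sketch below illustrates it; all numbers are hypothetical.

```python
def execution_time(instructions, ipc, frequency_hz):
    """Execution time in seconds: instruction count / (IPC * clock frequency)."""
    return instructions / (ipc * frequency_hz)

# Hypothetical baseline: 1e9 dynamic instructions, IPC of 1.0, 1 GHz clock.
base = execution_time(1e9, ipc=1.0, frequency_hz=1e9)

# Hypothetical improved design: a better compiler removes 10% of the instructions,
# microarchitecture raises IPC to 1.5, deeper pipelining raises the clock to 1.5 GHz.
improved = execution_time(0.9e9, ipc=1.5, frequency_hz=1.5e9)

print(f"baseline: {base:.2f} s")
print(f"improved: {improved:.2f} s ({base / improved:.2f}x faster)")
```

Because the three factors multiply, modest gains at the compiler, microarchitecture, and process levels compound into a large overall speedup.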
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages.
With frequencies higher than 1 GHz, more than 20 pipeline stages are used.
[Figure: relative improvement of frequency, CPI, performance and power versus pipeline depth]
Performance of memory and CPU
Memory in a computer system is hierarchically organized.
In 1980, microprocessors were often designed without caches.
Nowadays, microprocessors often come with two levels of caches.
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987.
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed.
Relative processor/memory speed
Type of Memories
[Figure: classification of MOS memories. RAMs: DRAM, SRAM (volatile: contents lost at power off). ROMs: ROM, EPROM, EEPROM, FLASH (non-volatile: contents kept at power off)]
Percentage of Usage
Typical Applications of DRAM
An anecdoteAn anecdote
In recent database benchmark study using TPC-C both 200MHz Pentium Pro and 21164 Alpha systems were measured at 42 ndash 45 CPU cycles per instruction retired
IN other words three out of every four CPU cycles retired zero instructions most were spent waiting for memory Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
An anecdote (continued)
If a CPU chip today needs to move 2 GB/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams.
All these factors multiply together to require about 16 GB/s of pin bandwidth to keep this chip busy.
It is not clear whether pin bandwidth can keep up: that is 32 bytes every 2 ns.
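The arithmetic behind this anecdote can be checked with a short sketch. The function name is hypothetical; the only inputs are the figures quoted above (16 bytes every 8 ns, and three factors of two).

```python
def bandwidth_gb_per_s(bytes_per_transfer: float, interval_ns: float) -> float:
    """One byte per nanosecond equals 1e9 bytes per second, i.e. 1 GB/s."""
    return bytes_per_transfer / interval_ns

# Today's chip in the anecdote: 16 bytes every 8 ns.
today = bandwidth_gb_per_s(16, 8)

# Future chip: twice the clock rate, twice the issue width, two instruction streams.
scaling = 2 * 2 * 2
future = today * scaling

print(f"today:  {today} GB/s")
print(f"future: {future} GB/s")
print(f"32 bytes every 2 ns: {bandwidth_gb_per_s(32, 2)} GB/s")
```

The three factors multiply rather than add, which is why an eightfold increase in pin bandwidth is needed.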
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance.
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components.
Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data.
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory.
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip.
The first level is roughly 32 to 128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses.
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.
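To see what these figures imply, here is a minimal sketch of an average access time estimate. It assumes a simple model that the slides do not state: each miss adds the access time of the next level, and main memory costs about 100 cycles, as quoted earlier. The function name and the chosen operating points are illustrative only.

```python
def average_access_cycles(l1_cycles, l1_hit_rate, l2_cycles, l2_hit_rate, memory_cycles):
    """Average cycles per access for a two-level cache in front of main memory.

    Assumed model: an L1 miss pays the L2 access on top of the L1 access,
    and an L2 miss additionally pays the main memory access.
    """
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * memory_cycles)

# Pessimistic end of the quoted ranges: 3-cycle L1, 10-cycle L2.
slow = average_access_cycles(3, 0.95, 10, 0.50, 100)
# Optimistic end of the quoted ranges: 2-cycle L1, 6-cycle L2.
fast = average_access_cycles(2, 0.95, 6, 0.50, 100)

print(f"average access time: {fast:.1f} to {slow:.1f} cycles (versus 100 without caches)")
```

Under these assumptions the hierarchy brings the average cost of an access down from about 100 cycles to a handful of cycles.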
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles.
A cache miss that eventually has to go to main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
Conclusion concerning memory problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems: caches, buses, bandwidth, and latency.
Conclusion concerning memory problems (continued)
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture-level and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change.
b) Burst width, which refers to data access granularity, can effect a 15% performance change.
c) Magnetic RAM (MRAM), a new type of memory.
Magnetic RAM (MRAM) or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology.
A nanotechnology RAM device consists of tiny carbon nanotubes.
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
The 10 Gbit devices consist of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter on a silicon wafer.
MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM).
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] (100, 90, 70, 55 nm) and density [Gb] (log scale, 0.1 to 100) versus year (2003, 2005, 2010) for FLASH and DRAM; FLASH shown for SLC, MLC (2 bits/cell), and MLC (3 bits/cell)]
Surpassing the Prediction from Moore's Law - DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
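The difference between the two doubling periods compounds quickly. The sketch below compares them over a six-year span; the span and the function name are arbitrary choices for illustration, not figures from the slides.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Density multiplier after `years` when density doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)

years = 6  # illustrative horizon, chosen so both exponents are whole numbers
moore = growth_factor(years, 1.5)   # doubling every 1.5 years
flash = growth_factor(years, 1.0)   # doubling every year

print(f"after {years} years: x{moore:.0f} at the Moore's Law rate, x{flash:.0f} at the yearly-doubling rate")
```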
[Figure: density [Gb] (log scale, 0.1 to 100) versus year (2000, 2005, 2010) for DRAM and FLASH, compared with the Moore's Law line; FLASH follows a two-fold density increase per year, with MLC (2 bits/cell) and MLC (3 bits/cell) points]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep leading the overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore's Law.
Overall memory prediction roadmap (continued)
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh.
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production.
• By 2015, one expects the energy consumption of all PCs to be 15% greater than in 1995, or 69 × 10^6 MWh.
Power consumption
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai 95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: power (W, log scale, 0.01 to 100) versus year (1980 to 1995); trend line: x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage [V], power per chip [W], and VDD current [A] versus year, 1998 to 2014]
International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA), and the Taiwan Semiconductor Industry Association (TSIA).
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: Your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$
where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
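As a minimal sketch, the terms of the average-power expression can be evaluated directly. All numeric values below are made-up illustrative inputs, not data from the slides, and the function and parameter names are hypothetical.

```python
def power_breakdown(p_t, c_load, v_dd, f_clk, i_sc, i_leakage, p_static=0.0):
    """Return the four power terms (in watts) and their sum.

    p_t       : switching activity factor (dimensionless)
    c_load    : load capacitance C_L in farads
    v_dd      : supply voltage in volts
    f_clk     : clock frequency in hertz
    i_sc      : average short-circuit current in amperes
    i_leakage : leakage current in amperes
    p_static  : minor static dissipation P4 in watts
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leakage * v_dd                   # leakage power
    return {"P1": p1, "P2": p2, "P3": p3, "P4": p_static,
            "Pavg": p1 + p2 + p3 + p_static}

# Illustrative (assumed) operating point: 1 nF switched, 2.5 V, 100 MHz.
result = power_breakdown(p_t=0.1, c_load=1e-9, v_dd=2.5, f_clk=100e6,
                         i_sc=1e-3, i_leakage=1e-5)
for name, watts in result.items():
    print(f"{name}: {watts * 1e3:.3f} mW")
```

With inputs of this kind the switching term P1 dominates, which matches the "dominant" label attached to it.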
Research Efforts in Low-Power Design
$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement.
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed.
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V]]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is:
Decreasing the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
a) Reduction in fCLK is an option acceptable when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product ptCL is called the average switched capacitance per cycle, and this capacitance can be reduced at the system, architectural, RTL, circuit, or technology level.
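The different effects of options a) and b) can be illustrated with a small sketch. It only uses the switching-power expression pt CL Vdd^2 fCLK; the operating point is an assumed example, and the increase in circuit delay at lower Vdd is deliberately not modeled.

```python
def switching_power(p_t, c_load, v_dd, f_clk):
    """Dynamic switching power in watts: p_t * C_L * Vdd^2 * f_CLK."""
    return p_t * c_load * v_dd ** 2 * f_clk

def energy_per_cycle(p_t, c_load, v_dd):
    """Energy drawn per clock cycle in joules (power divided by frequency)."""
    return p_t * c_load * v_dd ** 2

# Assumed baseline: activity 0.2, 1 nF load, 2.0 V, 100 MHz.
base_power = switching_power(0.2, 1e-9, 2.0, 100e6)
base_energy = energy_per_cycle(0.2, 1e-9, 2.0)

# a) Halve the clock frequency: power halves, energy per cycle is unchanged.
slow_power = switching_power(0.2, 1e-9, 2.0, 50e6)
slow_energy = energy_per_cycle(0.2, 1e-9, 2.0)

# b) Halve the supply voltage: power and energy per cycle both drop fourfold.
lowv_power = switching_power(0.2, 1e-9, 1.0, 100e6)
lowv_energy = energy_per_cycle(0.2, 1e-9, 1.0)

print(f"half f_CLK: power x{slow_power / base_power:.2f}, energy/cycle x{slow_energy / base_energy:.2f}")
print(f"half Vdd:   power x{lowv_power / base_power:.2f}, energy/cycle x{lowv_energy / base_energy:.2f}")
```

Slowing the clock stretches the same work over more time, so it helps power but not the energy spent per operation; lowering Vdd reduces both.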
General Approaches to Reduce Power
Low Power and Low Energy System Design
Higher impact, more options
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology.
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention.
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four local clock generators (PLL 1 / DLL 1 to PLL 4 / DLL 4), each producing its own local frequencies (f11, f12, f21, f31, f32, f41)]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation (phase detector, current pump, loop filter, VCO, divider by N, CLKREF and CLKFB inputs, regulated voltage, clock distribution & frequency multiplier feeding the digital system) and DLL-based clock generation (phase detector, current pump, loop filter, VCDL with control input, outputs f1, 2f1, ..., nf1 through DC1, DC2, ..., DCn feeding the digital system)]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating (an enable signal, passed through a latch, combines with the clock to produce a gated clock for the target D flip-flops, which are either activated or deactivated) and clock distribution (a PLL clock generator distributes Clk to blocks A, B, C, and D, with Enable_A, Enable_B, and Enable_C controls)]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
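A behavioral sketch of latch-based clock gating is given below. It assumes one common arrangement, which the slides do not spell out: the enable is captured by a latch that is transparent while the clock is low, and the gated clock is the AND of the clock and the latched enable. Signal names and the sample waveforms are hypothetical.

```python
def gate_clock(clock, enable):
    """Return the gated clock for equally sampled clock and enable waveforms (0/1 lists)."""
    latched_enable = 0
    gated = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            # Latch is transparent only while the clock is low, so the enable
            # cannot change the gated clock in the middle of a high phase.
            latched_enable = en
        gated.append(clk & latched_enable)
    return gated

def count_rising_edges(signal):
    """Count 0 -> 1 transitions, each of which would clock the target flip-flops."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)

clock = [0, 1] * 6                                  # six clock cycles, two samples each
enable = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0]       # module needed in cycles 0, 1, and 4
gated = gate_clock(clock, enable)

print("free-running edges:", count_rising_edges(clock))
print("gated edges:       ", count_rising_edges(gated))
```

Every suppressed edge is a cycle in which the module's flip-flops and clock wiring do not switch, which is where the energy saving comes from.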
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
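The saving from this scheme can be estimated with a rough sketch in which switching energy per cycle is taken as C × V^2 for each module. The module list, capacitances, and voltage levels are invented for illustration, and the cost of level converters between voltage domains is ignored.

```python
def energy_per_cycle_pj(modules, v_high, v_low):
    """Total switching energy per cycle in pJ.

    modules: list of (name, switched capacitance in pF, on_critical_path).
    Critical-path modules run at v_high; all others run at v_low.
    """
    total = 0.0
    for _name, cap_pf, critical in modules:
        v = v_high if critical else v_low
        total += cap_pf * v ** 2      # pF * V^2 = pJ
    return total

# Hypothetical design: only the datapath sits on the critical path.
modules = [
    ("datapath", 40.0, True),
    ("control", 10.0, False),
    ("io", 30.0, False),
]

single_supply = energy_per_cycle_pj(modules, v_high=2.5, v_low=2.5)
dual_supply = energy_per_cycle_pj(modules, v_high=2.5, v_low=1.5)

print(f"single supply: {single_supply} pJ per cycle")
print(f"dual supply:   {dual_supply} pJ per cycle")
```

In this made-up case the critical path keeps its full voltage, and therefore its timing, while the noncritical modules account for the entire saving.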
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
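[Figure: power state machine with three states, RUN (P = 400 mW), IDLE (P = 50 mW), and SLEEP (P = 0.16 mW); RUN enters IDLE on "wait for interrupt" and SLEEP on "wait for wake-up event"; transition times are about 10 µs, about 90 µs, and 160 ms]
[Figure: power manager consisting of an OBSERVER and a CONTROLLER; the observer receives observations and workload information from the SYSTEM, and the controller issues commands to it]
Power Manager; Power State Machine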
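The benefit of moving between power states can be estimated with a simple time-weighted average. The sketch below uses the three state powers from the figure, reading the sleep power as 0.16 mW; the time split is an invented workload, and transition delays and transition energy are ignored.

```python
# State powers in mW, as read from the power state machine figure (assumed reading).
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}

def average_power_mw(time_in_state):
    """Time-weighted average power for a mapping of state name -> time spent (any unit)."""
    total_time = sum(time_in_state.values())
    energy = sum(STATE_POWER_MW[state] * t for state, t in time_in_state.items())
    return energy / total_time

# Hypothetical workload: busy 10% of the time.
always_on = average_power_mw({"RUN": 100.0})
managed = average_power_mw({"RUN": 10.0, "IDLE": 30.0, "SLEEP": 60.0})

print(f"always running: {always_on:.1f} mW")
print(f"power managed:  {managed:.3f} mW")
```

A real power manager also has to weigh the wake-up latency of each state against the expected idle time, which this estimate leaves out.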
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown by clock, memory, control, I/O, and datapath. Dynamic instruction statistics by compare op, logical op, data move, control flow, arithmetic op, and others.]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First Generation 1971-78
- Behind the power curve
• Second Generation 1979-85
- Becoming "real" computers
• Third Generation 1985-89
- Challenging the "establishment"
• Fourth Generation 1990-
- Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
- Narrow datapaths (= slow performance)
- Awkward architectures
- Assembly language + some BASIC
• Processors
- Intel 4004, 8008, 8080, 8086
- Zilog Z-80
- Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
- 3500 transistors
- First microprocessor-based computer (Micral)
  Targeted at laboratory instrumentation
  Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
- Delivered in 1974
- 4800 transistors
- Performance < 0.2 MIPS
• Used in Altair 8800 system
- Kit form (advertised in Popular Electronics) in 1975
  $297, or $395 with case
  256 bytes of memory, expandable to 64K
  Keyboard and floppy
  100-line bus becomes S-100, the first microcomputer bus
- Gates & Allen write BASIC
- Wozniak builds one
  Homebrew Computer Club
Intel 8086
• Introduced in 1978
- Performance < 0.5 MIPS
• New 16-bit architecture
- "Assembly language" compatible with 8080
- 29,000 transistors
- Includes memory protection, support for FP coprocessor
• In 1981, IBM introduces the PC
- Based on the 8088, an 8-bit bus version of the 8086
Second Generation 1979-85
• Becoming "real" computers
- First 32-bit architecture (68000)
- First virtual memory support
- Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
- Motorola 68000, 68020
- Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
- First 32-bit architecture, initial 16-bit implementation
- First flat 32-bit address
  Support for paging
- General-purpose register architecture
  Loosely based on PDP-11
• First implementation in 1979
- 68,000 transistors
- < 1 MIPS
• Used in
- Apple Mac
- Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the "establishment"
- Microprocessors surpass minicomputers in performance, rival mainframes
- Implementation technology of choice: all new architectures are microprocessors
- RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
- MIPS R2000, R3000
- Sun SPARC
- HP PA-RISC
MIPS R2000
• Several firsts
- First RISC microprocessor
- First microprocessor to provide integrated support for instruction & data cache
- First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
- 125,000 transistors
- 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
- First 64-bit architecture
- First multiple-issue machine
- First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
- Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
- Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
- True from 1985 to the present
• Combination of technology and architectural enhancements
- Technology provides faster transistors and more of them
- Faster transistors lead to high clock rates
- More transistors: architectural ideas turn transistors into performance
- Responsible for about half the yearly performance growth
• Two key architectural directions
- Sophisticated memory hierarchies
- Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
- CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
- On-chip: from 128 bytes (1984) to 100K+ bytes
- Multilevel caches add another level of caching
  First multilevel cache: 1986
  Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
- Reduce or hide cache miss latencies
  early restart after cache miss (1992)
  nonblocking caches: continue during a cache miss (1994)
- Cache-aware combos: computers, compilers, code writers
  prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
- Overlapping execution in a pipeline
- Issuing multiple instructions per clock
  superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
- determine parallelism dynamically
- execute instructions out-of-order
- speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
- On-chip
- Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
- Deep pipeline
- 1.4M transistors
- Initially 100 MHz
- > 50 MIPS
Intel i860
• First multiple-issue microprocessor
- 2 instructions/clock
- Dual issue mode
- Novel push pipeline
- Novel cache bypass
• Implemented in 1991
- 1.3M transistors
- 50 MIPS
• Used primarily as attached processor (e.g., graphics)
MIPS R10000
• First speculative processor
- Instructions scheduled and executed out-of-order
- Up to 4 instructions can complete per clock
- Window of 32 instructions (up to 32 in-flight)
- Maintains precise state by completing instructions in order
• Implemented in 1996
- 6.8M transistors
- 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
- Use compiler-centric approach while avoiding disadvantages
- Parallelism demarcated by the compiler
- Many special instructions & features for exploiting ILP in the compiler
• Itanium
- First implementation (2001)
- 25M transistors
- 800 MHz
- 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software Techniques
- Static scheduling
- Static issue (i.e., VLIW)
- Static branch prediction
- Alias/pointer analysis
- Static speculation
  Lower hardware complexity; more longer-range analysis; more machine dependence
• Hardware Techniques
- Dynamic scheduling
- Dynamic issue (i.e., superscalar)
- Dynamic branch prediction
- Dynamic disambiguation
- Dynamic speculation
  More stable performance; higher complexity; potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: "ILP Mountain" with the steps simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, and speculation; "Cache Mountain" with the steps simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, and multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers, and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistor count (log scale, 1,000 to 100,000,000) versus year (1970 to 2005) for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000; successive eras labeled bit-level parallelism, instruction-level parallelism, and thread-level parallelism (?)]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreading (SMT) processors, and
b) chip multiprocessors (CMP).
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
They support 2-4 threads, but one can expect only about a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip communication to build multichip modules with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "where you stand depends on where you sit".
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT.
Discrete mathematics, not continuous math, is the underpinning of IT.
It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast.
Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields.
More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context.
The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them.
These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education
Clock Frequency Versus Year for Various Representative Machines
Limiting in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range.
For higher frequency clocking (> 10 GHz), new ideas and new ways of designing digital systems are needed.
[Figure: global and local clock frequency (MHz, 0 to 12,000) versus year and technology (micron): 1997 (0.25), 1999 (0.18), 2001 (0.15), 2003 (0.13), 2006 (0.10), 2009 (0.07), 2012 (0.05)]
Intel's Microprocessors Clock Frequency
Technology Trends - Example
As an illustration of just how quickly computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987, and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in under 40 min and to make the round trip on less than 20 liters of gasoline
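The arithmetic of this example can be checked with a short Python sketch. The function name `compound` and the mean earth-moon distance of 384,400 km are assumptions introduced here for illustration; they are not part of the slides.

```python
def compound(value, rate, years):
    """Grow `value` by a fixed fractional `rate` per year for `years` years."""
    return value * (1.0 + rate) ** years


# Starting point from the example: an average car in 1977.
speed_1977 = 150.0    # km/h
economy_1977 = 10.0   # km/l

# 35% per year for the 10 years from 1977 to 1987.
speed_1987 = compound(speed_1977, 0.35, 10)
economy_1987 = compound(economy_1977, 0.35, 10)

# 50% per year for the 13 years from 1987 to 2000.
speed_2000 = compound(speed_1987, 0.50, 13)
economy_2000 = compound(economy_1987, 0.50, 13)

# Assumed mean earth-moon distance (not given in the slides).
EARTH_MOON_KM = 384_400
minutes_to_moon = EARTH_MOON_KM / speed_2000 * 60
litres_round_trip = 2 * EARTH_MOON_KM / economy_2000

print(f"1987: {speed_1987:.0f} km/h, {economy_1987:.0f} km/l")
print(f"2000: {speed_2000:.0f} km/h, {economy_2000:.0f} km/l")
print(f"Moon in {minutes_to_moon:.1f} min, round trip on {litres_round_trip:.1f} l")
```

The unrounded factors give slightly larger values than the slide, which rounds the 1987 factor to 20.1 before continuing.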
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with >10^11 components are expected to be available
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm
Ideally, process technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7
With such scaling, typical improvement figures are the following:
• 1.4-1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and integration (from ~10K to ~10M devices) in 25 years
Microarchitecture attempts to increase both IPC and frequency
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction and out-of-order execution can increase instructions per cycle (IPC)
Pipelining, as a microarchitecture idea, helps to increase frequency
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program
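The three levers named above (dynamic instruction count, IPC and frequency) combine in the usual execution-time relation, time = instructions / (IPC x frequency). The sketch below illustrates that relation; the function name and all numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def execution_time(instructions, ipc, frequency_hz):
    """Execution time in seconds: instructions / (IPC * clock frequency)."""
    return instructions / (ipc * frequency_hz)


# Hypothetical baseline machine and program.
baseline = execution_time(instructions=1.0e9, ipc=0.5, frequency_hz=500e6)

# Better ISA and compiler: 20% fewer dynamic instructions (assumed).
fewer_instr = execution_time(instructions=0.8e9, ipc=0.5, frequency_hz=500e6)

# Caches, branch prediction, out-of-order execution: IPC doubles (assumed).
higher_ipc = execution_time(instructions=0.8e9, ipc=1.0, frequency_hz=500e6)

# Deeper pipelining: clock frequency doubles (assumed).
higher_freq = execution_time(instructions=0.8e9, ipc=1.0, frequency_hz=1e9)

for label, t in [("baseline", baseline), ("fewer instructions", fewer_instr),
                 ("higher IPC", higher_ipc), ("higher frequency", higher_freq)]:
    print(f"{label:20s} {t:.2f} s")
```

In practice the levers are not independent: deeper pipelines tend to lower IPC, which is why the relation is evaluated as a product rather than optimized term by term.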
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages
With frequencies higher than 1 GHz, more than 20 pipeline stages are used
[Figure: relative improvement in frequency, CPI, performance and power versus pipeline depth (0 to 20 stages)]
Performance of memory and CPU
Memory in a computer system is hierarchically organized
In 1980 microprocessors were often designed without caches
Nowadays microprocessors often come with two levels of caches
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed
Relative processor/memory speed
Type of Memories
MOS memories
• RAMs (volatile: contents are lost at power-off): DRAM, SRAM
• ROMs (non-volatile: contents are kept at power-off): ROM, EPROM, EEPROM, FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2-4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
An anecdote - continue
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns
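The bandwidth figures in this anecdote follow from simple arithmetic, sketched below in Python. The helper name `bandwidth_gb_per_s` is introduced here for illustration only.

```python
def bandwidth_gb_per_s(bytes_per_transfer, interval_ns):
    """Pin bandwidth in GBytes/s; one byte per nanosecond equals 1 GByte/s."""
    return bytes_per_transfer / interval_ns


# Today's chip from the anecdote: 16 bytes every 8 ns.
today = bandwidth_gb_per_s(16, 8)

# Future chip: twice the clock rate, twice the issue width, two instruction streams.
future = today * 2 * 2 * 2

# The same requirement expressed as 32 bytes every 2 ns.
future_check = bandwidth_gb_per_s(32, 2)

print(f"today:  {today:.0f} GBytes/s")
print(f"future: {future:.0f} GBytes/s (check: {future_check:.0f} GBytes/s)")
```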
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
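These figures can be combined into an average memory access time. The sketch below uses the standard recursive formula (hit time plus miss rate times the penalty of the next level); the formula and the function name are not from the slides, and the upper-bound latencies and the 100-cycle main memory access quoted earlier are taken as inputs.

```python
def average_access_time(l1_cycles, l1_hit_rate, l2_cycles, l2_hit_rate, mem_cycles):
    """Average memory access time in cycles for a two-level cache hierarchy."""
    l2_penalty = l2_cycles + (1.0 - l2_hit_rate) * mem_cycles
    return l1_cycles + (1.0 - l1_hit_rate) * l2_penalty


# Upper-bound latencies from the text: L1 = 3 cycles with 95% hits,
# L2 = 10 cycles catching 50% of L1 misses, main memory = 100 cycles.
with_caches = average_access_time(3, 0.95, 10, 0.50, 100)

# Without any cache, every access pays the main memory latency.
without_caches = 100

print(f"with two cache levels: {with_caches:.1f} cycles")
print(f"without caches:        {without_caches} cycles")
```

Under these assumptions the average access costs about 6 cycles instead of 100, which is why the hierarchy pays off even though the second level catches only half of the first-level misses.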
Memory Hierarchy Impact on Performance
Off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
As conclusion concerning memory - problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems: caches, busses, bandwidth and latency
As conclusion concerning memory - problems continue
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System-level parameters most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM), a new type of memory
Magnetic RAM - MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
The 10 Gbit devices consist of 10 billion carbon nanotubes that are 1 nm - just a few thousand atoms - in diameter on a silicon wafer
MRAM is nonvolatile; it has the fast read-and-write performance of static RAM (SRAM)
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule (nm, 0 to 120) and density (Gb, log scale 0.1 to 100) of FLASH and DRAM for 2003, 2005 and 2010; labels include design rules of 100, 90, 70 and 55 nm, densities up to 16 Gb, SLC, MLC (2 bits/cell) and MLC (3 bits/cell)]
Surpassing the Prediction from Moore's Law - DRAM vs MRAM
The famous Moore's Law predicts that memory density doubles every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year
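The difference between the two doubling periods compounds quickly. The sketch below compares them over a six-year span; the span and the function name are assumptions chosen for illustration.

```python
def growth_factor(years, doubling_period_years):
    """Density growth factor after `years` when density doubles every period."""
    return 2.0 ** (years / doubling_period_years)


SPAN_YEARS = 6  # hypothetical observation window

moore = growth_factor(SPAN_YEARS, 1.5)   # doubling every 1.5 years
flash = growth_factor(SPAN_YEARS, 1.0)   # doubling every year

print(f"Moore's Law (1.5-year doubling): x{moore:.0f}")
print(f"NAND Flash (1-year doubling):    x{flash:.0f}")
print(f"Flash lead after {SPAN_YEARS} years:       x{flash / moore:.0f}")
```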
[Figure: density (Gb, log scale 0.1 to 100) versus year (2000 to 2010) for FLASH (two-fold density growth per year; MLC with 2 and 3 bits/cell) and DRAM, compared with Moore's Law]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will keep leading overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap - cont
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
• During 1995, the energy consumption of all PC machines installed in the USA was 60×10^6 MWh
• During 2000, the energy consumption of all PC machines installed in the USA was 10% of the total energy production
• During 2015, one expects that the energy consumption of all PC machines will be 15% greater than in 1995, or 69×10^6 MWh
Power consumption
• battery operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
Typical Low-Power Applications
"CMOS circuits dissipate little power by nature. So believed circuit designers" (Kuroda-Sakurai 95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced"
[Figure: power (W, log scale 0.01 to 100) versus year (80 to 95); power grows x4 every 3 years]
Power dissipation in time
Gloom and Doom predictions
Power density will increase
[Figure: VDD, Power and Current Trend: voltage (V, 0 to 2.5), power per chip (W) and VDD current (A) versus year, 1998 to 2014]
International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), Electronic Industries Association of Japan (EIAJ), Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
Your car starter
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 - capacitive switching power (dynamic - dominant)
P2 - short circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
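A minimal numeric sketch of this power equation follows. The function name and every parameter value (activity factor, load capacitance, currents, clock frequency) are hypothetical and only illustrate how the terms compare and how the switching term reacts to the supply voltage.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leakage, p_static=0.0):
    """Return (P1, P2, P3, total) in watts for the CMOS power equation."""
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short circuit power
    p3 = i_leakage * v_dd                   # leakage current power
    return p1, p2, p3, p1 + p2 + p3 + p_static


# Hypothetical operating point: activity 0.1, 1 nF switched load,
# 3.3 V supply, 100 MHz clock, 1 mA short-circuit and 1 uA leakage current.
p1, p2, p3, total = average_power(0.1, 1e-9, 3.3, 100e6, 1e-3, 1e-6)

# Same circuit with the supply voltage halved.
p1_half, _, _, total_half = average_power(0.1, 1e-9, 1.65, 100e6, 1e-3, 1e-6)

print(f"P1 = {p1 * 1e3:.1f} mW, P2 = {p2 * 1e3:.2f} mW, P3 = {p3 * 1e6:.2f} uW")
print(f"total at 3.3 V:  {total * 1e3:.1f} mW")
print(f"total at 1.65 V: {total_half * 1e3:.1f} mW")
print(f"switching power ratio: {p1 / p1_half:.1f}")
```

With these assumed values the switching term dominates, and halving the supply voltage cuts it by a factor of four, which is the quadratic dependence the equation expresses.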
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flop
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V)]
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Needs for Low-Power
Low-Power Design Techniques
The basic idea is
decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-active components
a) Reduction in fCLK is an option acceptable when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way for power reduction, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product pt·CL is called the average switched capacitance per cycle, and the main directions for reducing this capacitance are at the system, architectural, RTL, circuit or technology level
General Approaches to Reduce Power
Low Power and Low Energy System Design
Design levels and their techniques (higher levels offer higher impact and more options):
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
The design of low power circuits can be tackled at different levels from system to technology
Multiple Frequency on the Chip as Technique to Reduce Power
Multiple frequencies on the chip is a less aggressive approach, which attracts more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL distributes fCLK to four local clock blocks (PLL 1/DLL 1 to PLL 4/DLL 4), which generate the local frequencies f11, f12, f21, f31, f32 and f41]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation (phase detector, current pump, loop filter, VCO, divider by N, clock distribution & frequency multiplier, regulated voltage, digital system) and DLL-based clock generation (phase detector, current pump, loop filter, control logic, VCDL with outputs f1, 2f1, ..., nf1, digital system); signals CLKREF, CLKFB, up and down]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating (a latch combines clock and enable into a gated clock for the target flip-flops) and clock distribution (a PLL clock generator feeds modules A, B, C and D through Enable_A, Enable_B and Enable_C; modules are shown as activated or deactivated)]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
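The effect of clock gating can be illustrated with a very small behavioral model: the clock edges that reach a module are counted, with and without gating, for a given enable trace. The function name and the trace are hypothetical, and the model assumes that clock-network switching energy is proportional to the number of delivered edges.

```python
def clock_edges_delivered(enable_trace, gated):
    """Count rising clock edges reaching a module; one enable sample per cycle."""
    if gated:
        # A gated clock only passes the edge in cycles where enable is set.
        return sum(1 for enabled in enable_trace if enabled)
    # Without gating, the module sees an edge on every cycle.
    return len(enable_trace)


# Hypothetical enable trace: the module is needed in 3 of 8 cycles.
trace = [1, 1, 0, 0, 0, 0, 1, 0]

ungated_edges = clock_edges_delivered(trace, gated=False)
gated_edges = clock_edges_delivered(trace, gated=True)
saving = 1.0 - gated_edges / ungated_edges

print(f"edges without gating: {ungated_edges}")
print(f"edges with gating:    {gated_edges}")
print(f"clock activity saved: {saving:.1%}")
```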
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
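The saving from a second, lower supply voltage can be sketched with the switching-energy relation E = C·Vdd². The module list, the capacitances and the two voltage levels below are hypothetical; the sketch ignores level-converter overhead and delay, which a real design would have to account for.

```python
def total_switching_energy(modules, v_high, v_low):
    """Sum C * Vdd^2 over modules; critical ones use v_high, others v_low."""
    energy = 0.0
    for _name, capacitance_f, on_critical_path in modules:
        v_dd = v_high if on_critical_path else v_low
        energy += capacitance_f * v_dd ** 2
    return energy


# Hypothetical modules: (name, switched capacitance in farads, on critical path?)
modules = [
    ("multiplier", 4e-12, True),
    ("adder", 2e-12, False),
    ("control", 1e-12, False),
    ("register_file", 3e-12, False),
]

single_supply = total_switching_energy(modules, v_high=3.3, v_low=3.3)
dual_supply = total_switching_energy(modules, v_high=3.3, v_low=2.0)

print(f"single 3.3 V supply: {single_supply * 1e12:.1f} pJ per switching event")
print(f"3.3 V / 2.0 V split: {dual_supply * 1e12:.1f} pJ per switching event")
print(f"energy saved:        {1 - dual_supply / single_supply:.1%}")
```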
System Level Dynamic Power Management as another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
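[Figure: Power Manager (an observer and a controller take workload information and observations from the system and issue commands to it) and Power State Machine with the states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt) and SLEEP (P = 0.16 mW, wait for wake-up event); the transitions carry the extracted labels ~10s, ~90s and 160ms]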
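The benefit of such a power state machine can be estimated from the state powers given above (RUN 400 mW, IDLE 50 mW, SLEEP 0.16 mW). The sketch below computes the time-weighted average power; the residency fractions are hypothetical, and transition times and transition energy are deliberately ignored.

```python
# State powers in milliwatts, as given for the power state machine.
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(residency):
    """Time-weighted average power for fractions of time spent in each state."""
    if abs(sum(residency.values()) - 1.0) > 1e-9:
        raise ValueError("residency fractions must sum to 1")
    return sum(STATE_POWER_MW[state] * share for state, share in residency.items())


# Hypothetical workload: busy 10% of the time, briefly idle 30%, asleep 60%.
managed = average_power_mw({"RUN": 0.10, "IDLE": 0.30, "SLEEP": 0.60})

# Without power management the system stays in RUN all the time.
unmanaged = average_power_mw({"RUN": 1.0})

print(f"always RUN:        {unmanaged:.1f} mW")
print(f"with power states: {managed:.1f} mW")
print(f"reduction factor:  {unmanaged / managed:.1f}x")
```

A real power manager also has to weigh the wake-up delay of each state against the expected idle period, which this sketch leaves out.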
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown by clock, memory, control, I/O and datapath; dynamic instruction statistics by compare op, logical op, data move, control flow, arithmetic op and others]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First Generation 1971-78: behind the power curve
• Second Generation 1979-85: becoming "real" computers
• Third Generation 1985-89: challenging the "establishment"
• Fourth Generation 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3500 transistors
  - First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
  - Delivered in 1974
  - 4800 transistors
  - Performance < 0.2 MIPS
• Used in Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one (Homebrew Computer Club)
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
  - Based on the 8088, the 8-bit bus version of the 8086
Second Generation 1979-85
• Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture, initial 16-bit implementation
  - First flat 32-bit address, support for paging
  - General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985 to present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors
• Architectural ideas turn transistors into performance
  - Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
  - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
    early restart after cache miss (1992)
    nonblocking caches: continue during a cache miss (1994)
  - Cache-aware combos: computers, compilers, code writers
    prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock
    superscalar: uses dynamic issue decision (HW driven)
    VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple issue microprocessors
• 1995: sophisticated dynamic schemes
  - determine parallelism dynamically
  - execute instructions out-of-order
  - speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  - On-chip
  - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
• First multiple issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out-of-order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in-flight)
  - Maintain precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Use compiler-centric approach while avoiding disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software Techniques
  - Static scheduling
  - Static issue (i.e. VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
  Lower hardware complexity, more longer-range analysis, more machine dependence
• Hardware Techniques
  - Dynamic scheduling
  - Dynamic issue (i.e. superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
  More stable performance, higher complexity, potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture--ILP and Memory Systems
[Figure: "ILP Mountain" (simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation) and "Cache Mountain" (simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching)]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale 1000 to 100,000,000) versus year (1970 to 2005) for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000, with eras of bit-level, instruction-level and thread-level (?) parallelism]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The more pronounced are
a) the simultaneous multithreaded (SMT) processor and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a (single) processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
Support for 2-4 threads, but expect to get only a 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model - continue
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
• 4-8 general-purpose processing engines on chip, used to execute
  - independent programs
  - explicitly parallel programs (when possible)
  - speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say “microprocessor” tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that “Where you stand depends on where you sit.”
In this context, starting from our positions and experiences, this is our view on the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education (not just the undergraduate curriculum) organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Limits in Clocking
Traditional clocking techniques will reach their limit when the clock frequency reaches the 5-10 GHz range.
For higher-frequency clocking (> 10 GHz), new ideas and new ways of designing digital systems are needed.
[Figure: Intel's Microprocessors Clock Frequency. Global and local clock frequency (MHz, 0-12000) versus year and technology: 1997 (0.25 micron), 1999 (0.18), 2001 (0.15), 2003 (0.13), 2006 (0.10), 2009 (0.07), 2012 (0.05).]
Technology Trends - Example
As an illustration of just how quickly computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly.
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987, and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 ≈ 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l.
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 ≈ 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in about 39 min, and to make the one-way trip on less than 10 liters of gasoline.
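The compound-growth arithmetic of this example can be checked with a few lines of Python. This is only a sketch: the function name `compound` is ours, and because the code does not round the intermediate factor to 20.1 as the slide does, its results differ slightly from the slide's figures.

```python
def compound(value, rate, years):
    """Value after growing by `rate` per year (0.35 means 35%) for `years` years."""
    return value * (1.0 + rate) ** years

# 1977 starting point taken from the example: 150 km/h and 10 km/l
speed_1977, economy_1977 = 150.0, 10.0

# 1977-1987: 35% per year for 10 years
speed_1987 = compound(speed_1977, 0.35, 10)
economy_1987 = compound(economy_1977, 0.35, 10)

# 1987-2000: 50% per year for 13 more years
speed_2000 = compound(speed_1987, 0.50, 13)
economy_2000 = compound(economy_1987, 0.50, 13)

print(f"1987: {speed_1987:.0f} km/h, {economy_1987:.0f} km/l")
print(f"2000: {speed_2000:.0f} km/h, {economy_2000:.0f} km/l")
```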
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a “roadmap” based on the idea of Moore's law.
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available.
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm.
Ideally, processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7.
With such scaling, typical improvement figures are the following:
• 1.4-1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry.
It is characterized by 1000-fold growth in frequency (from 1 MHz to 1 GHz) and in integration (from ~10K to ~10M devices) in 25 years.
Microarchitecture attempts to increase both IPC and frequency.
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction and out-of-order execution can increase instructions per cycle (IPC).
Pipelining, as a microarchitecture idea, helps to increase frequency.
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program.
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages.
With frequencies higher than 1 GHz, more than 20 pipeline stages are used.
[Figure: Relative improvement (0-20) of frequency, CPI, performance and power versus pipeline depth.]
Performance of memory and CPU
Memory in a computer system is hierarchically organized.
In 1980, microprocessors were often designed without caches.
Nowadays, microprocessors often come with two levels of caches.
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 55% per year since 1987, and 35% per year until 1986.
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed.
Relative processor/memory speed
Types of Memories
[Figure: MOS memories are divided into RAMs and ROMs.
RAMs (VOLATILE: power off, contents lost): DRAM, SRAM.
ROMs (NON VOLATILE: power off, contents kept): ROM, EPROM, EEPROM, FLASH.]
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2-4.5 CPU cycles per instruction retired.
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed.
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse.
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams.
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy.
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns.
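The pin-bandwidth estimate in the anecdote is simple multiplication; a short Python sketch makes the three doubling factors explicit. The function name `bandwidth_gbytes_per_s` is ours, and 1 GByte is taken as 10^9 bytes.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer, period_ns):
    """Bytes per nanosecond is numerically equal to GBytes/s (10^9 bytes/s)."""
    return bytes_per_transfer / period_ns

# Today's chip in the anecdote: 16 bytes every 8 ns
today = bandwidth_gbytes_per_s(16, 8)

# Future chip: twice the clock rate, twice the issue width, two instruction streams
scale = 2 * 2 * 2
future = today * scale

print(f"today:  {today} GBytes/s")
print(f"future: {future} GBytes/s")
```

The future requirement equals 32 bytes every 2 ns, the figure quoted in the anecdote.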
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance.
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components.
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data.
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory.
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip.
The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses.
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.
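To get a feel for what these hit rates buy, the figures can be combined with the textbook average memory access time (AMAT) formula. The slides do not give this formula, so this is a sketch under our own assumptions: the function name `amat` is ours, and the values chosen inside the quoted ranges (3-cycle L1, 8-cycle L2, 50% L2 hit rate, 100-cycle main memory) are illustrative.

```python
def amat(l1_time, l1_hit, l2_time, l2_hit, mem_time):
    """Average memory access time in cycles for a two-level cache hierarchy.

    Every access pays the L1 time; L1 misses additionally pay the L2 time;
    L2 misses additionally pay the main-memory time.
    """
    l1_miss = 1.0 - l1_hit
    l2_miss = 1.0 - l2_hit
    return l1_time + l1_miss * (l2_time + l2_miss * mem_time)

# Illustrative values picked from the ranges quoted on the slide (assumptions)
with_l2 = amat(l1_time=3, l1_hit=0.95, l2_time=8, l2_hit=0.50, mem_time=100)

# Same L1, but every L1 miss goes straight to main memory (no second level)
without_l2 = amat(l1_time=3, l1_hit=0.95, l2_time=0, l2_hit=0.0, mem_time=100)

print(f"AMAT with L2:    {with_l2:.1f} cycles")
print(f"AMAT without L2: {without_l2:.1f} cycles")
```

Under these assumptions the second-level cache lowers the average access time, even though it catches only half of the first-level misses.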
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles.
A cache miss that eventually has to go to main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
As conclusion concerning memory - problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems: caches, buses, bandwidth and latency.
As conclusion concerning memory - problems, continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM), a new type of memory
Magnetic RAM - MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
Nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter on a silicon wafer.
MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM).
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
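[Figure: Design rule (nm, 0-120) and density (Gb, 0.1-100, log scale) of DRAM and FLASH for 2003, 2005 and 2010. Design rule labels: 100 nm, 90 nm, 70 nm, 55 nm. Density labels: 512 Mb, 1 Gb, 2 Gb, 4 Gb, 8 Gb, 16 Gb. FLASH is shown as SLC, MLC (2 bits/cell) and MLC (3 bits/cell).]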
Surpassing the Prediction from Moore's Law - DRAM vs MRAM
The famous Moore's Law predicts that memory density doubles every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
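The difference between the two doubling periods compounds quickly. The sketch below compares them over a hypothetical six-year span starting from 1 Gb; both the starting density and the span are our assumptions, chosen only to illustrate the arithmetic.

```python
def density_after(initial_gb, years, doubling_period_years):
    """Density after `years`, given one doubling every `doubling_period_years`."""
    return initial_gb * 2 ** (years / doubling_period_years)

initial_gb = 1.0   # hypothetical starting density
years = 6          # hypothetical time span

moore = density_after(initial_gb, years, 1.5)   # doubling every 1.5 years
flash = density_after(initial_gb, years, 1.0)   # doubling every year

print(f"Doubling every 1.5 years: {moore:.0f} Gb")
print(f"Doubling every year:      {flash:.0f} Gb")
print(f"Ratio after {years} years: {flash / moore:.0f}x")
```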
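[Figure: Density (Gb, 0.1-100, log scale) versus year, 2000-2010, for DRAM and FLASH compared with the Moore's Law trend. FLASH follows 2-fold density growth per year and is shown with MLC (2 bits/cell) and MLC (3 bits/cell). Density labels: 512 Mb, 1 Gb, 2 Gb, 4 Gb, 12 Gb.]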
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still lead the overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore's Law.
Overall memory prediction roadmap - cont.
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
“CMOS circuits dissipate little power by nature. So believed circuit designers.” (Kuroda-Sakurai '95)
“By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced.”
[Figure: Power dissipation in time. Power (W, 0.01-100, log scale) versus year (80-95); trend: x4 every 3 years.]
Gloom and Doom predictions
Power density will increase
[Figure: VDD, Power and Current Trend. Voltage (V, 0-2.5), power per chip (W, 0-200) and VDD current (A, 0-500) versus year, 1998-2014.]
International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), Electronic Industries Association of Japan (EIAJ), Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
Your car starter
Power Consumption - New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are

$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$

where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
Research Efforts in Low-Power Design

$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$

Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
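The quadratic effect of the supply voltage follows directly from the switching-power term p_t · C_L · V_dd² · f_CLK. The Python sketch below evaluates it for a made-up operating point; all numeric values (activity factor, capacitance, voltage, frequency) are hypothetical and serve only to show the scaling, and the model ignores the increase in gate delay that comes with a lower supply voltage.

```python
def switching_power(p_t, c_load, v_dd, f_clk):
    """Capacitive switching power in watts: P_sw = p_t * C_L * V_dd^2 * f_CLK."""
    return p_t * c_load * v_dd ** 2 * f_clk

# Hypothetical operating point (assumed values, not from the slides)
p_t = 0.2        # switching activity factor
c_load = 1e-9    # total switched capacitance in farads
f_clk = 1e9      # clock frequency in hertz

p_high = switching_power(p_t, c_load, 2.5, f_clk)
p_half_v = switching_power(p_t, c_load, 1.25, f_clk)        # half the supply voltage
p_half_c = switching_power(p_t, c_load / 2, 2.5, f_clk)     # half the load capacitance

print(f"V_dd = 2.5 V:        {p_high:.3f} W")
print(f"V_dd halved:         {p_half_v:.3f} W ({p_half_v / p_high:.2f}x)")
print(f"Capacitance halved:  {p_half_c:.3f} W ({p_half_c / p_high:.2f}x)")
```

Halving the voltage cuts switching power to a quarter, while halving the capacitance or the activity only cuts it in half.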
Amount of Reducing the Power Dissipation
[Figure: Gate Delay and Power Dissipation in Terms of Supply Voltage. Normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V), with axis marks at 0.6, 3.0 and 5.0 V.]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is decreasing the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product ptCL is called the average switched capacitance per cycle; this capacitance can be reduced at the system, architectural, RTL, circuit or technology level.
Low Power and Low Energy System Design
[Figure: Design levels and their low-power techniques; the figure's arrow is labeled "higher impact, more options".]
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology.
Multiple Frequency on the Chip as a Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention.
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: A central PLL driven by fCLK feeds four local clock generators (PLL 1/DLL 1 to PLL 4/DLL 4), which produce the local frequencies f11, f12, f21, f31, f32 and f41.]
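Energy Minimization Using Multiple Frequency
[Figure: PLL-based and DLL-based clock generation.
PLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO and divider by N; the regulated voltage feeds the clock distribution and frequency multiplier and the logic of the digital system.
DLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, and a voltage-controlled delay line (VCDL, delay TVCDL) built from delay cells DC1, DC2, ..., DCn with a control input; the digital system receives f1, 2f1, ..., nf1.]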
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: Clock gating and clock distribution.
Clock gating: a latch combines enable and clock into a gated clock that drives the target flip-flops (DFF with D, C and Q); the waveform shows activated and deactivated intervals.
Clock distribution: a PLL (Clk generator) distributes Clk to blocks A, B, C and D, controlled by Enable_A, Enable_B and Enable_C.]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
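The saving from clock gating can be illustrated with a toy behavioral model that counts how many clock edges actually reach a module's flip-flops. This is only a sketch: the function name and the enable pattern are hypothetical, and the model ignores the power consumed by the gating latch itself.

```python
def clock_edges_delivered(enable_per_cycle, gated=True):
    """Count the rising clock edges that reach the target flip-flops.

    Without gating, every cycle delivers an edge. With gating, an edge is
    delivered only in cycles where the enable signal is 1.
    """
    if not gated:
        return len(enable_per_cycle)
    return sum(1 for enable in enable_per_cycle if enable)

# Hypothetical activity pattern: the module is needed in 3 of 10 cycles
enables = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]

ungated_edges = clock_edges_delivered(enables, gated=False)
gated_edges = clock_edges_delivered(enables, gated=True)
saved_fraction = 1 - gated_edges / ungated_edges

print(f"edges without gating: {ungated_edges}")
print(f"edges with gating:    {gated_edges}")
print(f"clock activity saved: {saved_fraction:.0%}")
```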
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
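[Figure: Power manager and power state machine.
Power manager: an OBSERVER collects workload information (observations) from the SYSTEM, and a CONTROLLER issues commands to it.
Power state machine: RUN (P = 400 mW), IDLE (P = 50 mW) and SLEEP (P = 0.16 mW). RUN goes to IDLE on "wait for interrupt" and to SLEEP on "wait for wake-up event"; transition times are ~10 µs, ~90 µs and 160 ms.]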
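The power figures of the three states (RUN 400 mW, IDLE 50 mW, SLEEP 0.16 mW) allow a rough estimate of what power management can save. The sketch below computes the time-weighted average power for a workload; the time fractions are hypothetical, and the time and energy spent in state transitions are deliberately ignored.

```python
# Power per state in milliwatts, as given in the power state machine figure
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}

def average_power_mw(time_fraction):
    """Time-weighted average power; `time_fraction` maps state name to share of time."""
    if abs(sum(time_fraction.values()) - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(STATE_POWER_MW[state] * share for state, share in time_fraction.items())

# Baseline: the system never leaves RUN
always_running = average_power_mw({"RUN": 1.0})

# Hypothetical managed workload: 20% RUN, 30% IDLE, 50% SLEEP (assumed shares)
managed = average_power_mw({"RUN": 0.2, "IDLE": 0.3, "SLEEP": 0.5})

print(f"always RUN: {always_running:.2f} mW")
print(f"managed:    {managed:.2f} mW")
```

A real power manager must also weigh the wake-up latency of the SLEEP state against the energy saved, which this sketch leaves out.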
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: Two pie charts. Power breakdown: Clock, Memory, Control, I/O, Datapath. Dynamic instruction statistics: Arithmetic op, Control Flow, Data Move, Compare op, Logical op, Others.]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First generation, 1971-78: behind the power curve
• Second generation, 1979-85: becoming “real” computers
• Third generation, 1985-89: challenging the “establishment”
• Fourth generation, 1990-: architectural and performance leadership
The microprocessor today
When we say “microprocessor” today, we generally mean the shaded area of the figure.
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3500 transistors
  - First microprocessor-based computer (Micral): targeted at laboratory instrumentation, mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
  - Delivered in 1974
  - 4800 transistors
  - Performance < 0.2 MIPS
• Used in Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975: $297, or $395 with case; 256 bytes of memory, expandable to 64K; keyboard and floppy; 100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one (Homebrew Computer Club)
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - “Assembly language” compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
  - Based on 8088, the 8-bit bus version of 8086
Second Generation 1979-85
• Becoming “real” computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture, initial 16-bit implementation
  - First flat 32-bit address; support for paging
  - General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the “establishment”
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985 to the present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors
• Architectural ideas turn transistors into performance
  - Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide the latency of DRAM and increase bandwidth
  - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches add another level of caching: first multilevel cache in 1986; secondary cache sizes today are 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies: early restart after cache miss (1992); nonblocking caches continue during a cache miss (1994)
  - Cache-aware combos (computers, compilers, code writers): prefetching instructions to bring data into the cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock: superscalar uses dynamic issue decision (HW driven); VLIW uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  - determine parallelism dynamically
  - execute instructions out-of-order
  - speculative execution depending on branch prediction
• “Off-the-shelf” ILP techniques yielded a 20-year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  - On-chip
  - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out-of-order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in-flight)
  - Maintains precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Use compiler-centric approach while avoiding disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between Breakdown of tasks between compiler and runtime hardwarecompiler and runtime hardware
Todayrsquos Uniprocessor ILP Todayrsquos Uniprocessor ILP MenuMenu
Software Techniquesndash Static schedulingndash Static issue (ie VLIW)ndash Static branch predictionndash Aliaspointer analysisndash Static speculation
Hardware Techniquesndash Dynamic schedulingndash Dynamic issue (ie superscalar)ndash Dynamic branch predictionndash Dynamic disambiguationndash Dynamic speculation
Wide variety of approaches both hardware and compiler intensive
Lower hardware complexity
More longer range analysis
More machine dependence
Lower hardware complexity
More longer range analysis
More machine dependence
More stable performance
Higher complexity
Potential clock rate impact
No clear cut winners at the present
Big Picture: ILP and Memory Systems
[Figure: “ILP Mountain” with stages labeled simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation]
[Figure: “Cache Mountain” with stages labeled simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965–1995, for supercomputers, mainframes, minicomputers, and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistor count (log scale, 1,000 to 100,000,000) versus year, 1970–2005, with eras labeled bit-level parallelism, instruction-level, and thread-level; processors plotted: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel’s processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are:
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today’s processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
Support for 2-4 threads, but expect to get only 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model – continued
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
4 – 8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say “microprocessor” tomorrow, we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that “Where you stand depends on where you sit”
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education – not just the undergraduate curriculum – organized to support today’s graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change – so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT
Discrete mathematics, not continuous math, is the underpinning of IT
It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast
Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields
More knowledge of the full spectrum will be fundamental
Engineering is global and is performed in a holistic business context
The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them
They too are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Intel’s Microprocessors Clock Frequency
Technology Trends - Example
As an illustration of just how quickly computer technology is improving, let’s consider what would have happened if automobiles had improved equally quickly
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987 and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in under 39 min and to make the one-way trip on less than 10 liters of gasoline
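The compounding in this example can be checked with a few lines of Python. This is only a sketch: the starting values and growth rates come from the slide, while the helper name `improve` is made up for illustration. Because the slide rounds the growth factors to 20.1 and 194.6 before multiplying, the unrounded results differ slightly in the last digits.

```python
# Compound-growth check for the car analogy.
# Starting values and rates are taken from the slide; "improve" is an illustrative name.

def improve(value, rate_per_year, years):
    """Return value after compounding rate_per_year over the given number of years."""
    return value * (1.0 + rate_per_year) ** years

speed_1977_kmh = 150.0
economy_1977_kml = 10.0

# 1977 -> 1987: 35% per year for 10 years
speed_1987 = improve(speed_1977_kmh, 0.35, 10)
economy_1987 = improve(economy_1977_kml, 0.35, 10)

# 1987 -> 2000: 50% per year for 13 years
speed_2000 = improve(speed_1987, 0.50, 13)
economy_2000 = improve(economy_1987, 0.50, 13)

print(f"1987: {speed_1987:.0f} km/h, {economy_1987:.0f} km/l")
print(f"2000: {speed_2000:.0f} km/h, {economy_2000:.0f} km/l")
```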
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a “roadmap” based on the idea of Moore’s law
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm
Ideally, processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7
With such scaling, typical improvement figures are the following:
• 1.4 – 1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
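A rough sketch of where two of these figures come from, under a stated assumption: the 0.7 shrink factor and the 1.35 voltage reduction are from the slide, but the idea that load capacitance scales linearly with the shrink factor is a simplification introduced here, not something the slide states. Under that assumption the switching energy per transition (proportional to C·V²) drops by roughly 2.6 times, which is in the same range as the "three times" quoted above.

```python
# Ideal scaling sketch: every linear dimension shrinks by s = 0.7 per generation.
# Assumption (not from the slide): load capacitance scales linearly with s.

s = 0.7                     # linear shrink factor per generation (from the slide)
voltage_scale = 1 / 1.35    # operating voltage is 1.35 times lower (from the slide)

area_scale = s * s                              # transistor area scales with s^2
cap_scale = s                                   # assumed: C proportional to dimension
energy_scale = cap_scale * voltage_scale ** 2   # switching energy ~ C * V^2

print(f"area factor:   {area_scale:.2f} (about {1 / area_scale:.1f}x smaller)")
print(f"energy factor: {energy_scale:.2f} (about {1 / energy_scale:.1f}x lower)")
```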
Process Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry
It has grown about 1000 times in frequency (from 1 MHz to 1 GHz) and in integration (from ~10K to ~10M devices) in 25 years
Microarchitecture attempts to increase both IPC and frequency
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction, and out-of-order execution can increase instructions per cycle (IPC)
Pipelining, as a microarchitecture idea, helps to increase frequency
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages
With frequencies higher than 1 GHz, more than 20 pipeline stages are used
[Figure: relative improvement in frequency, CPI, performance, and power as a function of pipeline depth]
Performance of memory and CPU
Memory in a computer system is hierarchically organized
In 1980 microprocessors were often designed without caches
Nowadays microprocessors often come with two levels of caches
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed
Relative processor/memory speed
Types of Memories
MOS memories
– RAMs (volatile: contents lost at power-off): DRAM, SRAM
– ROMs (non-volatile: contents kept at power-off): ROM, EPROM, EEPROM, FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 – 4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
An anecdote – continued
If a CPU chip today needs to move 2 GBytes/s (say 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up – 32 bytes every 2 ns
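The arithmetic behind these bandwidth figures can be sketched as follows. The numbers are the ones quoted on the slide; the function name `bandwidth_gb_per_s` is illustrative, and 1 GByte is taken as 10^9 bytes so that bytes per nanosecond equal GBytes per second.

```python
# Pin-bandwidth arithmetic from the anecdote.
# Assumes 1 GByte = 1e9 bytes, so bytes per nanosecond == GBytes per second.

def bandwidth_gb_per_s(bytes_per_transfer, period_ns):
    """Bandwidth in GBytes/s for one transfer of the given size every period_ns."""
    return bytes_per_transfer / period_ns

today = bandwidth_gb_per_s(16, 8)      # 16 bytes every 8 ns

clock_rate_factor = 2                  # twice the clock rate
issue_width_factor = 2                 # twice the issue width
instruction_streams = 2                # two instruction streams
future = today * clock_rate_factor * issue_width_factor * instruction_streams

print(f"today:  {today} GBytes/s")
print(f"future: {future} GBytes/s")
print(f"32 bytes every 2 ns: {bandwidth_gb_per_s(32, 2)} GBytes/s")
```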
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~ 32 – 128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
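To see what these hit rates mean for the average access time, here is a minimal sketch using a simple additive model: an L1 miss pays the L2 access time on top of the L1 time, and an L2 miss additionally pays the main memory time. The model and the function name are assumptions made for illustration. The cycle counts are picked from the ranges quoted in the deck (2 cycles for L1, 8 for L2, about 100 for main memory).

```python
# Average memory access time for a two-level cache hierarchy.
# Simple additive model (an assumption): each miss adds the next level's access time.

def average_access_cycles(l1_cycles, l1_hit_rate, l2_cycles, l2_hit_rate, memory_cycles):
    """Return the expected number of cycles per memory access."""
    l1_miss_penalty = l2_cycles + (1.0 - l2_hit_rate) * memory_cycles
    return l1_cycles + (1.0 - l1_hit_rate) * l1_miss_penalty

with_l2 = average_access_cycles(
    l1_cycles=2, l1_hit_rate=0.95,     # first level: 2-3 cycles, ~95% of accesses
    l2_cycles=8, l2_hit_rate=0.50,     # second level: 6-10 cycles, >50% of L1 misses
    memory_cycles=100,                 # main memory: about 100 cycles
)

# Same first-level cache, but every L1 miss goes straight to main memory.
without_l2 = 2 + (1.0 - 0.95) * 100

print(f"with L2:    {with_l2:.1f} cycles per access")
print(f"without L2: {without_l2:.1f} cycles per access")
```

Under these numbers the hierarchy brings the average access close to the first-level latency, even though main memory is about 50 times slower than the first-level cache.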
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
As conclusion concerning memory – problems
Today’s chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems – caches, buses, bandwidth, and latency
As conclusion concerning memory – problems, continued
If the memory research community would follow the microprocessor community’s lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM) – new type of memory
Magnetic RAM – MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm – just a few thousand atoms – in diameter on a silicon wafer
MRAM is nonvolatile and has the fast read-and-write performance of static RAM (SRAM)
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM)
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] (100, 90, 70, 55 nm) and density [Gb] (log scale) versus year, 2003–2010, for DRAM and FLASH, including SLC and MLC with 2 and 3 bits/cell]
Surpassing the Prediction from Moore’s Law – DRAM vs MRAM
The famous Moore’s Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year
[Figure: density [Gb] (log scale, 0.1 to 100) versus year, 2000–2010, for FLASH (2-fold density per year; MLC with 2 and 3 bits/cell) and DRAM, compared with Moore’s Law]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will keep leading overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore’s Law
Overall memory prediction roadmap – cont.
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point?
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh
Power consumption
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
“CMOS circuits dissipate little power by nature. So believed circuit designers” (Kuroda-Sakurai ’95)
“By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced”
[Figure: power dissipation (W, log scale from 0.01 to 100) versus year, 1980–1995; trend of ×4 every 3 years]
Power dissipation in time
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage [V] (0 to 2.5), power per chip [W] (0 to 200), and VDD current [A] (0 to 500) versus year, 1998–2014]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), Electronic Industries Association of Japan (EIAJ), Korea Semiconductor Industry Association (KSIA), and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai’s ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: Your car starter]
Power Consumption – New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 – capacitive switching power (dynamic, dominant)
P2 – short circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
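The terms of this equation can be evaluated directly. The sketch below follows the formula as written on the slide; the function name and all numeric values (switching activity, capacitance, currents) are hypothetical and chosen only to show the relative sizes of the terms and the quadratic dependence of P1 on the supply voltage.

```python
# Evaluating the CMOS power equation term by term.
# All numeric values below are hypothetical, for illustration only.

def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p_static=0.0):
    """Return (P1, P2, P3, total) in watts for the given operating point."""
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leak * v_dd                      # leakage power
    return p1, p2, p3, p1 + p2 + p3 + p_static

# Hypothetical operating point: 10% switching activity, 1 nF total switched
# capacitance, 100 MHz clock, 1 mA short-circuit and 10 uA leakage current.
full = average_power(p_t=0.1, c_load=1e-9, v_dd=2.5, f_clk=100e6, i_sc=1e-3, i_leak=1e-5)
half = average_power(p_t=0.1, c_load=1e-9, v_dd=1.25, f_clk=100e6, i_sc=1e-3, i_leak=1e-5)

print(f"Vdd = 2.50 V: P1 = {full[0] * 1e3:.2f} mW, total = {full[3] * 1e3:.2f} mW")
print(f"Vdd = 1.25 V: P1 = {half[0] * 1e3:.2f} mW, total = {half[3] * 1e3:.2f} mW")
print(f"P1 ratio after halving Vdd: {full[0] / half[0]:.1f}")
```

Halving the supply voltage cuts the switching term by a factor of four, while the current-driven terms in this simplified evaluation only halve because the currents are held fixed.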
Research Efforts in Low-Power Design
$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flop
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], 0.6 to 5.0 V]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is:
Decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product pt·CL is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit, or technology level
General Approaches to Reduce Power
Low Power and Low Energy System Design
(higher levels: higher impact, more options)
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
The design of low power circuits can be tackled at different levels, from system to technology
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four local PLL/DLL pairs (PLL 1/DLL 1 through PLL 4/DLL 4), each producing its own clock frequencies (labels f11, f12, f21, f31, f32, f41)]
Energy Minimization Using Multiple Frequency
[Figure, PLL based: blocks labeled phase detector, current pump, loop filter, VCO, divider by N, clock distribution & frequency multiplier, regulated voltage, logic, digital system; signals CLKREF, CLKFB, up, down]
[Figure, DLL based: blocks labeled phase detector, current pump, loop filter, control, VCDL with delay cells DC1, DC2, ..., DCn (in, out), digital system; signals CLKREF, CLKFB, up, down; outputs f1, 2f1, ..., nf1]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure, clock gating: a latch and the enable signal derive a gated clock from the clock; the gated clock drives the target flip-flops (DFF with inputs D, C and output Q)]
[Figure, clock distribution: a PLL (clock generator) distributes Clk to modules A, B, C, and D, with Enable_A, Enable_B, and Enable_C marking activated and deactivated modules]
– The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
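A toy model of clock gating is sketched below. It is a simplification under stated assumptions: the gated clock is modeled as a plain AND of clock and enable, sample by sample, whereas a real design uses a latch on the enable (as in the figure) to avoid glitches. The waveforms and function names are made up for illustration. Counting rising edges gives a rough proxy for how much switching the gated module sees.

```python
# Toy clock-gating model: the module only sees clock edges while it is enabled.
# Simplification: plain AND gating, without the glitch-avoiding latch of a real design.

def gated_clock(clock, enable):
    """AND the clock with the enable signal, sample by sample."""
    return [c & e for c, e in zip(clock, enable)]

def rising_edges(signal):
    """Count 0 -> 1 transitions in a sampled waveform."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)

clock = [0, 1] * 8                      # 8 clock cycles, two samples per cycle
enable = [1] * 6 + [0] * 8 + [1] * 2    # module idle during the middle 4 cycles

gated = gated_clock(clock, enable)

print("edges seen without gating:", rising_edges(clock))
print("edges seen with gating:   ", rising_edges(gated))
```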
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip is a less aggressive approach that is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
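[Figure, power state machine: states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt), and SLEEP (P = 0.16 mW, wait for wake-up event); transition times ~10 µs, ~90 µs, and 160 ms]
[Figure, power manager: an observer and a controller; the power manager receives observations and workload information from the system and issues commands to it]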
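The benefit of moving to a lower power state can be estimated from the power figures of the state machine. The sketch below is deliberately crude: it uses the three power levels shown in the figure (400 mW, 50 mW, 0.16 mW), assumes a hypothetical idle period of 2 seconds, and ignores the energy and latency of the state transitions, which a real power manager has to weigh against the savings.

```python
# Energy spent during an idle period, per power state.
# Power levels are from the power state machine figure; the 2 s idle period is
# hypothetical, and transition energy/latency are ignored (a simplifying assumption).

POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}

def energy_mj(state, seconds):
    """Energy in millijoules spent resting in the given state (mW * s = mJ)."""
    return POWER_MW[state] * seconds

idle_period_s = 2.0
for state in ("RUN", "IDLE", "SLEEP"):
    print(f"{state:5s}: {energy_mj(state, idle_period_s):8.2f} mJ")
```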
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure, power breakdown: pie chart with slices for clock, memory, control, I/O, and datapath (slice labels 12, 16, 25, 43)]
[Figure, dynamic instruction statistics: pie chart with slices for compare op, logical op, data move, control flow, arithmetic op, and others]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline – Microprocessors’ Generations
• First Generation 1971-78
  – Behind the power curve
• Second Generation 1979-85
  – Becoming “real” computers
• Third Generation 1985-89
  – Challenging the “establishment”
• Fourth Generation 1990-
  – Architectural and performance leadership
The microprocessor today
When we say “microprocessor” today, we generally mean the shaded area of the figure
The First Generation 1971-78
Getting enough bits and transistors
Transistor counts < 50000
Performance < 0.5 MIPS
Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
First general-purpose single-chip microprocessor
Shipped in 1971
8-bit architecture, 4-bit implementation
2300 transistors
Performance < 0.1 MIPS
8008: 8-bit implementation in 1972
– 3500 transistors
– First microprocessor-based computer (Micral)
  Targeted at laboratory instrumentation
  Mostly sold in Europe
Intel 8080
Intel’s first 16-bit architecture
– Delivered in 1974
– 4800 transistors
– Performance < 0.2 MIPS
Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
  $297, or $395 with case
  256 bytes of memory, expandable to 64K
  Keyboard and floppy
  100-line bus becomes S-100, first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
  Homebrew Computer Club
Intel 8086
Introduced in 1978
– Performance < 0.5 MIPS
New 16-bit architecture
– “Assembly language” compatible with 8080
– 29000 transistors
– Includes memory protection, support for FP coprocessor
In 1981 IBM introduces PC
– Based on 8088: 8-bit bus version of 8086
Second Generation 1979-85
Becoming “real” computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs, and PCs based on microprocessors
Transistors > 50000
Performance <= 1 MIPS
Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
Major architectural step in microprocessors
– First 32-bit architecture
  initial 16-bit implementation
– First flat 32-bit address
  Support for paging
– General-purpose register architecture
  Loosely based on PDP-11
First implementation in 1979
– 68000 transistors
– < 1 MIPS
Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the “establishment”
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
Transistors < 500K
Performance > 5 MIPS
Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
Several firsts
- First RISC microprocessor
- First microprocessor to provide integrated support for instruction & data cache
- First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
- 125,000 transistors
- 5-8 MIPS
Fourth Generation 1990-
Architectural and performance leadership
- First 64-bit architecture
- First multiple-issue machine
- First multilevel caches
Transistors: > 1M
Clock rates: > 100 MHz
Performance: > 50 MIPS
Processors
- Intel i860, Pentium; MIPS R4000, MIPS R10000; DEC Alpha; Sun UltraSPARC; HP PA-RISC; PowerPC
Generation 4.5: same basic approach, but faster clock rates & wider issue
- Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
Performance increases at 1.6x per year
- True from 1985 to present
Combination of technology and architectural enhancements
- Technology provides faster transistors and more of them
- Faster transistors lead to higher clock rates
- More transistors
Architectural ideas turn transistors into performance
- Responsible for about half the yearly performance growth
Two key architectural directions
- Sophisticated memory hierarchies
- Exploiting instruction-level parallelism
Memory Hierarchies
Caches hide the latency of DRAM and increase bandwidth
- CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
- On-chip: from 128 bytes (1984) to 100K+ bytes
- Multilevel caches add another level of caching
- First multilevel cache: 1986
- Secondary cache sizes today: 128 KB to 4-16 MB
Trend 2: Advances in caching techniques
- Reduce or hide cache miss latencies
- Early restart after cache miss (1992)
- Nonblocking caches: continue during a cache miss (1994)
- Cache-aware combos: computers, compilers, code writers
- Prefetching: instruction to bring data into cache early
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
- Overlapping execution in a pipeline
- Issuing multiple instructions per clock
- Superscalar: uses dynamic issue decision (HW driven)
- VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
- Determine parallelism dynamically
- Execute instructions out of order
- Speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
First 64-bit architecture
Integrated caches
- On-chip
- Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
- Deep pipeline
- 1.4M transistors
- Initially 100 MHz
- > 50 MIPS
Intel i860
First multiple-issue microprocessor
- 2 instructions/clock
- Dual issue mode
- Novel push pipeline
- Novel cache bypass
Implemented in 1991
- 1.3M transistors
- 50 MIPS
Used primarily as attached processor (e.g., graphics)
MIPS R10000
First speculative processor
- Instructions scheduled and executed out of order
- Up to 4 instructions can complete per clock
- Window of 32 instructions (up to 32 in flight)
- Maintains precise state by completing instructions in order
Implemented in 1996
- 6.8M transistors
- 200 MHz
Intel IA-64 and Itanium
EPIC architecture
- Use a compiler-centric approach while avoiding its disadvantages
- Parallelism demarcated by the compiler
- Many special instructions & features for exploiting ILP in the compiler
Itanium
- First implementation (2001)
- 25M transistors
- 800 MHz
- 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software Techniques
- Static scheduling
- Static issue (i.e., VLIW)
- Static branch prediction
- Alias/pointer analysis
- Static speculation
Hardware Techniques
- Dynamic scheduling
- Dynamic issue (i.e., superscalar)
- Dynamic branch prediction
- Dynamic disambiguation
- Dynamic speculation
Wide variety of approaches, both hardware and compiler intensive
- Software techniques: lower hardware complexity, more longer-range analysis, more machine dependence
- Hardware techniques: more stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two "mountains". ILP Mountain labels: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain labels: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year (1965-1995) for supercomputers, mainframes, minicomputers, and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year (1970-2005), with eras of bit-level, instruction-level, and thread-level (?) parallelism; data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000.]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
They support 2-4 threads, but expect only about a 1.3x improvement in throughput.
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures

Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Technology Trends - Example
As an illustration of just how quickly computer technology is improving, let's consider what would have happened if automobiles had improved equally quickly.
Assume that an average car in 1977 had a top speed of 150 km/h and an average fuel economy of 10 km/l. If both top speed and efficiency improved at 35% per year from 1977 to 1987, and by 50% per year from 1987 to 2000, tracking computer performance, what would the average top speed and fuel economy of a car be in 1987? In 2000?
Solution
In 1987: The span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3,015 km/h and a fuel economy of 201 km/l.
In 2000: Thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in about 39 min and to make the one-way trip on less than 10 liters of gasoline.
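The arithmetic of this example can be checked with a short script. This is only a sketch: the function name `compound` is illustrative, and the mean Earth-Moon distance of about 384,400 km is an assumption that does not appear in the slides. Because the script does not round the growth factors to 20.1 and 194.6, its results differ slightly from the figures quoted above.

```python
def compound(value, rate, years):
    """Grow `value` by a fixed annual `rate` (0.35 means 35%) for `years` years."""
    return value * (1.0 + rate) ** years

# Baseline and growth rates taken from the example.
speed_1987 = compound(150.0, 0.35, 10)           # km/h
economy_1987 = compound(10.0, 0.35, 10)          # km/l
speed_2000 = compound(speed_1987, 0.50, 13)      # km/h
economy_2000 = compound(economy_1987, 0.50, 13)  # km/l

# Assumption (not from the slides): mean Earth-Moon distance in km.
MOON_KM = 384_400.0
minutes_to_moon = MOON_KM / speed_2000 * 60.0
litres_one_way = MOON_KM / economy_2000

print(f"1987: {speed_1987:,.0f} km/h, {economy_1987:,.0f} km/l")
print(f"2000: {speed_2000:,.0f} km/h, {economy_2000:,.0f} km/l")
print(f"Moon: {minutes_to_moon:.1f} min, {litres_one_way:.1f} l one way")
```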
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law.
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available.
Trends in feature size over time
Processor technology today
The most advanced process technology today (year 2003) is 0.10 µm = 100 nm.
Ideally, process technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7.
With such scaling, typical improvement figures are the following:
• 1.4-1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
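The scaling figures above can be related to the 0.7 factor with a rough calculation. The sketch below assumes that gate delay and capacitance both scale with the linear dimension and that switching energy goes as C·V²; these modelling assumptions, and all variable names, are illustrative and do not come from the slides.

```python
S = 0.7                   # linear scaling factor per generation (from the slide)
VOLTAGE_REDUCTION = 1.35  # supply voltage reduction factor (from the slide)

area_ratio = S * S        # transistor area scales with the square of the dimension
speedup = 1.0 / S         # assumption: gate delay proportional to linear dimension
cap_ratio = S             # assumption: capacitance proportional to linear dimension
# Switching energy per transition goes as C * V^2.
energy_ratio = cap_ratio / VOLTAGE_REDUCTION ** 2

print(f"transistors are {1 / area_ratio:.2f}x smaller")
print(f"transistors are {speedup:.2f}x faster")
print(f"switching energy is {1 / energy_ratio:.2f}x lower")
```

Under these assumptions the area and speed figures land on the slide's "two times" and "1.4-1.5 times", and the switching energy comes out somewhat below the quoted "three times".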
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry.
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and integration (from ~10K to ~10M devices) in 25 years.
Microarchitecture attempts to increase both IPC and frequency.
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction, and out-of-order execution can increase instructions per cycle (IPC).
Pipelining, as a microarchitecture idea, helps to increase frequency.
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program.
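The three levers named above (IPC, frequency, and dynamic instruction count) combine in the usual execution-time relation: time = instructions / (IPC × frequency). The sketch below illustrates it; the function name and all numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def execution_time(instructions, ipc, frequency_hz):
    """Execution time in seconds: instruction count / (IPC * clock frequency)."""
    return instructions / (ipc * frequency_hz)

# Hypothetical baseline: 1e9 dynamic instructions, IPC of 1.0, 1 GHz clock.
baseline = execution_time(1.0e9, 1.0, 1.0e9)

# Hypothetical improved design: the compiler removes 10% of the instructions,
# microarchitecture raises IPC to 1.5, deeper pipelining raises the clock to 1.2 GHz.
improved = execution_time(0.9e9, 1.5, 1.2e9)

speedup = baseline / improved
print(f"baseline {baseline:.2f} s, improved {improved:.2f} s, speedup {speedup:.2f}x")
```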
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages.
With frequencies higher than 1 GHz, more than 20 pipeline stages are used.
[Figure: relative improvement in frequency, CPI, performance, and power versus pipeline depth (axis ticks 0 to 20).]
Performance of memory and CPU
Memory in a computer system is hierarchically organized.
In 1980, microprocessors were often designed without caches.
Nowadays, microprocessors often come with two levels of caches.
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987.
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed.
Relative processor/memory speed
Types of Memories
MOS memories
- RAMs (volatile: contents lost at power-off): DRAM, SRAM
- ROMs (non-volatile: contents kept at power-off): ROM, EPROM, EEPROM, FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2-4.5 CPU cycles per instruction retired.
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed.
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse.
An anecdote (continued)
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams.
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy.
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns.
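The numbers in this anecdote can be reproduced with a few lines of arithmetic. The sketch below uses illustrative function names; the idle-cycle bound assumes that every cycle that retires anything retires at least one instruction, which is an assumption of the sketch rather than a statement from the slides.

```python
def bandwidth_gb_per_s(bytes_per_transfer, ns_per_transfer):
    """Bytes per nanosecond is numerically equal to GBytes/s (10^9 bytes per second)."""
    return bytes_per_transfer / ns_per_transfer

today = bandwidth_gb_per_s(16, 8)        # 16 bytes every 8 ns
# Twice the clock rate, twice the issue width, two instruction streams.
future = today * 2 * 2 * 2
future_check = bandwidth_gb_per_s(32, 2)  # 32 bytes every 2 ns

def min_idle_fraction(cpi):
    """Lower bound on the fraction of cycles that retire nothing, assuming
    each non-idle cycle retires at least one instruction."""
    return 1.0 - 1.0 / cpi

print(f"today {today} GB/s, future {future} GB/s, check {future_check} GB/s")
print(f"idle cycles at CPI 4.2: at least {min_idle_fraction(4.2):.0%}")
print(f"idle cycles at CPI 4.5: at least {min_idle_fraction(4.5):.0%}")
```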
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance.
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components.
Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data.
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory.
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip.
The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses.
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles.
A cache miss that eventually has to go to main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
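The effect of the two cache levels can be estimated with the standard average memory access time (AMAT) model. The sketch below plugs in the cycle counts and hit rates quoted above; it assumes a simple serial model in which a miss at one level adds the full access time of the next, and the function name `amat` is illustrative.

```python
def amat(l1_cycles, l1_hit_rate, l2_cycles, l2_hit_rate, memory_cycles):
    """Average memory access time in cycles for a two-level cache hierarchy.

    Assumption: a miss at one level pays that level's access time plus the
    full access time of the next level (simple serial model).
    """
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * memory_cycles)

# Optimistic and pessimistic ends of the ranges given in the text.
best_case = amat(2, 0.95, 6, 0.50, 100)
worst_case = amat(3, 0.95, 10, 0.50, 100)

print(f"AMAT between {best_case:.1f} and {worst_case:.1f} cycles, "
      f"versus about 100 cycles with no caches")
```

Even with half of the first-level misses going all the way to main memory, the average access stays within a handful of cycles under this model, which is why the structure of the hierarchy matters so much.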
Conclusions concerning memory problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems: caches, buses, bandwidth, and latency.
Conclusions concerning memory problems (continued)
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can cause a 25% performance change.
b) Burst width, which refers to data access granularity, can cause a 15% performance change.
c) Magnetic RAM (MRAM): a new type of memory.
Magnetic RAM (MRAM) or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes.
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter, on a silicon wafer.
MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM).
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule (nm, axis 0 to 120; points at 100 nm, 90 nm, 70 nm, 55 nm) and density (Gb, log scale 0.1 to 100) for FLASH and DRAM, 2003-2010. Density labels: 512 Mb, 1 Gb, 2 Gb, 4 Gb, 8 Gb, 16 Gb. FLASH cell types: SLC, MLC (2 bits/cell), MLC (3 bits/cell).]
Surpassing the Prediction from Moore's Law - DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
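[Figure: density (Gb, log scale 0.1 to 100) versus year (2000-2010) for FLASH and DRAM against the Moore's Law trend; FLASH shows a two-fold density increase per year. Density labels: 512 Mb, 1 Gb, 2 Gb, 4 Gb, 12 Gb. FLASH cell types: MLC (2 bits/cell), MLC (3 bits/cell).]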
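The difference between the two doubling periods compounds quickly. The following sketch compares growth under a 1.5-year doubling period with growth under a 1-year doubling period; the six-year horizon and the function name are illustrative choices, not values from the slides.

```python
def growth_factor(years, doubling_period_years):
    """Density multiplier after `years` when density doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)

YEARS = 6  # illustrative horizon
moore = growth_factor(YEARS, 1.5)       # doubling every 1.5 years
nand_flash = growth_factor(YEARS, 1.0)  # doubling every year

print(f"after {YEARS} years: {moore:.0f}x at 1.5-year doubling, "
      f"{nand_flash:.0f}x at 1-year doubling")
```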
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep leading overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore's Law.
Overall memory prediction roadmap (continued)
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production
• During 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai '95)
"By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: power (W, log scale 0.01 to 100) versus year (1980-1995), with a trend of x4 every 3 years.]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage (V, 0 to 2.5), power per chip (W, 0 to 200), and VDD current (A, 0 to 500) versus year (1998-2014).]
International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), Electronic Industries Association of Japan (EIAJ), Korea Semiconductor Industry Association (KSIA), and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: Your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
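The quadratic dependence on supply voltage follows directly from the switching power expression p_t · C_L · V_dd² · f_CLK. The sketch below evaluates it for hypothetical component values (the activity factor, capacitance, voltage, and clock frequency are all illustrative and do not come from the slides) and compares the effect of halving the voltage with halving the load capacitance.

```python
def switching_power(p_t, c_load, v_dd, f_clk):
    """Capacitive switching power in watts: p_t * C_L * Vdd^2 * f_CLK."""
    return p_t * c_load * v_dd ** 2 * f_clk

# Hypothetical operating point, for illustration only.
P_T = 0.2      # switching activity factor
C_LOAD = 1e-9  # total switched capacitance in farads
V_DD = 3.3     # supply voltage in volts
F_CLK = 100e6  # clock frequency in hertz

base = switching_power(P_T, C_LOAD, V_DD, F_CLK)
half_voltage = switching_power(P_T, C_LOAD, V_DD / 2, F_CLK)
half_capacitance = switching_power(P_T, C_LOAD / 2, V_DD, F_CLK)

print(f"baseline: {base * 1e3:.1f} mW")
print(f"half Vdd: {half_voltage / base:.2f} of baseline (quadratic)")
print(f"half C_L: {half_capacitance / base:.2f} of baseline (linear)")
```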
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V), with supply voltage ticks at 0.6, 3.0, and 5.0 V.]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is:
Decreasing the activity of some parts within the VLSI IC
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product pt·CL is called the average switched capacitance per cycle, and the main directions for reducing this capacitance are at the system, architectural, RTL, circuit, or technology level.
General Approaches to Reduce Power
Low Power and Low Energy System Design
Design levels, ordered from higher impact and more options downward:
- System level: design partitioning, power down
- Algorithm level: complexity, concurrency, locality, regularity, data representation
- Architecture level: voltage scaling, parallelism, instruction set, signal correlations
- Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
- Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology.
Multiple Frequency on the Chip as a Technique to Reduce Power
It is a less aggressive approach that attracts more attention.
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL distributes fCLK to four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), which produce the local frequencies f11, f12, f21, f31, f32, and f41.]
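Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation. CLKREF and CLKFB feed a phase detector whose up/down outputs drive a current pump, loop filter, and VCO, with a divider by N in the feedback path; other blocks are labeled regulated voltage, clock distribution & frequency multiplier, logic, and digital system.]
[Figure: DLL-based clock generation. CLKREF and CLKFB feed a phase detector, current pump, and loop filter that control a VCDL (delay TVCDL) with stages DC1, DC2, ..., DCn between in and out, producing f1, 2f1, ..., nf1 for the digital system.]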
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating. A latch captures enable, and the latched enable is combined with clock to produce the gated clock that drives the target flip-flops (DFF with D, C, Q); the waveform shows activated and deactivated intervals.]
[Figure: clock distribution. A PLL (clock generator) drives Clk to blocks A, B, C, and D, with Enable_A, Enable_B, and Enable_C gating individual blocks.]
- The use of a gated clock is the most common approach to reduce energy. Unused modules are turned off by suppressing the clock to the module.
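The behaviour of a latch-based gated clock can be illustrated with a small cycle-level simulation. This is only a sketch: it assumes the common arrangement in which the enable is captured by a latch that is transparent while the clock is low and the latched value is ANDed with the clock, which may differ in detail from the circuit in the figure. The function name and the sample waveforms are hypothetical.

```python
def gated_clock(clock, enable):
    """Simulate latch-based clock gating over sampled waveforms.

    Assumption: the latch is transparent while the clock is low and holds its
    value while the clock is high, so enable changes during the high phase
    cannot create partial pulses on the gated clock.
    """
    latched = 0
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en           # latch follows enable during the low phase
        out.append(clk & latched)  # gated clock = clock AND latched enable
    return out

# Hypothetical sampled waveforms (one sample per half clock period).
clock = [0, 1, 0, 1, 0, 1, 0, 1]
enable = [1, 1, 0, 1, 0, 0, 1, 0]

gated = gated_clock(clock, enable)
naive = [c & e for c, e in zip(clock, enable)]  # plain AND, no latch

print("gated:", gated)
print("naive:", naive)
print(f"clock pulses delivered: {sum(gated)} of {sum(clock)}")
```

In this run the module receives two of the four clock pulses. The plain AND version lets the enable change during a high phase reach the output directly, which is the hazard the latch is there to avoid.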
Energy Minimization Using Multiple Supply Voltage
• The use of multiple supply voltages on the chip, as a less aggressive approach, is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
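The trade-off can be sketched numerically. The Python example below assumes that the energy per switching event is proportional to C_L · Vdd² and compares a single-supply design with a dual-supply design in which only the critical-path modules keep the high voltage. The capacitances and voltage levels are hypothetical, and the cost of level converters between voltage domains is ignored.

```python
def switching_energy(c_load, v_dd):
    """Energy per switching event in joules, taken as C_L * Vdd^2."""
    return c_load * v_dd ** 2


def total_energy(modules, v_high, v_low):
    """modules is a list of (c_load, on_critical_path) pairs.

    Critical-path modules run at v_high, all others at v_low.
    """
    return sum(
        switching_energy(c_load, v_high if critical else v_low)
        for c_load, critical in modules
    )


# Hypothetical design: one critical module, one larger noncritical module.
MODULES = [(1e-12, True), (3e-12, False)]

if __name__ == "__main__":
    single = total_energy(MODULES, v_high=2.5, v_low=2.5)
    dual = total_energy(MODULES, v_high=2.5, v_low=1.8)
    print(f"single supply: {single:.3e} J, dual supply: {dual:.3e} J")
```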
System Level Dynamic Power Management as another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
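[Figure: power manager and power state machine. Power manager: an observer receives workload information and observations from the system, and a controller sends commands to it. Power state machine: RUN (P = 400 mW), IDLE (P = 50 mW) and SLEEP (P = 0.16 mW); transition labels: wait for interrupt, wait for wake-up event; transition times ~10 µs, ~90 µs and 160 ms]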
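To make the benefit of the power state machine concrete, the Python sketch below computes the average power for a given share of time spent in each state. The state powers are read from the figure as 400 mW (RUN), 50 mW (IDLE) and 0.16 mW (SLEEP), the last assuming that the extracted value lost its decimal point. The time shares are hypothetical, and transition energy and latency are ignored.

```python
# State powers in mW, as read from the power state machine figure.
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(time_fractions):
    """Average power for a mapping state -> fraction of time in that state.

    Transition costs between states are ignored (simplifying assumption).
    """
    if abs(sum(time_fractions.values()) - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(STATE_POWER_MW[state] * share
               for state, share in time_fractions.items())


if __name__ == "__main__":
    always_on = average_power_mw({"RUN": 1.0})
    managed = average_power_mw({"RUN": 0.1, "IDLE": 0.3, "SLEEP": 0.6})
    print(f"always running: {always_on} mW, managed: {managed:.3f} mW")
```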
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown: clock, memory, control, I/O, datapath. Dynamic instruction statistics: compare op, logical op, data move, control flow, arithmetic op, others]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Microprocessors' Generations
• First generation 1971-78: behind the power curve
• Second generation 1979-85: becoming "real" computers
• Third generation 1985-89: challenging the "establishment"
• Fourth generation 1990- : architectural and performance leadership
The microprocessor today
When we say "microprocessor" today we generally mean the shaded area of the figure
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
• Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
– 3500 transistors
– First microprocessor-based computer (Micral)
• Targeted at laboratory instrumentation
• Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
– Delivered in 1974
– 4800 transistors
– Performance < 0.2 MIPS
• Used in the Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
• $297, or $395 with case
• 256 bytes of memory, expandable to 64K
• Keyboard and floppy
• 100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
• Homebrew Computer Club
Intel 8086
• Introduced in 1978
– Performance < 0.5 MIPS
• New 16-bit architecture
– "Assembly language" compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
– Based on the 8088, an 8-bit bus version of the 8086
Second Generation 1979-85
• Becoming "real" computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
– First 32-bit architecture, initial 16-bit implementation
– First flat 32-bit address, support for paging
– General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
– 68,000 transistors
– < 1 MIPS
• Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the "establishment"
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
• Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
– True from 1985 to present
• Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to higher clock rates
– More transistors: architectural ideas turn transistors into performance
– Responsible for about half the yearly performance growth
• Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase bandwidth
– CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching
• First multilevel cache: 1986
• Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: advances in caching techniques
– Reduce or hide cache miss latencies
• Early restart after cache miss (1992)
• Nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
• Prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
• Superscalar: uses dynamic issue decision (HW driven)
• VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
– Determine parallelism dynamically
– Execute instructions out of order
– Speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
– On-chip
– Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
• First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
• Implemented in 1991
– 1.3M transistors
– 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
– Instructions scheduled and executed out of order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in flight)
– Maintains precise state by completing instructions in order
• Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
• Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software Techniques
– Static scheduling
– Static issue (i.e. VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
• Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e. superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
• Wide variety of approaches, both hardware and compiler intensive
• Software techniques: lower hardware complexity, more longer-range analysis, more machine dependence
• Hardware techniques: more stable performance, higher complexity, potential clock rate impact
• No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: ILP Mountain (simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation) and Cache Mountain (simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching)]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistor count (log scale, 1,000 to 100,000,000) versus year, 1970-2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000, across the eras of bit-level, instruction-level and thread-level (?) parallelism]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
Support for 2-4 threads, but expect to get only a 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique to obtain more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model – continued
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
4 – 8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view concerning the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education (not just the undergraduate curriculum) organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
But, as we said earlier, engineering is changing
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These too are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that Web-based teaching, distance learning, electronic books and interactive learning environments will play increasingly significant roles in shaping what we teach, how we teach and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Solution
In 1987: the span 1977 to 1987 is 10 years, so both traits would have improved by a factor of (1.35)^10 = 20.1, giving a top speed of 3015 km/h and a fuel economy of 201 km/l
In 2000: thirteen more years elapse, this time at a 50% per year improvement rate, for a total factor of (1.5)^13 = 194.6 over the 1987 values. This gives a top speed of 586,719 km/h and a fuel economy of 39,114.6 km/l. This is fast enough to cover the distance from the earth to the moon in about 39 min and to make the one-way trip on less than 10 liters of gasoline
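The arithmetic of this solution can be checked with a short Python sketch. The 1977 starting values of 150 km/h and 10 km/l are not stated on the slide; they are inferred here by dividing the 1987 results by the factor 20.1. The earth-moon distance of 384,400 km is likewise an assumed round figure, and the growth factors are rounded to one decimal place as on the slide.

```python
EARTH_MOON_KM = 384_400.0   # assumed mean distance, not from the slide
SPEED_1977_KMH = 150.0      # inferred: 3015 / 20.1
ECONOMY_1977_KML = 10.0     # inferred: 201 / 20.1


def growth_factor(rate_per_year, years):
    """Compound improvement factor, rounded to one decimal as on the slide."""
    return round((1.0 + rate_per_year) ** years, 1)


f_1987 = growth_factor(0.35, 10)   # 35% per year, 1977 to 1987
f_2000 = growth_factor(0.50, 13)   # 50% per year, 1987 to 2000

speed_1987 = SPEED_1977_KMH * f_1987
economy_1987 = ECONOMY_1977_KML * f_1987
speed_2000 = speed_1987 * f_2000
economy_2000 = economy_1987 * f_2000

minutes_to_moon = EARTH_MOON_KM / speed_2000 * 60.0
liters_one_way = EARTH_MOON_KM / economy_2000

if __name__ == "__main__":
    print(f"1987: {speed_1987:.0f} km/h, {economy_1987:.0f} km/l")
    print(f"2000: {speed_2000:.0f} km/h, {economy_2000:.1f} km/l")
    print(f"moon: {minutes_to_moon:.1f} min, {liters_one_way:.1f} l one way")
```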
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm
Ideally, each new process technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7
With such scaling, typical improvement figures are the following:
• 1.4 – 1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
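A first-order check of these figures can be sketched in Python. The sketch assumes that transistor area scales with the square of the linear factor, that load capacitance scales linearly with it, and that the switching energy per event goes as C · V²; these are textbook approximations, not statements from the slides. Under these assumptions the area halves and the switching energy drops by a factor of about 2.6, the same order as the quoted factor of three.

```python
def scaled_generation(linear_scale=0.7, voltage_reduction=1.35):
    """First-order scaling estimates for one process generation.

    Assumptions (illustrative): area ~ s^2, capacitance ~ s,
    switching energy per event ~ C * V^2.
    """
    area_factor = linear_scale ** 2
    capacitance_factor = linear_scale
    voltage_factor = 1.0 / voltage_reduction
    energy_factor = capacitance_factor * voltage_factor ** 2
    return {
        "area_factor": area_factor,
        "voltage_factor": voltage_factor,
        "switching_energy_factor": energy_factor,
    }


if __name__ == "__main__":
    result = scaled_generation()
    print(f"transistors are {1 / result['area_factor']:.2f}x smaller")
    print(f"switching energy is {1 / result['switching_energy_factor']:.2f}x lower")
```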
Process Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and in integration (from ~10K to ~10M devices) in 25 years
Microarchitecture attempts to increase both IPC and frequency
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction and out-of-order execution can increase the number of instructions per cycle (IPC)
Pipelining, as a microarchitecture idea, helps to increase frequency
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages
With frequencies higher than 1 GHz, more than 20 pipeline stages are used
[Figure: relative improvement of frequency, CPI, performance and power as a function of pipeline depth]
Performance of memory and CPU
Memory in a computer system is hierarchically organized
In 1980 microprocessors were often designed without caches
Nowadays microprocessors often come with two levels of caches
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed
Relative processor/memory speed
Type of Memories
MOS memories
– RAMs (volatile: contents lost at power off): DRAM, SRAM
– ROMs (nonvolatile: contents kept at power off): ROM, EPROM, EEPROM, FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 – 4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
An anecdote – continued
If a CPU chip today needs to move 2 GBytes/s (say 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns
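The multiplication in this anecdote can be spelled out in a few lines of Python. The sketch only restates the numbers quoted above; the helper name is illustrative.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer, period_ns):
    """Pin bandwidth in GBytes/s: one byte per ns equals 10^9 bytes per s."""
    return bytes_per_transfer / period_ns


today = bandwidth_gbytes_per_s(16, 8)       # 16 bytes every 8 ns

# Twice the clock rate, twice the issue width, two instruction streams.
future = today * 2 * 2 * 2

# The same requirement expressed as 32 bytes every 2 ns.
future_as_transfers = bandwidth_gbytes_per_s(32, 2)

if __name__ == "__main__":
    print(f"today: {today} GBytes/s, future: {future} GBytes/s")
```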
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~32 – 128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
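These figures can be combined into an average memory access time. The Python sketch below uses the standard two-level formula and picks one point from each quoted range (2 cycles for the first level, 8 cycles for the second) together with a main-memory latency of 100 cycles; the specific values chosen within the ranges are assumptions for illustration.

```python
def average_access_cycles(l1_cycles, l1_hit_rate,
                          l2_cycles, l2_hit_rate, memory_cycles):
    """Average memory access time for a two-level cache hierarchy.

    Every access pays the L1 time; L1 misses additionally pay the L2 time,
    and L2 misses additionally pay the main-memory time.
    """
    l1_miss_penalty = l2_cycles + (1.0 - l2_hit_rate) * memory_cycles
    return l1_cycles + (1.0 - l1_hit_rate) * l1_miss_penalty


if __name__ == "__main__":
    with_l2 = average_access_cycles(2, 0.95, 8, 0.50, 100)
    # Same L1, but every L1 miss goes straight to main memory.
    without_l2 = average_access_cycles(2, 0.95, 0, 0.0, 100)
    print(f"with L2: {with_l2:.2f} cycles, without L2: {without_l2:.2f} cycles")
```

Even with a 95% first-level hit rate, the rare misses dominate the average, which is why the second level pays off.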
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains those portions of memory that are most likely to be used in the near future of program execution
As conclusion concerning memory – problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems: caches, busses, bandwidth and latency
As conclusion concerning memory – problems continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture-level and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
The system-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width (the data access granularity) can effect a 15% performance change
c) Magnetic RAM (MRAM), a new type of memory
Magnetic RAM – MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter on a silicon wafer
MRAM is nonvolatile and has the fast read-and-write performance of static RAM (SRAM)
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM)
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] (100 nm, 90 nm, 70 nm, 55 nm) and density [Gb] (log scale, 0.1 to 100) for DRAM and FLASH (SLC, MLC with 2 bits/cell, MLC with 3 bits/cell) in 2003, 2005 and 2010; densities range from 512 Mb to 16 Gb]
Surpassing the Prediction from Moore's Law – DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year
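The difference between the two doubling periods compounds quickly, as the Python sketch below shows. It reads the Moore's Law doubling period as 1.5 years, which is an interpretation of the extracted figure, and uses a starting density of 1 Gb and a six-year horizon as illustrative values.

```python
def density_after(initial_gb, years, doubling_period_years):
    """Density after a number of years, given a fixed doubling period."""
    return initial_gb * 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    start_gb, horizon = 1.0, 6
    moore = density_after(start_gb, horizon, 1.5)   # doubling every 1.5 years
    flash = density_after(start_gb, horizon, 1.0)   # doubling every year
    print(f"after {horizon} years: {moore:.0f} Gb vs {flash:.0f} Gb")
```

Over six years the yearly doubling gives four times the density of the 1.5-year doubling.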
[Figure: density [Gb] (log scale, 0.1 to 100) versus year, 2000-2010, for DRAM (Moore's Law) and FLASH (2-fold density per year; MLC with 2 bits/cell and 3 bits/cell)]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap – continued
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh
Power consumption
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers" (Kuroda-Sakurai, '95)
"By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced"
Power dissipation in time
[Figure: power (W, log scale, 0.01 to 100) versus year (1980 to 1995), growing x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage [V], power per chip [W] and VDD current [A] versus year, 1998-2014]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA) and the Taiwan Semiconductor Industry Association (TSIA) (taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 – capacitive switching power (dynamic, dominant)
P2 – short-circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
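The individual terms can be evaluated with a small Python sketch. The function follows the expression P_avg = p_t · C_L · Vdd² · f_clk + I_sc · Vdd + I_leakage · Vdd + P4; all numeric inputs below are hypothetical and chosen only to show that the switching term scales with the square of the supply voltage while the current terms scale linearly.

```python
def power_components(p_t, c_load, v_dd, f_clk, i_sc, i_leakage, p_static=0.0):
    """Return the four power terms and their sum, all in watts."""
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leakage * v_dd                   # leakage current power
    p4 = p_static                           # static power dissipation
    return {"P1": p1, "P2": p2, "P3": p3, "P4": p4,
            "P_avg": p1 + p2 + p3 + p4}


if __name__ == "__main__":
    # Hypothetical operating point: 1 nF switched load, 100 MHz clock.
    params = dict(p_t=0.1, c_load=1e-9, f_clk=1e8, i_sc=1e-3, i_leakage=1e-4)
    full = power_components(v_dd=2.0, **params)
    half = power_components(v_dd=1.0, **params)
    print(f"P1 at 2.0 V: {full['P1']:.4f} W, at 1.0 V: {half['P1']:.4f} W")
    print(f"P_avg at 2.0 V: {full['P_avg']:.4f} W, at 1.0 V: {half['P_avg']:.4f} W")
```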
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], 0.6 V to 5.0 V]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput, low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is to decrease the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product pt·CL is called the average switched capacitance per cycle. It can be reduced at the system, architectural, RTL, circuit, or technology level.
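The trade-off in point b) can be sketched numerically. The quadratic dependence of power on Vdd is taken from the text; the delay model below (delay proportional to Vdd / (Vdd - Vt)^alpha) is a commonly used first-order approximation that is assumed here and does not appear in the slides, and the threshold voltage and exponent are hypothetical values.

```python
def relative_switching_power(v_dd, v_ref):
    """Switching power relative to the reference supply, using P ~ Vdd^2."""
    return (v_dd / v_ref) ** 2


def relative_gate_delay(v_dd, v_ref, v_t=0.6, alpha=2.0):
    """Gate delay relative to the reference supply.

    Assumed model (not from the slides): delay ~ Vdd / (Vdd - Vt)^alpha.
    v_t and alpha are hypothetical device parameters.
    """
    if min(v_dd, v_ref) <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")

    def delay(v):
        return v / (v - v_t) ** alpha

    return delay(v_dd) / delay(v_ref)


if __name__ == "__main__":
    # Compare a few supply voltages against a 5 V reference.
    for v in (5.0, 3.3, 2.5):
        print(v, relative_switching_power(v, 5.0), relative_gate_delay(v, 5.0))
```

Under these assumptions, halving the supply from 5 V to 2.5 V cuts switching power to a quarter while the gate delay more than doubles, which is the tension the text describes.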
General Approaches to Reduce Power
Low Power and Low Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology:
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
[Arrow across the levels: higher impact, more options]
Multiple Frequency on the Chip as a Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention.
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL generating fCLK feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), which produce the local clock frequencies]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation (phase detector, current pump, loop filter, VCO, divider by N, with CLKREF and CLKFB inputs, regulated voltage, clock distribution and frequency multiplier) and DLL-based clock generation (phase detector, current pump, loop filter, VCDL with delay cells DC1 to DCn producing f1, 2f1, ..., nf1), each driving a digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating (enable and clock combined through a latch to form the gated clock that drives the target flip-flops) and clock distribution from a PLL clock generator to modules A, B, C, and D with enable signals Enable_A, Enable_B, and Enable_C; modules are shown as activated or deactivated]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
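The effect of gating can be illustrated with a small behavioral model. This is only a sketch in Python, assuming a latch-based gate in which the enable is sampled while the clock is low; the waveforms are hypothetical, and a real design would be written in an HDL.

```python
def gated_clock(clock, enable):
    """Latch-based clock gate: the enable is captured while the clock is low
    and then ANDed with the clock, so the output never changes mid-pulse."""
    latched = 0
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en  # the latch is transparent while the clock is low
        out.append(clk & latched)
    return out


def rising_edges(signal):
    """Count 0 -> 1 transitions, a proxy for clock-driven switching activity."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)


# Hypothetical waveforms: 8 clock cycles, module idle during the middle 4.
clock = [0, 1] * 8
enable = [1] * 4 + [0] * 8 + [1] * 4
gated = gated_clock(clock, enable)

print("free-running clock edges:", rising_edges(clock))
print("gated clock edges:", rising_edges(gated))
```

In this example the gated module sees half as many clock edges as it would with a free-running clock, so the flip-flops in the idle module do not toggle during the disabled interval.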
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip is a less aggressive approach that is attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
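A minimal sketch of the energy argument, assuming switching energy proportional to C·V² per module. The module capacitances, the critical-path flags, and the two voltage levels are hypothetical, and the cost of level converters between voltage domains is ignored.

```python
def switching_energy(modules, v_high, v_low=None):
    """Sum C * V^2 over modules.

    Each module is (capacitance_in_farads, on_critical_path). Critical modules
    always use v_high; the others use v_low when a second supply is available.
    """
    total = 0.0
    for capacitance, critical in modules:
        v = v_high if (critical or v_low is None) else v_low
        total += capacitance * v ** 2
    return total


# Hypothetical design: one critical module and two noncritical ones.
modules = [(2e-12, True), (3e-12, False), (5e-12, False)]

single_supply = switching_energy(modules, v_high=2.0)
dual_supply = switching_energy(modules, v_high=2.0, v_low=1.0)
saving = 1.0 - dual_supply / single_supply

print(f"energy saving with two supplies: {saving:.0%}")
```

The critical module keeps its timing because its supply is unchanged; all of the saving comes from the noncritical modules.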
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
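[Figure: power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW), and SLEEP (P = 0.16 mW); transitions are labeled "wait for interrupt" and "wait for wake-up event", with transition times of ~10 µs, ~90 µs, and 160 ms]
[Figure: power manager, consisting of an observer and a controller, that receives observations and workload information from the system and issues commands to it]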
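To make the benefit of the power state machine concrete, here is a small Python sketch that computes average power for a given schedule of states. The state powers are read from the figure labels (taking the sleep label as 0.16 mW is an assumption about a stripped decimal point), the schedule durations are hypothetical, and transition time and energy are ignored for simplicity.

```python
# State powers in milliwatts, as read from the figure (0.16 mW is assumed).
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(schedule):
    """Average power for a schedule of (state, seconds) pairs.

    Transition costs between states are ignored in this sketch.
    """
    total_time = sum(seconds for _, seconds in schedule)
    energy_mj = sum(POWER_MW[state] * seconds for state, seconds in schedule)
    return energy_mj / total_time


# Hypothetical workload: busy 20% of the time.
always_on = average_power_mw([("RUN", 10.0)])
managed = average_power_mw([("RUN", 2.0), ("IDLE", 3.0), ("SLEEP", 5.0)])

print(f"always running: {always_on:.2f} mW, managed: {managed:.2f} mW")
```

A real power manager also has to weigh the wake-up latency of each state against the expected idle time, which this sketch leaves out.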
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown across clock, memory, control/I/O, and datapath (slice values 12, 16, 25, and 43). Dynamic instruction statistics across compare, logical, data move, control flow, arithmetic, and other operations.]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First generation, 1971-78
  - Behind the power curve
• Second generation, 1979-85
  - Becoming "real" computers
• Third generation, 1985-89
  - Challenging the "establishment"
• Fourth generation, 1990-
  - Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation, 1971-78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3,500 transistors
  - First microprocessor-based computer (Micral)
    - Targeted at laboratory instrumentation
    - Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
  - Delivered in 1974
  - 4,800 transistors
  - Performance < 0.2 MIPS
• Used in Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    - $297, or $395 with case
    - 256 bytes of memory, expandable to 64K
    - Keyboard and floppy
    - 100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
    - Homebrew Computer Club
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981, IBM introduces the PC
  - Based on the 8088, an 8-bit bus version of the 8086
Second Generation, 1979-85
Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture
    - Initial 16-bit implementation
  - First flat 32-bit address
    - Support for paging
  - General-purpose register architecture
    - Loosely based on PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation, 1985-89
Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation, 1990-
Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985 to present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors
• Architectural ideas turn transistors into performance
  - Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
  - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches add another level of caching
    - First multilevel cache: 1986
    - Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
    - Early restart after cache miss (1992)
    - Nonblocking caches: continue during a cache miss (1994)
  - Cache-aware combos: computers, compilers, code writers
    - Prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock
    - Superscalar: uses dynamic issue decision (HW driven)
    - VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  - Determine parallelism dynamically
  - Execute instructions out of order
  - Speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  - On-chip
  - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as attached processor (e.g., graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out of order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in flight)
  - Maintains precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Uses compiler-centric approach while avoiding disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques
  - Static scheduling
  - Static issue (i.e., VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
  Trade-offs: lower hardware complexity, more longer-range analysis, more machine dependence
• Hardware techniques
  - Dynamic scheduling
  - Dynamic issue (i.e., superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
  Trade-offs: more stable performance, higher complexity, potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture - ILP and Memory Systems
[Figure: "ILP Mountain" with stages simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, and speculation; "Cache Mountain" with stages simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, and multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers, and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistor count (log scale, 1,000 to 100,000,000) versus year (1970 to 2005), with eras of bit-level, instruction-level, and thread-level parallelism; data points i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]
Microprocessors: where they go
Intel: More Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The more pronounced are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
They support 2-4 threads, but expect to get only a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It is often said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global cultural and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Feature size versus time in silicon ICs
The semiconductor industry itself has developed a "roadmap" based on the idea of Moore's law.
The National Technology Roadmap for Semiconductors (NTRS) and, most recently, the International Technology Roadmap for Semiconductors (ITRS) now extend the device scaling and increased functionality scenario to the year 2014, at which point minimum feature sizes are projected to be 35 nm and chips with > 10^11 components are expected to be available.
Trends in feature size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm.
Ideally, processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7.
With such scaling, typical improvement figures are the following:
• 1.4-1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
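These figures can be roughly cross-checked with first-order scaling arithmetic. The sketch below takes the scale factor as 0.7 and the voltage reduction as 1.35x, and assumes (these assumptions are not stated in the slides) that transistor area scales with the square of the linear dimension, that gate delay and capacitance scale linearly with it, and that switching energy is proportional to C·V².

```python
def scaling_summary(s=0.7, voltage_reduction=1.35):
    """First-order effect of shrinking all dimensions by a factor s.

    Assumed relations: area ~ s^2, delay ~ s, capacitance ~ s,
    switching energy ~ C * V^2.
    """
    area = s ** 2
    delay = s
    capacitance = s
    voltage = 1.0 / voltage_reduction
    energy = capacitance * voltage ** 2
    return {
        "density_gain": 1.0 / area,        # how many times smaller a transistor is
        "speed_gain": 1.0 / delay,         # how many times faster a transistor is
        "energy_reduction": 1.0 / energy,  # how many times lower the switching energy is
    }


summary = scaling_summary()
for name, value in summary.items():
    print(f"{name}: {value:.2f}x")
```

Under these assumptions the density and speed gains come out near 2x and 1.4x, and the switching energy per transition drops by about 2.6x, which is in the neighborhood of the "three times" quoted above.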
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry.
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and integration (from ~10K to 1M devices) in 25 years.
Microarchitecture attempts to increase both IPC and frequency.
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction, and out-of-order execution can increase instructions per cycle (IPC).
Pipelining, as a microarchitecture idea, helps to increase frequency.
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program.
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages.
With frequencies higher than 1 GHz, more than 20 pipeline stages are used.
[Figure: relative improvement (0 to 20) versus pipeline depth for frequency, CPI, performance, and power]
Performance of memory and CPU
Memory in a computer system is hierarchically organized.
In 1980, microprocessors were often designed without caches.
Nowadays microprocessors often come with two levels of caches.
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987.
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed.
Relative processor/memory speed
Types of Memories
[Figure: classification of MOS memories. RAMs (DRAM, SRAM) are volatile: contents are lost at power off. ROMs (ROM, EPROM, EEPROM, FLASH) are non-volatile: contents are kept at power off.]
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2-4.5 CPU cycles per instruction retired.
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed.
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse.
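The "three out of every four cycles" claim follows from the measured cycles per instruction. As a short sketch, reading the measured range as 4.2 to 4.5 cycles per instruction: every cycle that retires anything retires at least one instruction, so at most 1/CPI of all cycles can retire instructions.

```python
def min_zero_retire_fraction(cpi):
    """Lower bound on the fraction of cycles that retire no instruction.

    Each retiring cycle retires at least one instruction, so the number of
    retiring cycles cannot exceed the instruction count, which is 1/CPI of
    the total cycle count.
    """
    if cpi < 1.0:
        raise ValueError("bound only applies when CPI >= 1")
    return 1.0 - 1.0 / cpi


low = min_zero_retire_fraction(4.2)
high = min_zero_retire_fraction(4.5)
print(f"at least {low:.1%} to {high:.1%} of cycles retire nothing")
```

Both ends of the range land slightly above three quarters, consistent with the statement in the text.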
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams.
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy.
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns.
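The arithmetic behind these bandwidth figures can be checked directly. The sketch below reads the quoted rates as 16 bytes every 8 ns today and 32 bytes every 2 ns in the future, and uses the fact that bytes per nanosecond equals gigabytes per second (with 1 GByte taken as 10^9 bytes).

```python
def pin_bandwidth_gb_per_s(bytes_per_transfer, period_ns):
    """Bandwidth in GBytes/s; one byte per nanosecond is 10^9 bytes per second."""
    return bytes_per_transfer / period_ns


today = pin_bandwidth_gb_per_s(16, 8)

# Twice the clock rate, twice the issue width, and two instruction streams.
future = today * 2 * 2 * 2

# The same requirement expressed as a transfer size and period.
future_check = pin_bandwidth_gb_per_s(32, 2)

print(today, future, future_check)
```

The three doublings multiply to a factor of eight, which is why the requirement grows from 2 to 16 GBytes/s.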
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance.
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components.
Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data.
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory.
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip.
The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses.
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.
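To see what these hit rates mean for performance, the following sketch applies the common textbook average-memory-access-time formula, which is an assumption here and not part of the slides. It reads the hit figures as 95% for the first level and 50% of first-level misses for the second, treats the second-level and main-memory latencies as additive miss penalties, and takes the lower ends of the quoted latency ranges together with a main memory latency of about 100 cycles.

```python
def average_access_time(t_l1, l1_hit, t_l2=None, l2_hit=0.0, t_mem=100.0):
    """Average memory access time in cycles.

    AMAT = t_l1 + (L1 miss rate) * (miss penalty), where the penalty is the
    memory latency alone, or t_l2 + (L2 miss rate) * t_mem with a second level.
    """
    l1_miss = 1.0 - l1_hit
    if t_l2 is None:
        return t_l1 + l1_miss * t_mem
    l2_miss = 1.0 - l2_hit
    return t_l1 + l1_miss * (t_l2 + l2_miss * t_mem)


one_level = average_access_time(t_l1=2, l1_hit=0.95)
two_level = average_access_time(t_l1=2, l1_hit=0.95, t_l2=6, l2_hit=0.5)

print(f"L1 only: {one_level:.1f} cycles, L1 + L2: {two_level:.1f} cycles")
```

Even with a 95% first-level hit rate, the rare trips to main memory account for most of the average latency, which is why the second level pays off.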
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles.
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
As conclusion concerning memory - problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems: caches, buses, bandwidth, and latency.
As conclusion concerning memory - problems, continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change.
b) Burst width, which refers to data access granularity, can effect a 15% performance change.
c) Magnetic RAM (MRAM): a new type of memory.
Magnetic RAM - MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes.
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
The 10 Gbit devices consist of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter on a silicon wafer.
MRAM is nonvolatile; it has the fast read-and-write performance of static RAM (SRAM).
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] (0 to 120) and density [Gb] (log scale, 0.1 to 100) versus year (2003, 2005, 2010) for DRAM and FLASH; design rules of 100 nm, 90 nm, 70 nm, and 55 nm; density labels from 512 Mb up to 16 Gb; FLASH shown as SLC and as MLC with 2 and 3 bits/cell]
Surpassing the Prediction from Moore's Law - DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates the doubling of NAND Flash memory density every year.
[Figure: density [Gb] (log scale, 0.1 to 100) versus year (2000 to 2010) for DRAM and FLASH (MLC with 2 and 3 bits/cell), compared with the Moore's Law trend; FLASH follows a two-fold density increase per year]
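The difference between the two growth models compounds quickly. As a small sketch, reading the Moore's Law doubling period as 1.5 years and the Flash doubling period as one year, and picking a ten-year horizon purely for illustration:

```python
def density_growth(years, doubling_period_years):
    """Growth factor after `years` when density doubles every period."""
    return 2 ** (years / doubling_period_years)


moore = density_growth(10, 1.5)  # doubling every 1.5 years
flash = density_growth(10, 1.0)  # doubling every year

print(f"Moore's Law: {moore:.0f}x, yearly doubling: {flash:.0f}x")
```

Over ten years the yearly-doubling model ends up about ten times ahead of the 1.5-year model, which is the gap the chart is meant to convey.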
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will keep leading the overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore's Law.
Overall memory prediction roadmap - continued
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
Power consumption
• During 1995 the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh.
• During 2000 the energy consumption of all PCs installed in the USA was 10% of the total energy production.
• For 2015 one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh.
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai 95)
"By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: power (W), log scale 0.01 to 100, versus year (80, 85, 90, 95); trend annotation: x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage [V] (axis 0 to 2.5), power per chip [W] and VDD current [A] (secondary axes up to 200 and 500) versus year, 1998 to 2014; curves: Current, Power, Voltage]
International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA) and the Taiwan Semiconductor Industry Association (TSIA).
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: Your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are:
$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$
where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
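The terms of the power equation can be evaluated separately to see which one dominates at a given operating point. The following is a minimal sketch, not a calibrated model: the class name `CmosPower`, its field names and all numeric values are hypothetical and chosen only to exercise the formula.

```python
from dataclasses import dataclass


@dataclass
class CmosPower:
    p_t: float              # switching activity factor (dimensionless)
    c_load: float           # load capacitance C_L in farads
    v_dd: float             # supply voltage in volts
    f_clk: float            # clock frequency in hertz
    i_sc: float             # average short-circuit current in amperes
    i_leakage: float        # leakage current in amperes
    p_static: float = 0.0   # P4, minor static dissipation in watts

    def switching(self) -> float:
        """P1: capacitive switching power."""
        return self.p_t * self.c_load * self.v_dd ** 2 * self.f_clk

    def short_circuit(self) -> float:
        """P2: short-circuit power."""
        return self.i_sc * self.v_dd

    def leakage(self) -> float:
        """P3: leakage power."""
        return self.i_leakage * self.v_dd

    def average(self) -> float:
        """Sum of P1, P2, P3 and P4."""
        return self.switching() + self.short_circuit() + self.leakage() + self.p_static


# Hypothetical operating point, not taken from the slides.
example = CmosPower(p_t=0.5, c_load=1e-9, v_dd=2.0, f_clk=100e6,
                    i_sc=0.01, i_leakage=0.001)

if __name__ == "__main__":
    print(f"P1 = {example.switching():.3f} W")
    print(f"P2 = {example.short_circuit():.3f} W")
    print(f"P3 = {example.leakage():.3f} W")
    print(f"P_avg = {example.average():.3f} W")
```

With these made-up values the switching term accounts for roughly 90% of the total, which matches the slide's remark that P1 is dominant.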
Research Efforts in Low-Power Design
$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement.
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed.
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], with marks at 0.6, 3.0 and 5.0 V]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is to decrease the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
General Approaches to Reduce Power
a) Reduction in $f_{CLK}$ is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in $V_{dd}$ is the most effective way to reduce power, since power is proportional to the square of $V_{dd}$. The problem with reducing $V_{dd}$ is that it leads to an increase in circuit delay.
c) The product $p_t C_L$ is called the average switched capacitance per cycle. It can be reduced at the system, architectural, RTL, circuit or technology level.
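The trade-off in point b) can be made concrete with a small calculation. The quadratic power dependence comes from the slides; the delay model, however, is an assumption introduced here: a common first-order approximation in which gate delay is proportional to V_dd / (V_dd - V_t)^2. The threshold voltage of 0.7 V and the function names are hypothetical.

```python
def relative_power(v_dd: float, v_ref: float) -> float:
    """Switching power at v_dd relative to v_ref (power ~ V_dd squared)."""
    return (v_dd / v_ref) ** 2


def relative_delay(v_dd: float, v_ref: float, v_t: float = 0.7) -> float:
    """Gate delay at v_dd relative to v_ref.

    Assumed first-order model (not from the slides):
    delay ~ V_dd / (V_dd - V_t)^2, with a hypothetical V_t of 0.7 V.
    """
    if v_dd <= v_t or v_ref <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")

    def delay(v: float) -> float:
        return v / (v - v_t) ** 2

    return delay(v_dd) / delay(v_ref)


if __name__ == "__main__":
    for v in (5.0, 3.3, 2.5):
        print(f"Vdd = {v} V: power x{relative_power(v, 5.0):.2f}, "
              f"delay x{relative_delay(v, 5.0):.2f}")
```

Under this assumed model, halving the supply from 5.0 V to 2.5 V cuts switching power to a quarter while the gate delay grows to nearly three times its original value, which is why voltage reduction is usually combined with parallelism or pipelining.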
Low Power and Low Energy System Design
The design of low power circuits can be tackled at different levels, from system to technology (arrow label in the figure: higher impact, more options).
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that attracts more attention.
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four local clock generators (PLL 1/DLL 1 to PLL 4/DLL 4), which produce the local frequencies f11, f12, f21, f31, f32 and f41]
Energy Minimization Using Multiple Frequency
[Figure, PLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO and divider by N; regulated voltage; clock distribution & frequency multiplier driving the logic of the digital system]
[Figure, DLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter and control of a voltage-controlled delay line (VCDL, delay TVCDL) built from delay cells DC1, DC2, ..., DCn between in and out; outputs f1, 2f1, ..., nf1 to the digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure, clock gating: a latch samples enable and combines it with clock to produce gated-clock for the target flip-flops (DFF with D, C, Q); waveforms of clock, enable and gated-clock show activated and deactivated intervals]
[Figure, clock distribution: a PLL (Clk generator) distributes Clk to modules A, B, C and D, gated by Enable_A, Enable_B and Enable_C]
The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
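The effect of clock gating can be sketched at cycle granularity: a module receives a clock pulse only in cycles where its enable signal is asserted. This is a behavioural illustration, not a circuit model; the function names and the enable pattern are hypothetical, and clock energy is assumed to be proportional to the number of pulses delivered.

```python
from typing import List


def gated_clock(enable: List[int]) -> List[int]:
    """Per-cycle gated clock: 1 if a pulse reaches the module, else 0."""
    return [1 if en else 0 for en in enable]


def clock_energy_saving(enable: List[int]) -> float:
    """Fraction of clock pulses suppressed, used here as a proxy for
    the clock energy saved in the gated module (an assumption)."""
    pulses = gated_clock(enable)
    return 1.0 - sum(pulses) / len(pulses)


if __name__ == "__main__":
    # Hypothetical activity pattern: the module is needed in 3 of 8 cycles.
    enable = [1, 1, 0, 0, 0, 1, 0, 0]
    print(gated_clock(enable))
    print(f"clock pulses suppressed: {clock_energy_saving(enable):.1%}")
```

In hardware the enable is typically captured by a latch while the clock is low, so that the gated clock cannot glitch; the list comprehension above abstracts that detail away.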
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip, as a less aggressive approach, is attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
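The idea of assigning the high voltage only to critical-path modules can be sketched as follows. This is a toy illustration under stated assumptions: the module names, capacitance weights and the two voltage levels are hypothetical, switching energy is taken as proportional to C times V squared, and the cost of level converters between voltage domains is ignored.

```python
from typing import Dict


def assign_supply(on_critical_path: Dict[str, bool],
                  v_high: float, v_low: float) -> Dict[str, float]:
    """Give critical-path modules v_high and all other modules v_low."""
    return {name: (v_high if critical else v_low)
            for name, critical in on_critical_path.items()}


def energy_ratio(assignment: Dict[str, float],
                 capacitance: Dict[str, float], v_high: float) -> float:
    """Switching energy with multiple supplies relative to running
    every module at v_high (energy ~ C * V^2 per module)."""
    single = sum(c * v_high ** 2 for c in capacitance.values())
    multi = sum(capacitance[name] * v ** 2 for name, v in assignment.items())
    return multi / single


if __name__ == "__main__":
    # Hypothetical design: only the ALU lies on the critical path.
    critical = {"alu": True, "decoder": False, "io": False}
    caps = {"alu": 2.0, "decoder": 1.0, "io": 1.0}   # relative switched capacitance
    volts = assign_supply(critical, v_high=3.3, v_low=2.5)
    print(volts)
    print(f"energy relative to single supply: {energy_ratio(volts, caps, 3.3):.2f}")
```

For this made-up design the two-voltage assignment uses about 79% of the single-supply switching energy without slowing the critical path.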
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
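[Figure, power state machine: states RUN (P = 400 mW), IDLE (P = 50 mW) and SLEEP (P = 0.16 mW); transition times of about 10 µs, about 90 µs and 160 ms; transition labels "wait for interrupt" and "wait for wake-up event"]
[Figure, power manager: an OBSERVER and a CONTROLLER inside the Power Manager; it receives workload information and observations from the SYSTEM and sends commands to it]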
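A power manager of this kind has to decide, for each idle period, whether entering SLEEP is worth the wake-up cost. The sketch below uses the state powers from the figure (400 mW, 50 mW, 0.16 mW); everything else is an assumption made for illustration: that the 160 ms figure is the SLEEP-to-RUN wake-up time, that the wake-up is spent at RUN power, that the microsecond-scale transitions are negligible, and that the idle length is known in advance. All names are hypothetical.

```python
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}
WAKEUP_FROM_SLEEP_S = 0.160   # assumed: the 160 ms arc is SLEEP -> RUN


def idle_energy_mj(idle_s: float, state: str) -> float:
    """Energy in mJ (mW x s) spent over one idle period in `state`."""
    if state == "SLEEP":
        # Sleep through the idle period, then pay the wake-up at RUN power.
        return POWER_MW["SLEEP"] * idle_s + POWER_MW["RUN"] * WAKEUP_FROM_SLEEP_S
    if state == "IDLE":
        return POWER_MW["IDLE"] * idle_s
    raise ValueError(f"unknown low-power state: {state}")


def break_even_idle_s() -> float:
    """Idle length at which SLEEP and IDLE cost the same energy."""
    wake_cost = POWER_MW["RUN"] * WAKEUP_FROM_SLEEP_S
    return wake_cost / (POWER_MW["IDLE"] - POWER_MW["SLEEP"])


def choose_state(idle_s: float) -> str:
    """Pick the cheaper state for an idle period of known length."""
    return "SLEEP" if idle_s > break_even_idle_s() else "IDLE"


if __name__ == "__main__":
    print(f"break-even idle time: {break_even_idle_s():.2f} s")
    for idle in (0.5, 10.0):
        state = choose_state(idle)
        print(f"idle {idle} s -> {state}, {idle_energy_mj(idle, state):.1f} mJ")
```

A real power manager does not know the idle length beforehand and has to predict it from workload observations, which is the role of the observer in the figure.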
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure, power breakdown: pie chart with slices Clock, Memory, Control, I/O and Datapath; values shown: 12, 16, 25, 43]
[Figure, dynamic instruction statistics: pie chart with slices Compare op, Logical op, Others, Data Move, Control Flow and Arithmetic op; values shown: 15, 13, 5, 1, 23, 43]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First generation 1971-78: behind the power curve
• Second generation 1979-85: becoming "real" computers
• Third generation 1985-89: challenging the "establishment"
• Fourth generation 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation 1971-78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3500 transistors
  - First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
  - Delivered in 1974
  - 4800 transistors
  - Performance < 0.2 MIPS
• Used in Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
• Homebrew Computer Club
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
  - Based on 8088, the 8-bit bus version of 8086
Second Generation 1979-85
• Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture, initial 16-bit implementation
  - First flat 32-bit address, support for paging
  - General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985 to present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to higher clock rates
  - More transistors
• Architectural ideas turn transistors into performance
  - Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
  - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
    Early restart after cache miss (1992)
    Nonblocking caches: continue during a cache miss (1994)
  - Cache-aware combos: computers, compilers, code writers
    Prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock
    Superscalar: uses dynamic issue decision (HW driven)
    VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  - determine parallelism dynamically
  - execute instructions out-of-order
  - speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  - On-chip
  - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out-of-order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in-flight)
  - Maintains precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Uses compiler-centric approach while avoiding disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Wide variety of approaches, both hardware and compiler intensive
• Software techniques
  - Static scheduling
  - Static issue (i.e. VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
  Lower hardware complexity, more longer-range analysis, more machine dependence
• Hardware techniques
  - Dynamic scheduling
  - Dynamic issue (i.e. superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
  More stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present
Big Picture--ILP and Memory Systems
[Figure, "ILP Mountain": simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation]
[Figure, "Cache Mountain": simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance, log scale 0.1 to 100, versus year (1965 to 1995) for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip, log scale 1,000 to 100,000,000, versus year (1970 to 2005), divided into eras of bit-level, instruction-level and thread-level parallelism; data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The more pronounced are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP).
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4).
They support 2-4 threads, but one can expect only a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple and very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
• 4-8 general-purpose processing engines on chip, used to execute independent programs
  - Explicitly parallel programs (when possible)
  - Speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8 x 4096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit".
In this context, starting from our positions and experiences, this is our view concerning the theme "How shall we satisfy the long-term educational needs of engineers?"
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education - not just the undergraduate curriculum - organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change - so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. Broader knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These two are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn.
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Trends in future: size over time
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm.
Ideally, process technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7.
With such scaling, typical improvement figures are the following:
• 1.4-1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
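These improvement figures follow approximately from the 0.7 scaling factor, and the arithmetic can be checked with a few lines. The sketch assumes the scaling factor is 0.7 and the voltage reduction is 1.35 times (the slide figures with decimal points restored), and additionally assumes that switched capacitance scales linearly with dimension; the function name and dictionary keys are hypothetical.

```python
def scaled_quantities(s: float = 0.7, voltage_reduction: float = 1.35) -> dict:
    """Relative values after one process generation.

    Assumptions (for illustration only): every linear dimension scales
    by `s`, switched capacitance scales with linear dimension, and the
    supply voltage drops by `voltage_reduction`.
    """
    area = s * s                                # transistor area ~ s^2
    capacitance = s                             # assumed C ~ s
    voltage = 1.0 / voltage_reduction
    switching_energy = capacitance * voltage ** 2   # E ~ C * V^2
    return {
        "area": area,
        "speedup": 1.0 / s,                     # delay assumed ~ s
        "switching_energy": switching_energy,
    }


if __name__ == "__main__":
    q = scaled_quantities()
    print(f"area: x{q['area']:.2f} (about two times smaller)")
    print(f"speed: x{q['speedup']:.2f}")
    print(f"switching energy per transition: x{q['switching_energy']:.2f} "
          f"(about {1 / q['switching_energy']:.1f} times lower)")
```

Under these assumptions the area halves, speed improves by about 1.4 times and the energy per switching event drops by about 2.6 times, which is in line with the "three times lower" figure quoted on the slide.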
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry.
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and integration (from ~10K to 1M devices) in 25 years.
Microarchitecture attempts to increase both IPC and frequency.
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction and out-of-order execution can increase instructions per cycle (IPC).
Pipelining, as a microarchitecture idea, helps to increase frequency.
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program.
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages.
With frequencies higher than 1 GHz, more than 20 pipeline stages are used.
[Figure: relative improvement versus pipeline depth (0 to 20) for frequency, CPI, performance and power]
Performance of memory and CPU
Memory in a computer system is hierarchically organized.
In 1980 microprocessors were often designed without caches.
Nowadays microprocessors often come with two levels of caches.
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987.
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed.
Relative processor/memory speed
Type of Memories
MOS memories
• RAMs (volatile: contents lost at power off): DRAM, SRAM
• ROMs (nonvolatile: contents kept at power off): ROM, EPROM, EEPROM, FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2-4.5 CPU cycles per instruction retired.
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed.
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse.
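The "three out of four cycles" remark can be reproduced from the measured cycles-per-instruction figure. The sketch below reads the measurement as 4.2 to 4.5 CPI and makes a simplifying assumption that is not stated in the source: a cycle that retires anything retires exactly one instruction. Because both processors can retire more than one instruction per cycle, the real fraction of empty cycles would be at least this large. The function name is hypothetical.

```python
def empty_cycle_fraction(cpi: float) -> float:
    """Fraction of cycles retiring no instruction, assuming at most one
    instruction retires in any cycle (a simplifying assumption)."""
    if cpi < 1.0:
        raise ValueError("this model only covers CPI >= 1")
    return 1.0 - 1.0 / cpi


if __name__ == "__main__":
    for cpi in (4.2, 4.5):
        print(f"CPI {cpi}: {empty_cycle_fraction(cpi):.0%} of cycles retire nothing")
```

Both ends of the measured range give a little over three quarters, consistent with the statement in the text.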
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width and two instruction streams.
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy.
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns.
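The bandwidth figures in this anecdote are simple products and can be verified directly. In the sketch below, bytes per nanosecond is numerically equal to GBytes/s (taking 1 GByte as 10^9 bytes); the function name is hypothetical.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer: float, period_ns: float) -> float:
    """Pin bandwidth in GBytes/s; bytes per ns equals 10^9 bytes per second."""
    return bytes_per_transfer / period_ns


if __name__ == "__main__":
    today = bandwidth_gbytes_per_s(16, 8)    # 16 bytes every 8 ns
    # Twice the clock rate, twice the issue width, two instruction streams.
    future = today * 2 * 2 * 2
    print(f"today: {today} GBytes/s, future chip: {future} GBytes/s")
    print(f"32 bytes every 2 ns: {bandwidth_gbytes_per_s(32, 2)} GBytes/s")
```

The three doublings turn 2 GBytes/s into 16 GBytes/s, which is the same as moving 32 bytes every 2 ns.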
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance.
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components.
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data.
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory.
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip.
The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses.
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.
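The combined effect of the two cache levels can be estimated with the usual average memory access time calculation. The sketch below reads the hit rates as 95% for the first level and 50% of first-level misses for the second level, picks latencies inside the quoted ranges (2 and 8 cycles), and uses the roughly 100-cycle main memory access mentioned elsewhere in the slides. The function name and the specific values chosen within each range are assumptions.

```python
def average_access_cycles(l1_hit: float, l1_cycles: float,
                          l2_hit: float, l2_cycles: float,
                          memory_cycles: float) -> float:
    """Average memory access time in cycles for a two-level hierarchy.

    Every access pays the L1 latency; L1 misses also pay the L2 latency;
    L2 misses also pay the main-memory latency.
    """
    l1_miss = 1.0 - l1_hit
    l2_miss = 1.0 - l2_hit
    return l1_cycles + l1_miss * (l2_cycles + l2_miss * memory_cycles)


if __name__ == "__main__":
    with_l2 = average_access_cycles(0.95, 2, 0.50, 8, 100)
    # Without a second level, every L1 miss goes straight to main memory.
    without_l2 = average_access_cycles(0.95, 2, 0.0, 0, 100)
    print(f"with L2: {with_l2:.1f} cycles, without L2: {without_l2:.1f} cycles")
```

With these values the average access costs about 4.9 cycles, compared with 7 cycles if every first-level miss went to main memory.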
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles.
A cache miss that eventually has to go to main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
As conclusion concerning memory - problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems: caches, busses, bandwidth and latency.
As conclusion concerning memory - problems (continued)
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture-level and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change.
b) Burst width, which refers to data access granularity, can effect a 15% performance change.
c) Magnetic RAM (MRAM), a new type of memory.
Magnetic RAM – MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology.
A Nanotechnology RAM device consists of tiny carbon nanotubes.
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter on a silicon wafer.
MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM).
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM).
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] (0 to 120) and density [Gb] (logarithmic axis, 0.1 to 100) of DRAM and FLASH for 2003, 2005, and 2010. Design-rule labels: 100 nm, 90 nm, 70 nm, 55 nm. Density labels range from 0.512 Gb to 16 Gb; FLASH points are marked SLC, MLC (2 bits/cell), and MLC (3 bits/cell).]
Surpassing the Prediction from Moore's Law – DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
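To see how quickly the two growth models diverge, the sketch below compares density growth for a doubling period of 1.5 years (reading the Moore's Law figure on the slide as 1.5 years) against a doubling period of one year. The six-year horizon and the function name are illustrative assumptions.

```python
def growth_factor(years, doubling_period_years):
    """Density multiplier after `years` when density doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


horizon_years = 6  # illustrative horizon, not taken from the slides
moore_growth = growth_factor(horizon_years, 1.5)
flash_growth = growth_factor(horizon_years, 1.0)
print(f"doubling every 1.5 years: x{moore_growth:.0f} after {horizon_years} years")
print(f"doubling every year:      x{flash_growth:.0f} after {horizon_years} years")
```

Over six years the yearly-doubling model ends up four times ahead of the 1.5-year model.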
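[Figure: density [Gb] (logarithmic axis, 0.1 to 100) versus year, 2000 to 2010, for DRAM and FLASH against the Moore's Law line. FLASH follows a two-fold density increase per year, with points marked MLC (2 bits/cell) and MLC (3 bits/cell); labelled densities range from 0.512 Gb upward.]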
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will keep leading overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore's Law.
Overall memory prediction roadmap (continued)
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh.
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production.
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh.
Power consumption
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
“CMOS circuits dissipate little power by nature. So believed circuit designers.” (Kuroda-Sakurai, '95)
“By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced.”
Power dissipation in time
[Figure: power (W) on a logarithmic axis from 0.01 to 100 versus year, 1980 to 1995; the trend is annotated as ×4 every 3 years.]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage [V], power per chip [W], and VDD current [A] versus year, 1998 to 2014.]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA), and the Taiwan Semiconductor Industry Association (TSIA).
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
Your car starter
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 – capacitive switching power (dynamic, dominant)
P2 – short-circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
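The contribution of each term can be checked numerically. The sketch below evaluates the switching, short-circuit, and leakage terms for one set of made-up operating values; every number and name in it is a hypothetical assumption chosen for illustration, not data from the slides.

```python
def power_components(p_t, c_load, v_dd, f_clk, i_sc, i_leakage, p_static=0.0):
    """Return (P1, P2, P3, P4, total) in watts for a CMOS circuit.

    p_t is the switching activity factor, c_load the load capacitance in farads,
    v_dd the supply voltage in volts, f_clk the clock frequency in hertz,
    i_sc and i_leakage the short-circuit and leakage currents in amperes.
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leakage * v_dd                   # leakage power
    p4 = p_static                           # other static dissipation
    return p1, p2, p3, p4, p1 + p2 + p3 + p4


# Hypothetical operating point: 10% activity, 1 nF, 2 V, 100 MHz.
p1, p2, p3, p4, total = power_components(0.1, 1e-9, 2.0, 100e6, 1e-3, 1e-5)
print(f"P1={p1:.4f} W  P2={p2:.4f} W  P3={p3:.6f} W  total={total:.5f} W")
```

With these illustrative values the switching term dominates, which matches the slide's description of P1 as the dominant component.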
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement.
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed.
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], with supply-voltage ticks at 0.6, 3.0, and 5.0 V.]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is:
Decreasing the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product ptCL is called the average switched capacitance per cycle, and the main directions for reducing this capacitance lie at the system, architectural, RTL, circuit, or technology level.
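The difference between options a) and b) can be made concrete with the switching-power expression Psw = pt CL Vdd^2 fCLK. The sketch below uses hypothetical values (activity 0.2, 1 nF, 3 V, 100 MHz) and ignores the delay penalty of a lower Vdd; it only illustrates the scaling behaviour.

```python
def switching_power(p_t, c_load, v_dd, f_clk):
    """Dynamic switching power in watts: p_t * C_L * Vdd^2 * f_CLK."""
    return p_t * c_load * v_dd ** 2 * f_clk


def energy_per_cycle(p_t, c_load, v_dd):
    """Average switched energy per clock cycle in joules (independent of f_CLK)."""
    return p_t * c_load * v_dd ** 2


# Hypothetical baseline operating point.
base_power = switching_power(0.2, 1e-9, 3.0, 100e6)
half_clock_power = switching_power(0.2, 1e-9, 3.0, 50e6)
half_vdd_power = switching_power(0.2, 1e-9, 1.5, 100e6)

power_ratio_half_clock = half_clock_power / base_power
power_ratio_half_vdd = half_vdd_power / base_power
energy_ratio_half_vdd = energy_per_cycle(0.2, 1e-9, 1.5) / energy_per_cycle(0.2, 1e-9, 3.0)

print(f"halving f_CLK: power x{power_ratio_half_clock:.2f}, energy per cycle unchanged")
print(f"halving Vdd:   power x{power_ratio_half_vdd:.2f}, energy per cycle x{energy_ratio_half_vdd:.2f}")
```

Halving the clock halves the power but leaves the energy per cycle untouched, whereas halving the supply voltage cuts both to a quarter.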
General Approaches to Reduce Power
Low Power and Low Energy System Design
(levels listed from higher impact and more options downward)
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology.
Multiple Frequency on the Chip as Technique to Reduce Power
Multiple frequency on the chip is a less aggressive approach that attracts more attention.
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL distributes fCLK to four local clock generators (PLL 1/DLL 1 to PLL 4/DLL 4), each producing its own local frequencies (f11, f12, f21, f31, f32, f41).]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation (CLKREF and CLKFB inputs, phase detector with up/down outputs, current pump, loop filter, VCO, divider by N, regulated voltage, clock distribution & frequency multiplier, digital system).]
[Figure: DLL-based clock generation (CLKREF and CLKFB inputs, phase detector with up/down outputs, current pump, loop filter, control, VCDL with delay TVCDL and stages DC1, DC2, ... DCn, logic producing f1, 2f1, ... nf1, digital system).]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating. An enable signal passes through a latch (or DFF) and is combined with the clock to produce a gated clock for the target flip-flops, which are either activated or deactivated.]
[Figure: clock distribution. A PLL (clock generator) drives Clk to modules A, B, C, and D, with Enable_A, Enable_B, and Enable_C gating signals.]
The use of gated clocks is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
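The effect of suppressing the clock can be sketched with a toy cycle-level model. The code below counts how many clock pulses reach a module whose enable signal is active in only some cycles; it assumes, as a simplification, that clock-related switching energy is proportional to the number of delivered pulses and ignores the overhead of the gating logic itself. The enable pattern and names are hypothetical.

```python
def gated_clock_pulses(enable_per_cycle):
    """Count clock pulses that pass an AND-style clock gate, one entry per cycle."""
    return sum(1 for enabled in enable_per_cycle if enabled)


# Hypothetical workload: the module is needed in 20 out of 100 cycles.
enable_pattern = [True] * 20 + [False] * 80
delivered_pulses = gated_clock_pulses(enable_pattern)
clock_activity_saved = 1.0 - delivered_pulses / len(enable_pattern)
print(f"pulses delivered: {delivered_pulses} of {len(enable_pattern)}")
print(f"clock activity saved in this module: {clock_activity_saved:.0%}")
```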
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure: Power Manager. An OBSERVER receives observations (workload information) from the SYSTEM, and a CONTROLLER sends commands to it.]
[Figure: Power State Machine. States RUN (P = 400 mW), IDLE (P = 50 mW), and SLEEP (P = 0.16 mW); transitions are labelled “wait for interrupt” and “wait for wake-up event”, with transition times of ~10 µs, ~90 µs, and 160 ms.]
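The benefit of moving between power states can be estimated with a small calculation. The sketch below uses the three state powers read from the power state machine figure (400 mW, 50 mW, 0.16 mW) and computes the average power of a hypothetical schedule; the schedules are invented for illustration, and transition times and transition energy are ignored.

```python
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(schedule):
    """Average power in mW for a list of (state, duration_seconds) pairs.

    Transition latency and energy between states are ignored in this sketch.
    """
    total_time = sum(duration for _, duration in schedule)
    total_energy = sum(STATE_POWER_MW[state] * duration for state, duration in schedule)
    return total_energy / total_time


# Hypothetical workload: 1 s of useful work followed by 9 s without work.
always_running = average_power_mw([("RUN", 10.0)])
idle_when_free = average_power_mw([("RUN", 1.0), ("IDLE", 9.0)])
sleep_when_free = average_power_mw([("RUN", 1.0), ("SLEEP", 9.0)])
print(f"always RUN: {always_running:.1f} mW")
print(f"RUN + IDLE: {idle_when_free:.1f} mW")
print(f"RUN + SLEEP: {sleep_when_free:.3f} mW")
```

A real power manager also has to weigh the wake-up latency of the deeper state against the expected length of the idle period, which this sketch leaves out.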
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown by clock, memory, control, I/O, and datapath. Dynamic instruction statistics by arithmetic op, logical op, compare op, data move, control flow, and others.]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Microprocessors' Generations
• First generation, 1971-78
– Behind the power curve
• Second generation, 1979-85
– Becoming “real” computers
• Third generation, 1985-89
– Challenging the “establishment”
• Fourth generation, 1990-
– Architectural and performance leadership
The microprocessor today
When we say “microprocessor” today, we generally mean the shaded area of the figure.
The First Generation 1971-78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
• Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
– 3500 transistors
– First microprocessor-based computer (Micral): targeted at laboratory instrumentation, mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
– Delivered in 1974
– 4800 transistors
– Performance < 0.2 MIPS
• Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975: $297, or $395 with case; 256 bytes of memory, expandable to 64K; keyboard and floppy; 100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one (Homebrew Computer Club)
Intel 8086
• Introduced in 1978
– Performance < 0.5 MIPS
• New 16-bit architecture
– “Assembly language” compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
• In 1981, IBM introduces PC
– Based on 8088, an 8-bit bus version of 8086
Second Generation 1979-85
Becoming “real” computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
– First 32-bit architecture: initial 16-bit implementation
– First flat 32-bit address: support for paging
– General-purpose register architecture: loosely based on PDP-11
• First implementation in 1979
– 68,000 transistors
– < 1 MIPS
• Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the “establishment”
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
• Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation 1990-
Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
– True from 1985 to present
• Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to high clock rates
– More transistors: architectural ideas turn transistors into performance (responsible for about half the yearly performance growth)
• Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase bandwidth
– CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching: first multilevel cache in 1986; secondary cache sizes today 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies: early restart after cache miss (1992); nonblocking caches continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers; prefetching instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock: superscalar uses dynamic issue decision (HW driven); VLIW uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out of order
– speculative execution depending on branch prediction
• “Off-the-shelf” ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
– On-chip
– Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
• First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
• Implemented in 1991
– 1.3M transistors
– 50 MIPS
• Used primarily as attached processor (e.g., graphics)
MIPS R10000
• First speculative processor
– Instructions scheduled and executed out of order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in flight)
– Maintains precise state by completing instructions in order
• Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
• Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques (lower hardware complexity, more longer-range analysis, more machine dependence)
– Static scheduling
– Static issue (i.e., VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
• Hardware techniques (more stable performance, higher complexity, potential clock rate impact)
– Dynamic scheduling
– Dynamic issue (i.e., superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
Wide variety of approaches, both hardware and compiler intensive
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: “ILP Mountain” (simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation) and “Cache Mountain” (simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching).]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (logarithmic axis, 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers, and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistor count (logarithmic axis, 1,000 to 100,000,000) versus year, 1970 to 2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000; successive eras are labelled bit-level parallelism, instruction-level, and thread-level.]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most pronounced are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
They support 2-4 threads, but expect to get only a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
• 4 to 8 general-purpose processing engines on chip, used to execute independent programs
– Explicitly parallel programs (when possible)
– Speculatively parallel threads
• Special-purpose processing units (e.g., DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say “microprocessor” tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that “Where you stand depends on where you sit.”
In this context, starting from our positions and experiences, this is our view concerning the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then
Is the whole system of engineering education ndash not just the undergraduate curriculum ndash organized to support todayrsquos graduate for the next 40 years
We think not on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he had never seen a process that could not be speeded up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Processor technology today
The most advanced processor technology today (year 2003) is 0.10 µm = 100 nm.
Ideally, processor technology scales all physical dimensions of devices (transistors and wires) by a factor of ~0.7.
With such scaling, typical improvement figures are the following:
• 1.4 to 1.5 times faster transistors
• two times smaller transistors
• 1.35 times lower operating voltages
• three times lower switching power
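These improvement figures can be roughly cross-checked from the ~0.7 scaling factor. The sketch below is a first-order estimate that assumes area scales with the square of the linear dimension, delay and capacitance scale with the linear dimension, and the supply voltage drops by the quoted factor of 1.35; these scaling assumptions and the variable names are simplifications introduced here for illustration.

```python
linear_scale = 0.7        # shrink factor for all physical dimensions
voltage_reduction = 1.35  # "1.35 times lower operating voltages"

area_scale = linear_scale ** 2                 # transistor area relative to previous generation
speedup = 1.0 / linear_scale                   # first-order transistor speed improvement
capacitance_scale = linear_scale               # capacitance assumed proportional to dimension
switching_energy_scale = capacitance_scale / voltage_reduction ** 2  # C * V^2 per switching event
switching_energy_reduction = 1.0 / switching_energy_scale

print(f"area: x{area_scale:.2f} (about two times smaller)")
print(f"speed: x{speedup:.2f} faster")
print(f"switching energy per event: {switching_energy_reduction:.1f} times lower")
```

The estimate lands in the same range as the quoted figures: roughly half the area, about 1.4 times the speed, and a bit under three times lower switching energy per event.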
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry.
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and in integration (from ~10K to ~10M devices) in 25 years.
Microarchitecture attempts to increase both IPC and frequency.
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction, and out-of-order execution can increase instructions per cycle (IPC).
Pipelining, as a microarchitecture idea, helps to increase frequency.
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program.
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages.
With frequencies higher than 1 GHz, more than 20 pipeline stages are used.
[Figure: relative improvement of frequency, CPI, performance, and power versus pipeline depth.]
Performance of memory and CPU
Memory in a computer system is hierarchically organized.
In 1980, microprocessors were often designed without caches.
Nowadays, microprocessors often come with two levels of caches.
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987.
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed.
Relative processor/memory speed
Type of Memories
MOS memories
• RAMs (VOLATILE: power off, contents lost)
– DRAM
– SRAM
• ROMs (NON-VOLATILE: power off, contents kept)
– ROM
– EPROM
– EEPROM
– FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 to 4.5 CPU cycles per instruction retired.
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed.
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse.
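The “three out of every four cycles” claim follows from the measured cycles per instruction, read here as 4.2 to 4.5. The sketch below assumes, as a simplification, that at most one instruction retires in any productive cycle; if several instructions can retire together, the share of empty cycles is even higher. The function name is illustrative.

```python
def min_fraction_of_empty_cycles(cycles_per_instruction):
    """Lower bound on the share of cycles that retire no instruction.

    Assumes at most one instruction retires in a productive cycle.
    """
    return 1.0 - 1.0 / cycles_per_instruction


empty_low = min_fraction_of_empty_cycles(4.2)
empty_high = min_fraction_of_empty_cycles(4.5)
print(f"cycles retiring nothing: {empty_low:.0%} to {empty_high:.0%}")
```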
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams.
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy.
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns.
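The arithmetic behind these bandwidth figures can be reproduced in a few lines. This is only a sketch: it takes 1 GByte as 10^9 bytes, and the helper name is made up for illustration.

```python
def pin_bandwidth_gbytes_per_s(bytes_per_transfer: float, interval_ns: float) -> float:
    """Bytes per nanosecond equals GBytes/s when 1 GByte is taken as 1e9 bytes."""
    return bytes_per_transfer / interval_ns


today = pin_bandwidth_gbytes_per_s(16, 8)        # 16 bytes every 8 ns
scale = 2 * 2 * 2                                # clock rate x issue width x instruction streams
future = today * scale                           # bandwidth the future chip would need
equivalent = pin_bandwidth_gbytes_per_s(32, 2)   # 32 bytes every 2 ns

print(f"today: {today} GBytes/s, future: {future} GBytes/s, 32 B / 2 ns: {equivalent} GBytes/s")
```

The three doublings multiply to a factor of eight, which is why the future requirement matches a transfer of 32 bytes every 2 ns.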
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance.
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components.
Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data.
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory.
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip.
The first level is ~32 to 128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses.
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.
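To see what these hit rates mean for the average access, the following sketch applies the usual additive average-access-time model. The model itself is an assumption (each miss pays the latency of the next level on top of the current one), and the inputs are the upper-end figures quoted on the slides: 3 cycles for L1, 10 cycles for L2, about 100 cycles for main memory.

```python
def average_access_cycles(l1_cycles: float, l1_hit_rate: float,
                          l2_cycles: float, l2_hit_rate: float,
                          memory_cycles: float) -> float:
    """Average cycles per access for a two-level cache in front of main memory.

    Assumed model: every access pays the L1 latency, L1 misses also pay the
    L2 latency, and L2 misses also pay the main-memory latency.
    """
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * memory_cycles)


# Slide figures: L1 catches 95% of accesses, L2 catches 50% of L1 misses.
with_l2 = average_access_cycles(3, 0.95, 10, 0.50, 100)
# Hypothetical comparison: no second level, every L1 miss goes to memory.
without_l2 = average_access_cycles(3, 0.95, 0, 0.0, 100)

print(f"with L2: {with_l2:.1f} cycles, without L2: {without_l2:.1f} cycles")
```

Under these assumptions the second-level cache cuts the average access from about eight cycles to about six, even though it catches only half of the first-level misses.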
Memory Hierarchy Impact on Performance
Off-chip memory access may take about 100 cycles.
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
Conclusion concerning memory problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems: caches, buses, bandwidth, and latency.
Conclusion concerning memory problems - continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture-level and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change.
b) Burst width, which refers to data access granularity, can effect a 15% performance change.
c) Magnetic RAM (MRAM), a new type of memory.
Magnetic RAM (MRAM) or Nanotech RAM (NRAM)
Based on nanoscale semiconductor technology.
A nanotechnology RAM device consists of tiny carbon nanotubes.
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter on a silicon wafer.
MRAM is nonvolatile and has the fast read-and-write performance of static RAM (SRAM).
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM).
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] (100, 90, 70, and 55 nm) and density [Gb] (log scale, 0.1 to 100) of DRAM and FLASH for 2003, 2005, and 2010; labeled densities range up to 16 Gb, with SLC, MLC (2 bits/cell), and MLC (3 bits/cell) Flash]
Surpassing the Prediction from Moore's Law - DRAM vs MRAM
The famous Moore's Law predicts that memory density doubles every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
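The difference between the two doubling periods compounds quickly. The sketch below compares them over an arbitrary six-year window; the window length is chosen only for illustration, and the 1.5-year Moore's Law period assumes that is the intended reading of the slide.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Density multiplier after `years` if density doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


years = 6  # illustrative window, not a figure from the slides
moore = growth_factor(years, 1.5)   # doubling every 1.5 years
flash = growth_factor(years, 1.0)   # doubling every year (new NAND Flash growth model)

print(f"after {years} years: Moore's Law x{moore:.0f}, yearly doubling x{flash:.0f}")
```

Over six years, yearly doubling yields four times the density that an 18-month doubling period would.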
[Figure: density [Gb] (log scale, 0.1 to 100) of FLASH and DRAM from 2000 to 2010 against the Moore's Law trend; FLASH follows a two-fold density increase per year, with MLC (2 bits/cell) and MLC (3 bits/cell) points]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore's Law.
Overall memory prediction roadmap - continued
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh.
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production.
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh.
Power consumption
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai '95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: power (W, log scale from 0.01 to 100) versus year, 1980 to 1995; power dissipation grows about 4x every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage [V], power per chip [W], and VDD current [A] versus year, 1998 to 2014]
International Technology Roadmap for Semiconductors 1999 update sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA) Electronic Industries Association of Japan (EIAJ) Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
Your car starter
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
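As a rough illustration of how the terms compare, the sketch below evaluates the switching term p_t·C_L·V_dd²·f_clk, the short-circuit term I_sc·V_dd, and the leakage term I_leakage·V_dd separately. All numeric inputs are hypothetical values chosen for the example; none of them come from the slides.

```python
from dataclasses import dataclass


@dataclass
class PowerBreakdown:
    switching: float      # P1 = p_t * C_L * V_dd^2 * f_clk
    short_circuit: float  # P2 = I_sc * V_dd
    leakage: float        # P3 = I_leakage * V_dd
    static: float         # P4, minor static dissipation

    @property
    def total(self) -> float:
        return self.switching + self.short_circuit + self.leakage + self.static


def cmos_power(p_t: float, c_load: float, v_dd: float, f_clk: float,
               i_sc: float, i_leakage: float, p_static: float = 0.0) -> PowerBreakdown:
    """Evaluate each power term in watts (SI units for all inputs)."""
    return PowerBreakdown(
        switching=p_t * c_load * v_dd ** 2 * f_clk,
        short_circuit=i_sc * v_dd,
        leakage=i_leakage * v_dd,
        static=p_static,
    )


# Hypothetical operating point: 10% switching probability, 1 nF switched
# capacitance, 3.3 V supply, 100 MHz clock, 1 mA short-circuit, 10 uA leakage.
example = cmos_power(p_t=0.1, c_load=1e-9, v_dd=3.3, f_clk=100e6,
                     i_sc=1e-3, i_leakage=1e-5)
print(example, f"total = {example.total:.4f} W")
```

With these made-up numbers the switching term dominates by more than an order of magnitude, which is consistent with the slide calling it the dominant component.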
Research Efforts in Low-Power Design
$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement.
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed.
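The effect of each of the three knobs can be compared with a small sketch based on the switching-power expression p_t·C_L·V_dd²·f_CLK at a fixed clock frequency. The scaling factors below are illustrative and are not taken from the slides.

```python
def relative_switching_power(voltage_scale: float,
                             capacitance_scale: float = 1.0,
                             activity_scale: float = 1.0) -> float:
    """Switching power after scaling, relative to the original design.

    Each argument is new_value / old_value. The clock frequency is assumed
    unchanged, so power scales linearly with activity and capacitance and
    quadratically with supply voltage.
    """
    return activity_scale * capacitance_scale * voltage_scale ** 2


half_voltage = relative_switching_power(voltage_scale=0.5)
half_capacitance = relative_switching_power(1.0, capacitance_scale=0.5)
half_activity = relative_switching_power(1.0, activity_scale=0.5)

print(f"half V_dd: {half_voltage}, half C_L: {half_capacitance}, half activity: {half_activity}")
```

Halving the supply voltage leaves a quarter of the switching power, while halving the load capacitance or the switching activity leaves half of it.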
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], from 0.6 V to 5.0 V]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is decreasing the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product pt·CL is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit, or technology level.
Low Power and Low Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology.
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
[Arrow label in the figure: higher impact, more options]
Multiple Frequency on the Chip as Technique to Reduce Power
A less aggressive approach is attracting more attention.
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL generating fCLK drives four local PLL/DLL pairs (PLL 1/DLL 1 through PLL 4/DLL 4), each producing its own local clock frequencies (f11, f12, and so on)]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based and DLL-based clock generation. PLL based: phase detector, current pump (up/down), loop filter, VCO, and divider by N, with CLKREF and CLKFB inputs, feeding the regulated voltage, clock distribution & frequency multiplier, and logic of the digital system. DLL based: phase detector, current pump (up/down), loop filter, and a voltage-controlled delay line (VCDL) with delay cells DC1 to DCn and control, producing f1, 2f1, ..., nf1 for the digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating and clock distribution. Clock gating: a latch samples enable and gates clock to produce gated-clock for the target flip-flops (DFF with D, C, Q). Clock distribution: a PLL (Clk generator) distributes Clk to blocks A, B, C, and D, gated by Enable_A, Enable_B, and Enable_C, with blocks shown as activated or deactivated]
- The use of a gated clock is the most common approach to reduce energy. Unused modules are turned off by suppressing the clock to the module.
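A behavioral sketch can make the saving concrete. It assumes a latch-based gate in which the enable is sampled while the clock is low and then ANDed with the clock; the sampled waveforms and function names are illustrative, not a description of any particular circuit from the slides.

```python
def gated_clock(clock: list[int], enable: list[int]) -> list[int]:
    """Latch-based clock gating on sampled 0/1 waveforms.

    The latch is transparent while the clock is low, so enable changes
    cannot cut a high clock pulse short (no glitches on the gated clock).
    """
    latched_enable = 0
    gated = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched_enable = en          # latch follows enable while clock is low
        gated.append(clk & latched_enable)
    return gated


def count_rising_edges(signal: list[int]) -> int:
    """Number of 0 -> 1 transitions, a proxy for clocked switching activity."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)


clock = [0, 1] * 8                 # eight clock periods
enable = [1] * 8 + [0] * 8         # module is needed only for the first four periods
gated = gated_clock(clock, enable)

print(count_rising_edges(clock), count_rising_edges(gated))
```

In this example the gated module sees half as many clock edges as the free-running clock, so its flip-flops and clock wiring toggle half as often.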
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
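[Figure: power manager and power state machine. The power manager consists of an observer and a controller; it receives workload information and observations from the system and issues commands to it. The power state machine has three states, RUN (P = 400 mW), IDLE (P = 50 mW), and SLEEP (P = 0.16 mW), with transitions labeled "Wait for interrupt" and "Wait for wake-up event" and transition times of ~10 µs, ~90 µs, and 160 ms]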
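The benefit of moving between power states can be estimated with a simple time-weighted average. The sketch assumes the figure's power labels read RUN = 400 mW, IDLE = 50 mW, and SLEEP = 0.16 mW, and it ignores the time and energy spent in transitions; the duty cycles are hypothetical.

```python
# Power per state in mW, as read from the power state machine figure (assumed reading).
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(time_in_state: dict[str, float]) -> float:
    """Time-weighted average power; transition overheads are ignored."""
    total_time = sum(time_in_state.values())
    if total_time <= 0:
        raise ValueError("total time must be positive")
    energy = sum(STATE_POWER_MW[state] * t for state, t in time_in_state.items())
    return energy / total_time


# Hypothetical workload: the system is busy 10% of the time.
always_running = average_power_mw({"RUN": 1.0})
idle_when_free = average_power_mw({"RUN": 0.1, "IDLE": 0.9})
sleep_when_free = average_power_mw({"RUN": 0.1, "SLEEP": 0.9})

print(always_running, idle_when_free, sleep_when_free)
```

A real power manager also has to weigh the wake-up latency of the deeper state against the expected length of the idle period, which this estimate leaves out.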
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown: Clock, Memory, Control, I/O, Datapath (values as extracted: 12, 16, 25, 43). Dynamic instruction statistics: Arithmetic op, Logical op, Compare op, Data Move, Control Flow, Others (values as extracted: 1513, 51, 23, 43)]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First Generation 1971-78
  - Behind the power curve
• Second Generation 1979-85
  - Becoming "real" computers
• Third Generation 1985-89
  - Challenging the "establishment"
• Fourth Generation 1990-
  - Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3500 transistors
  - First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
  - Delivered in 1974
  - 4800 transistors
  - Performance < 0.2 MIPS
• Used in Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
    Homebrew Computer Club
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
  - Based on the 8088, the 8-bit bus version of the 8086
Second Generation 1979-85
• Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture, initial 16-bit implementation
  - First flat 32-bit address
    Support for paging
  - General-purpose register architecture
    Loosely based on PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics, & Apollo workstations
Third Generation 1985-89
• Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985 to present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors: architectural ideas turn transistors into performance
    Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
  - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches: add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
    Early restart after cache miss (1992)
    Nonblocking caches: continue during a cache miss (1994)
  - Cache-aware combos: computers, compilers, code writers
    Prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock
    Superscalar: uses dynamic issue decision (HW driven)
    VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  - Determine parallelism dynamically
  - Execute instructions out-of-order
  - Speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  - On-chip
  - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as attached processor (e.g., graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out-of-order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in-flight)
  - Maintains precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Use compiler-centric approach while avoiding disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software Techniques
  - Static scheduling
  - Static issue (i.e., VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
  Lower hardware complexity; more longer-range analysis; more machine dependence
• Hardware Techniques
  - Dynamic scheduling
  - Dynamic issue (i.e., superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
  More stable performance; higher complexity; potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture--ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers, and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors (log scale, 1,000 to 100,000,000) versus year, 1970 to 2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000, spanning eras of bit-level, instruction-level, and thread-level parallelism]
Microprocessors: where they go
Intel: More Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a (single) processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
Support for 2-4 threads, but expect to get only 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit".
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global cultural and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Processor Technology and Microprocessors
Process technology is the most important technology that drives the microprocessor industry.
It is characterized by growing 1000 times in frequency (from 1 MHz to 1 GHz) and integration (from ~10K to 1M devices) in 25 years.
Microarchitecture attempts to increase both IPC and frequency.
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction, and out-of-order execution can increase instructions per cycle (IPC).
Pipelining, as a microarchitecture idea, helps to increase frequency.
Modern architecture (ISA) and good optimizing compiler can reduce the number of dynamic instructions executed for a given program
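The three levers named on this slide (IPC, clock frequency, and dynamic instruction count) combine in the usual execution-time relation: time = instructions / (IPC x frequency). The Python sketch below only illustrates that relation; the function name and all numbers are hypothetical and are not measurements from the talk.

```python
def execution_time(instructions, ipc, frequency_hz):
    """Execution time in seconds: instructions / (IPC * frequency)."""
    return instructions / (ipc * frequency_hz)

# Hypothetical baseline: 1e9 dynamic instructions, IPC of 1.0, 1 GHz clock.
baseline = execution_time(1e9, 1.0, 1e9)

# Each lever, pulled on its own, halves the execution time in this model:
better_ipc = execution_time(1e9, 2.0, 1e9)          # microarchitecture
better_clock = execution_time(1e9, 1.0, 2e9)        # process, pipelining
fewer_instructions = execution_time(0.5e9, 1.0, 1e9)  # ISA, compiler

print(baseline, better_ipc, better_clock, fewer_instructions)
```

In practice the levers interact (deeper pipelines raise frequency but tend to lower IPC), which this simple model does not capture.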
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages
At frequencies higher than 1 GHz, more than 20 pipeline stages are used
[Figure: relative improvement in frequency, CPI, performance, and power versus pipeline depth (0 to 20 stages)]
Performance of memory and CPU
Memory in a computer system is hierarchically organized
In 1980 microprocessors were often designed without caches
Nowadays microprocessors often come with two levels of caches
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed
Relative processor/memory speed
Types of Memories
MOS memories
• RAMs (volatile: contents lost at power-off): DRAM, SRAM
• ROMs (non-volatile: contents kept at power-off): ROM, EPROM, EEPROM, FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 to 4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns
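The pin-bandwidth arithmetic in this anecdote can be checked directly. The sketch below uses only the figures quoted on the slide (16 bytes every 8 ns today; twice the clock rate, twice the issue width, two instruction streams); the function name is a hypothetical helper.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer, period_ns):
    """One byte per nanosecond is 1e9 bytes/s, i.e. 1 GByte/s."""
    return bytes_per_transfer / period_ns

today = bandwidth_gbytes_per_s(16, 8)    # 16 bytes every 8 ns

# Twice the clock rate, twice the issue width, two instruction streams.
scale = 2 * 2 * 2
future = today * scale

# Cross-check against the slide's "32 bytes every 2 ns".
check = bandwidth_gbytes_per_s(32, 2)

print(today, future, check)
```

The three factors multiply to 8, which takes the 2 GBytes/s of today's chip to the 16 GBytes/s quoted for the future one.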
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~32 to 128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
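The effect of the two cache levels can be estimated with the standard average-memory-access-time formula. The sketch below plugs in the slide's figures at their pessimistic ends (3-cycle L1 with a 5% miss rate, 10-cycle L2 that misses on 50% of L1 misses, about 100 cycles to main memory). It assumes that lookups are sequential and that latencies simply add, which is a simplification; the function name is hypothetical.

```python
def average_access_cycles(l1_cycles, l1_miss_rate,
                          l2_cycles, l2_miss_rate, memory_cycles):
    """Average access time for a two-level cache in front of main memory."""
    # Cost of an L1 miss: go to L2, and on an L2 miss go on to memory.
    l1_miss_penalty = l2_cycles + l2_miss_rate * memory_cycles
    return l1_cycles + l1_miss_rate * l1_miss_penalty

# Figures taken from the slide; the exact choice within each range is an assumption.
with_l2 = average_access_cycles(3, 0.05, 10, 0.5, 100)

# For comparison: the same L1 backed directly by main memory.
without_l2 = 3 + 0.05 * 100

print(with_l2, without_l2)
```

Under these assumptions the average access costs about 6 cycles with the second-level cache and about 8 cycles without it, even though an individual trip to main memory costs around 100.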
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
As conclusion concerning memory - problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems: caches, buses, bandwidth, and latency
As conclusion concerning memory - problems, continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture-level and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System-level parameters that most affect performance
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM), a new type of memory
Magnetic RAM (MRAM) or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
MRAM is nonvolatile and has the fast read-and-write performance of static RAM (SRAM)
The 10 Gbit device consists of 10 billion carbon nanotubes, each 1 nm (just a few thousand atoms) in diameter, on a silicon wafer
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM)
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] (100 nm, 90 nm, 70 nm, 55 nm) and density [Gb] (log scale, 0.1 to 100) of DRAM and FLASH for 2003, 2005, and 2010; FLASH shown as SLC and as MLC (2 bits/cell and 3 bits/cell)]
Surpassing the Prediction from Moore's Law - DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year
[Figure: density [Gb] (log scale, 0.1 to 100) of DRAM and FLASH, 2000 to 2010, compared with Moore's Law; FLASH, including MLC (2 bits/cell and 3 bits/cell), follows a two-fold density increase per year]
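The gap between the two growth models on this slide compounds quickly. The sketch below compares a doubling period of 1.5 years (the slide's reading of Moore's Law) with a doubling every year (the NAND Flash growth model); the 1 Gb starting density and the six-year horizon are hypothetical values chosen only for illustration.

```python
def density_after(years, start_gb, doubling_period_years):
    """Density after `years`, doubling once every `doubling_period_years`."""
    return start_gb * 2 ** (years / doubling_period_years)

start_gb = 1.0   # hypothetical starting density
years = 6        # hypothetical horizon

moore = density_after(years, start_gb, 1.5)   # doubles every 1.5 years
flash = density_after(years, start_gb, 1.0)   # doubles every year

print(moore, flash, flash / moore)
```

After six years the yearly-doubling model is ahead by a factor of four (64 Gb against 16 Gb from the same 1 Gb start).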
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will keep leading overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap - continued
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
Mainframe
Supercomputer
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
Power consumption
• During 1995 the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh
• During 2000 the energy consumption of all PCs installed in the USA was 10% of the total energy production
• For 2015 one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers" (Kuroda-Sakurai '95)
"By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced"
Power dissipation in time
[Figure: power (W, log scale from 0.01 to 100) versus year (1980 to 1995), growing about 4x every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage [V], power per chip [W], and VDD current [A] versus year, 1998 to 2014]
International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA), and the Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
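The terms P1 to P4 defined above can be evaluated numerically to see why the switching term usually dominates. In the sketch below the function and parameter names are hypothetical, and every operating value (activity factor, load capacitance, supply voltage, clock frequency, currents) is an illustrative assumption rather than data from the talk.

```python
def power_components(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p4=0.0):
    """Return (P1, P2, P3, P4) in watts for the CMOS power model above."""
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leak * v_dd                      # leakage current power
    return p1, p2, p3, p4                   # P4: static power, passed in directly

# Assumed operating point: activity 0.1, 1 nF switched load, 2 V supply,
# 100 MHz clock, 1 mA short-circuit current, 0.1 mA leakage current.
p1, p2, p3, p4 = power_components(0.1, 1e-9, 2.0, 100e6, 1e-3, 1e-4)
p_avg = p1 + p2 + p3 + p4

print(p1, p2, p3, p4, p_avg)
```

With these assumed values the switching term is roughly 40 mW, against about 2 mW of short-circuit power and 0.2 mW of leakage.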
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance improves both power dissipation and circuit speed
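The quadratic and linear dependencies stated above can be made concrete. The sketch below computes the switching power relative to a baseline when the supply voltage and the load capacitance are scaled, assuming activity and clock frequency stay fixed; the function name and the 5 V to 3 V example are hypothetical.

```python
def relative_switching_power(v_old, v_new, c_old=1.0, c_new=1.0):
    """Switching power after scaling, relative to before.

    Power is proportional to C * Vdd^2 when activity and clock
    frequency are held constant (an assumption of this sketch).
    """
    return (c_new / c_old) * (v_new / v_old) ** 2

# Lowering the supply from 5 V to 3 V: quadratic improvement.
voltage_only = relative_switching_power(5.0, 3.0)

# Halving the load capacitance at the same voltage: linear improvement.
capacitance_only = relative_switching_power(5.0, 5.0, c_old=1.0, c_new=0.5)

print(voltage_only, capacitance_only)
```

A 40% reduction in supply voltage leaves about 36% of the original switching power, whereas a 50% reduction in load capacitance leaves 50%.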
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], 0.6 V to 5.0 V]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is
decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product ptCL is called the average switched capacitance per cycle; this capacitance can be reduced at the system, architectural, RTL, circuit, or technology level
General Approaches to Reduce Power
Low Power and Low Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology (higher levels: higher impact, more options)
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
Multiple frequencies on the chip is a less aggressive approach that is attracting more attention
This technique is commonly used in VLSI ICs to reduce the power dissipation while maintaining the operating
[Figure: a main PLL driven by fCLK feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), which produce the on-chip frequencies f11, f12, f21, f31, f32, and f41]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based and DLL-based clock generation. PLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO, divider by N, clock distribution and frequency multiplier, regulated voltage, digital system. DLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, control, voltage-controlled delay line (VCDL, delay TVCDL) with delay cells DC1 to DCn, logic producing f1, 2f1, ..., nf1, digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating and clock distribution. Clock gating: a latch takes enable and clock and produces the gated clock that drives the target flip-flops (DFF with D, C, and Q), which are thereby activated or deactivated. Clock distribution: a PLL (clock generator) supplies Clk to blocks A, B, C, and D, gated by Enable_A, Enable_B, and Enable_C]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
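A small behavioral model can illustrate how a gated clock suppresses activity in an unused module. The sketch below assumes the common latch-based scheme, in which the enable is captured while the clock is low and then ANDed with the clock; the function names and the sample waveforms are hypothetical, and the model is cycle-level only (no timing or glitch analysis).

```python
def gated_clock(clock, enable):
    """Latch-based clock gating: gated = clock AND latched enable.

    The latch is transparent while the clock is low, so a change of
    `enable` during the high phase cannot chop the clock pulse.
    """
    latched = 0
    gated = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en          # capture enable during the low phase
        gated.append(clk & latched)
    return gated

def rising_edges(signal):
    """Count 0 -> 1 transitions, assuming the signal starts low."""
    previous, count = 0, 0
    for value in signal:
        if previous == 0 and value == 1:
            count += 1
        previous = value
    return count

clock  = [0, 1, 0, 1, 0, 1, 0, 1]
enable = [1, 1, 0, 1, 0, 0, 1, 1]
gated = gated_clock(clock, enable)

print(gated, rising_edges(clock), rising_edges(gated))
```

In this example the module sees two of the four clock edges; since the flip-flops and the clock net behind the gate switch only on those edges, their switching activity drops accordingly.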
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
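[Figure: power manager and power state machine. Power manager: an observer and a controller that take workload information and observations from the system and issue commands to it. Power state machine: RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt), and SLEEP (P = 0.16 mW, wait for wake-up event), with transition times of about 10 µs, about 90 µs, and 160 ms]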
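The benefit of moving a system between power states can be estimated from the time spent in each state. The sketch below takes the state powers as read from the figure (400 mW, 50 mW, and 0.16 mW, the last assuming a lost decimal point) and computes the average power for a given residency profile. The residency times are hypothetical, and the sketch deliberately ignores the energy and delay of the transitions themselves, which a real power manager must account for.

```python
# State powers in milliwatts, as read from the power state machine figure.
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}

def average_power_mw(time_in_state_s):
    """Time-weighted average power over a residency profile (seconds per state)."""
    total_time = sum(time_in_state_s.values())
    energy_mj = sum(STATE_POWER_MW[state] * t
                    for state, t in time_in_state_s.items())
    return energy_mj / total_time

# Hypothetical workload: busy 10% of the time.
always_on = average_power_mw({"RUN": 10.0})
managed = average_power_mw({"RUN": 1.0, "IDLE": 3.0, "SLEEP": 6.0})

print(always_on, managed)
```

For this assumed profile the average power falls from 400 mW to roughly 55 mW, which is the kind of saving a power manager aims for when the workload leaves long idle periods.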
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown: clock, memory, control, I/O, datapath. Dynamic instruction statistics: arithmetic operations, logical operations, compare operations, data move, control flow, others]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Microprocessors' Generations
• First generation, 1971-78
  - Behind the power curve
• Second generation, 1979-85
  - Becoming "real" computers
• Third generation, 1985-89
  - Challenging the "establishment"
• Fourth generation, 1990-
  - Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today we generally mean the shaded area of the figure
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3500 transistors
  - First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
  - Delivered in 1974
  - 4800 transistors
  - Performance < 0.2 MIPS
• Used in the Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
    Homebrew Computer Club
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
  - Based on the 8088, an 8-bit bus version of the 8086
Second Generation 1979-85
• Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture
    initial 16-bit implementation
  - First flat 32-bit address
    Support for paging
  - General-purpose register architecture
    Loosely based on PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985 to the present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors: architectural ideas turn transistors into performance
    Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction-level parallelism
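A yearly growth factor of 1.6x compounds dramatically over the period discussed. The sketch below computes the cumulative speedup over 15 years (1985 to 2000, a span chosen for illustration). It also shows one possible reading of "about half the yearly growth", namely an even multiplicative split between technology and architecture; that split is an assumption of this sketch, not a claim from the talk.

```python
import math

def cumulative_speedup(rate_per_year, years):
    """Total speedup after compounding a yearly growth factor."""
    return rate_per_year ** years

total = cumulative_speedup(1.6, 15)

# Assumed even multiplicative split: technology and architecture each
# contribute sqrt(1.6) per year, so their product is 1.6.
per_year_share = math.sqrt(1.6)
architecture_only = cumulative_speedup(per_year_share, 15)

print(total, per_year_share, architecture_only)
```

Under these assumptions the total gain is on the order of a thousandfold, of which architecture alone would account for a factor of a few tens.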
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
  - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
    early restart after cache miss (1992)
    nonblocking caches: continue during a cache miss (1994)
  - Cache-aware combos: computers, compilers, code writers
    prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock
    superscalar: uses dynamic issue decision (HW driven)
    VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  - determine parallelism dynamically
  - execute instructions out of order
  - speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded a 20-year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  - On-chip
  - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as an attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out of order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in flight)
  - Maintains precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Uses a compiler-centric approach while avoiding its disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques
  - Static scheduling
  - Static issue (i.e. VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
  Lower hardware complexity; more longer-range analysis; more machine dependence
• Hardware techniques
  - Dynamic scheduling
  - Dynamic issue (i.e. superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
  More stable performance; higher complexity; potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: "ILP Mountain" (simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation) and "Cache Mountain" (simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching)]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) of supercomputers, mainframes, minicomputers, and microprocessors, 1965 to 1995]
Microprocessors today: where they are and what they can do
[Figure: transistors (log scale, 1,000 to 100,000,000) versus year, 1970 to 2005, for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000, spanning the eras of bit-level, instruction-level, and thread-level parallelism]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
Support for 2-4 threads, but expect to get only a 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
4 to 8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8 x 4096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view on the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT
Discrete mathematics, not continuous math, is the underpinning of IT
It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast
Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields
More knowledge of the full spectrum will be a fundamental
Engineering is global and is performed in a holistic business context
The engineer must design under constraints that include global, cultural and business contexts, and so must understand them
These too are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he had never seen a process that could not be sped up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Process technology and microarchitecture
Microarchitecture techniques such as caches, branch prediction and out-of-order execution can increase instructions per cycle (IPC)
Pipelining, as a microarchitecture idea, helps to increase frequency
A modern architecture (ISA) and a good optimizing compiler can reduce the number of dynamic instructions executed for a given program
Frequency and performance improvements
While in-order microprocessors used four to five pipe stages, modern out-of-order microprocessors use over ten pipe stages
With frequencies higher than 1 GHz, more than 20 pipeline stages are used
[Figure: relative improvement in frequency, CPI, performance and power versus pipeline depth (0 to 20 stages)]
Performance of memory and CPU
Memory in a computer system is hierarchically organized
In 1980 microprocessors were often designed without caches
Nowadays microprocessors often come with two levels of caches
Memory Hierarchy
Processor-DRAM gap: microprocessor performance improved 35% per year until 1986 and 55% per year since 1987
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed
Relative processor/memory speed
Types of Memories
MOS memories
– RAMs (volatile: contents lost at power-off): DRAM, SRAM
– ROMs (non-volatile: contents kept at power-off): ROM, EPROM, EEPROM, FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2–4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
An anecdote – continued
If a CPU chip today needs to move 2 GBytes/s (say 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up – 32 bytes every 2 ns
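The arithmetic of the anecdote can be checked with a short sketch. The function and variable names are illustrative assumptions; the only inputs taken from the slide are 16 bytes every 8 ns today, the three factors of two, and 32 bytes every 2 ns for the future chip.

```python
def pin_bandwidth_gbytes_per_s(bytes_per_transfer, ns_per_transfer):
    """Bytes per nanosecond is numerically equal to GBytes per second."""
    return bytes_per_transfer / ns_per_transfer


# Today's chip, as stated on the slide: 16 bytes every 8 ns.
today = pin_bandwidth_gbytes_per_s(16, 8)

# Future chip: each factor doubles the demand on the pins.
clock_rate_factor = 2
issue_width_factor = 2
instruction_stream_factor = 2
future = today * clock_rate_factor * issue_width_factor * instruction_stream_factor

# The slide's equivalent formulation: 32 bytes every 2 ns.
future_as_transfers = pin_bandwidth_gbytes_per_s(32, 2)

print(today, future, future_as_transfers)
```

Both ways of expressing the future requirement give the same bandwidth, eight times today's.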
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~32–128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
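A rough sketch of what these figures imply for the average access time. The function name is an assumption, and so is the simple additive model (each level's latency is paid only by the accesses that reach it); the inputs are the upper ends of the ranges above plus the roughly 100-cycle main memory access quoted in the deck.

```python
def average_access_cycles(l1_cycles, l1_hit_rate, l2_cycles, l2_hit_rate, memory_cycles):
    """Average memory access time for a two-level cache hierarchy.

    Every access pays the L1 latency; L1 misses also pay the L2 latency;
    L2 misses additionally pay the main memory latency.
    """
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * memory_cycles)


# L1: 3 cycles, 95% hits; L2: 10 cycles, catches 50% of L1 misses; memory: 100 cycles.
with_caches = average_access_cycles(3, 0.95, 10, 0.50, 100)
without_caches = 100  # every access goes to main memory

print(round(with_caches, 2), without_caches)
```

Under these assumptions the hierarchy brings the average access to about 6 cycles, compared with about 100 cycles without caches.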
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
Conclusion concerning memory problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems – caches, busses, bandwidth and latency
Conclusion concerning memory problems – continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture-level and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System-level parameters affect performance the most:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width (the data access granularity) can effect a 15% performance change
c) Magnetic RAM (MRAM) – a new type of memory
Magnetic RAM – MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes, each 1 nm (just a few thousand atoms) in diameter, on a silicon wafer
MRAM is nonvolatile and has the fast read-and-write performance of static RAM (SRAM)
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM)
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH – MRAM
[Figure: design rule [nm] (100 nm, 90 nm, 70 nm, 55 nm) and density [Gb] (log scale, 0.1 to 100) of DRAM and FLASH (SLC, MLC 2 bits/cell, MLC 3 bits/cell) for 2003, 2005 and 2010]
Surpassing the Prediction from Moore's Law – DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year
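To see how quickly the two growth rates diverge, here is a small sketch. It reads the slide's Moore's Law figure as a doubling every 1.5 years and the new model as a doubling every year; the function name and the six-year horizon are illustrative assumptions.

```python
def density_growth(years, doubling_period_years):
    """Growth factor after `years` when density doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


horizon_years = 6  # hypothetical horizon chosen for illustration

moore_growth = density_growth(horizon_years, 1.5)  # doubling every 1.5 years
flash_growth = density_growth(horizon_years, 1.0)  # doubling every year

print(moore_growth, flash_growth, flash_growth / moore_growth)
```

Over six years the yearly doubling yields four times the density predicted by the 1.5-year doubling.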
[Figure: density [Gb] (log scale, 0.1 to 100) versus year (2000 to 2010) for FLASH (two-fold density per year; MLC 2 bits/cell and MLC 3 bits/cell) and DRAM, compared with the Moore's Law trend]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will keep leading the overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap – continued
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
“CMOS circuits dissipate little power by nature. So believed circuit designers” (Kuroda-Sakurai 95)
“By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced”
[Figure: power (W), log scale from 0.01 to 100, versus year (80, 85, 90, 95); trend line: ×4 every 3 years]
Power dissipation in time
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage [V], power per chip [W] and VDD current [A] versus year (1998 to 2014)]
International Technology Roadmap for Semiconductors 1999 update sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA) Electronic Industries Association of Japan (EIAJ) Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption – New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are

$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$

where
P1 – capacitive switching power (dynamic, dominant)
P2 – short-circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
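A minimal sketch that evaluates the terms of this expression separately. The function name and every numeric value below are hypothetical, chosen only to show how the terms combine; they are not measurements from the presentation.

```python
def cmos_power_components(p_t, c_load, v_dd, f_clk, i_sc, i_leakage, p_static=0.0):
    """Return (P1, P2, P3, P4, total) in watts for the power expression above.

    p_t       switching activity factor (dimensionless)
    c_load    load capacitance in farads
    v_dd      supply voltage in volts
    f_clk     clock frequency in hertz
    i_sc      short-circuit current in amperes
    i_leakage leakage current in amperes
    p_static  other static power dissipation in watts
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk  # capacitive switching power
    p2 = i_sc * v_dd                       # short-circuit power
    p3 = i_leakage * v_dd                  # leakage power
    p4 = p_static                          # minor static dissipation
    return p1, p2, p3, p4, p1 + p2 + p3 + p4


# Hypothetical operating point: 1 nF switched load, 2.5 V, 500 MHz.
p1, p2, p3, p4, total = cmos_power_components(
    p_t=0.1, c_load=1e-9, v_dd=2.5, f_clk=500e6, i_sc=0.01, i_leakage=0.001
)

# Halving the supply voltage cuts the switching term to one quarter.
p1_half, *_ = cmos_power_components(
    p_t=0.1, c_load=1e-9, v_dd=1.25, f_clk=500e6, i_sc=0.01, i_leakage=0.001
)

print(p1, p2, p3, p4, total, p1_half)
```

With these made-up numbers the switching term dominates, which is the situation the slide describes.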
Research Efforts in Low-Power Design

$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$

Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V] (axis marks at 0.6, 3.0 and 5.0 V)]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product pt·CL is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit or technology level
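The trade-off in point b) can be illustrated with a short sketch. The quadratic energy scaling follows from the switching power expression; the delay model, delay proportional to Vdd / (Vdd - Vt)^2, is a common first-order approximation that is NOT taken from these slides, and the threshold voltage of 0.7 V is a hypothetical value.

```python
def relative_switching_energy(v_dd, v_ref):
    """Switching energy relative to the reference supply (scales with Vdd squared)."""
    return (v_dd / v_ref) ** 2


def relative_gate_delay(v_dd, v_ref, v_t):
    """Gate delay relative to the reference supply.

    Assumes the first-order model delay ~ Vdd / (Vdd - Vt)^2, which is an
    approximation introduced here for illustration only.
    """
    if v_dd <= v_t or v_ref <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")

    def delay(v):
        return v / (v - v_t) ** 2

    return delay(v_dd) / delay(v_ref)


# Scale the supply from 5.0 V down to 3.0 V with an assumed Vt of 0.7 V.
energy_ratio = relative_switching_energy(3.0, 5.0)
delay_ratio = relative_gate_delay(3.0, 5.0, v_t=0.7)

print(round(energy_ratio, 2), round(delay_ratio, 2))
```

Under these assumptions the switching energy drops to roughly a third while the gate delay roughly doubles, which is why voltage reduction is usually paired with parallelism or pipelining to recover throughput.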
General Approaches to Reduce Power
Low Power and Low Energy System Design
(higher levels offer higher impact and more options)
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
The design of low power circuits can be tackled at different levels from system to technology
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL generates fCLK and feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), each producing its own local clock frequencies (f11, f12, f21, f31, f32, f41)]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based and DLL-based clock generation. PLL based: phase detector (CLKREF, CLKFB; up/down outputs), current pump, loop filter, VCO, divider by N, regulated voltage, clock distribution & frequency multiplier, digital system. DLL based: phase detector, current pump, loop filter, control, voltage-controlled delay line (VCDL, delay TVCDL) with delay cells DC1, DC2, ..., DCn, logic producing f1, 2f1, ..., nf1, digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating and clock distribution. Clock gating: a latch and the enable signal combine with the clock to produce a gated clock for the target flip-flops (DFF with D, C, Q), shown activated and deactivated. Clock distribution: a PLL (clock generator) drives Clk to blocks A, B, C and D through Enable_A, Enable_B and Enable_C]
– The use of gated clocks is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
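The effect of gating can be sketched by counting the clock edges that actually reach a module. This is an illustrative model only: it assumes clock power is proportional to the number of edges delivered and ignores the overhead of the gating logic; the function name and the enable pattern are hypothetical.

```python
def clock_edges_delivered(enable_per_cycle, gated):
    """Count the clock edges that reach a module's flip-flops.

    enable_per_cycle: one boolean per clock cycle, True when the module is in use.
    gated: when True, the clock is suppressed in cycles where enable is False.
    """
    if gated:
        return sum(1 for enabled in enable_per_cycle if enabled)
    return len(enable_per_cycle)


# Hypothetical activity pattern: the module is needed in 3 of 8 cycles.
enable = [True, False, False, True, False, False, False, True]

ungated_edges = clock_edges_delivered(enable, gated=False)
gated_edges = clock_edges_delivered(enable, gated=True)
saving = 1 - gated_edges / ungated_edges

print(ungated_edges, gated_edges, saving)
```

The saving equals the fraction of cycles in which the module is idle, so gating pays off most for blocks that are rarely used.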
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip is a less aggressive approach that is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
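[Figure: power manager and power state machine. Power manager: an observer and a controller use workload information, receive observations from the system and send commands to it. Power state machine: RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt) and SLEEP (P = 0.16 mW, wait for wake-up event); transition times shown: ~10 µs, ~90 µs and 160 ms]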
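A small sketch of why moving to lower-power states matters. It reads the state powers in the figure as 400 mW (RUN), 50 mW (IDLE) and 0.16 mW (SLEEP); the workload split below is hypothetical, and the cost of the transitions themselves is deliberately ignored.

```python
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(seconds_in_state):
    """Average power in mW for a mapping of state name to seconds spent there.

    Transition time and transition energy are not modeled.
    """
    total_seconds = sum(seconds_in_state.values())
    if total_seconds <= 0:
        raise ValueError("total time must be positive")
    energy_mj = sum(STATE_POWER_MW[state] * t for state, t in seconds_in_state.items())
    return energy_mj / total_seconds


# Hypothetical 10-second window: 1 s running, 3 s idle, 6 s asleep.
managed = average_power_mw({"RUN": 1.0, "IDLE": 3.0, "SLEEP": 6.0})
always_running = average_power_mw({"RUN": 10.0})

print(round(managed, 3), always_running)
```

A real power manager also has to weigh the wake-up latency from SLEEP against the time the system is expected to stay idle, which this sketch leaves out.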
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown by clock, memory, control, I/O and datapath. Dynamic instruction statistics by data move, control flow, arithmetic, compare, logical and other operations]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Microprocessors' Generations
• First Generation 1971-78
– Behind the power curve
• Second Generation 1979-85
– Becoming “real” computers
• Third Generation 1985-89
– Challenging the “establishment”
• Fourth Generation 1990-
– Architectural and performance leadership
The microprocessor today
When we say “microprocessor” today, we generally mean the shaded area of the figure
The First Generation 1971-78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
• Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
– 3500 transistors
– First microprocessor-based computer (Micral)
  Targeted at laboratory instrumentation
  Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
– Delivered in 1974
– 4800 transistors
– Performance < 0.2 MIPS
• Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
  $297, or $395 with case
  256 bytes of memory, expandable to 64K
  Keyboard and floppy
  100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
  Homebrew Computer Club
Intel 8086
• Introduced in 1978
– Performance < 0.5 MIPS
• New 16-bit architecture
– “Assembly language” compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
– Based on 8088, the 8-bit bus version of 8086
Second Generation 1979-85
Becoming “real” computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
– First 32-bit architecture, initial 16-bit implementation
– First flat 32-bit address
  Support for paging
– General-purpose register architecture
  Loosely based on PDP-11
• First implementation in 1979
– 68,000 transistors
– < 1 MIPS
• Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the “establishment”
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
• Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation 1990-
Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5 – same basic approach, but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
– True from 1985 to present
• Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to higher clock rates
– More transistors: architectural ideas turn transistors into performance
– Responsible for about half the yearly performance growth
• Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
– CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching
  First multilevel cache: 1986
  Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
  early restart after cache miss (1992)
  nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
  prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
  superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out-of-order
– speculative execution depending on branch prediction
• “Off-the-shelf” ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
– On-chip
– Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
• First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
• Implemented in 1991
– 1.3M transistors
– 50 MIPS
• Used primarily as attached processor (e.g., graphics)
MIPS R10000
• First speculative processor
– Instructions scheduled and executed out-of-order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in-flight)
– Maintains precise state by completing instructions in order
• Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
– Uses compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
• Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software Techniques
– Static scheduling
– Static issue (i.e., VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
Lower hardware complexity; more longer-range analysis; more machine dependence
Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e., superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
More stable performance; higher complexity; potential clock rate impact
Wide variety of approaches, both hardware and compiler intensive
No clear-cut winners at present
Big Picture – ILP and Memory Systems
[Figure: “ILP Mountain” (simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation) and “Cache Mountain” (simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching)]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year (1970 to 2005), with eras of bit-level, instruction-level and thread-level (?) parallelism; data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most pronounced are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a (single) processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4)
Support for 2-4 threads, but expect to get only 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model – continued
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
• 4–8 general-purpose processing engines on chip, used to execute
– Independent programs
– Explicitly parallel programs (when possible)
– Speculatively parallel threads
• Special-purpose processing units (e.g., DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrowThe microprocessor tomorrow When we say ldquomicroprocessorrdquo tomorrow we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that where you stand depends on where you sit.
In this context, starting from our positions and experiences, this is our view on the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we need now – some examples
• Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
• Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
• Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental.
• Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. They too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he had never seen a process that could not be sped up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Frequency and performance improvements
While in-order microprocessors used four to five pipeline stages, modern out-of-order microprocessors use over ten
At frequencies above 1 GHz, more than 20 pipeline stages are used
[Figure: relative improvement versus pipeline depth (0 to 20); curves for frequency, CPI, performance, and power]
Performance of memory and CPU
Memory in a computer system is hierarchically organized
In 1980, microprocessors were often designed without caches
Nowadays microprocessors often come with two levels of caches
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed
Relative processor/memory speed
Types of Memories
MOS memories
– RAMs (volatile: contents lost at power-off): DRAM, SRAM
– ROMs (non-volatile: contents kept at power-off): ROM, EPROM, EEPROM, FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2–4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
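The step from cycles per instruction to "three out of every four cycles" can be checked with a short calculation. The sketch below assumes at most one instruction retires per cycle, which gives a lower bound; the function name is illustrative and not from the original study.

```python
def min_idle_cycle_fraction(cpi: float) -> float:
    """Lower bound on the fraction of cycles that retire no instruction.

    Assumption: at most one instruction retires per cycle. A multi-issue
    machine that retires several per cycle leaves even more cycles empty.
    """
    if cpi < 1.0:
        raise ValueError("this bound only applies when CPI >= 1")
    return 1.0 - 1.0 / cpi


for cpi in (4.2, 4.5):
    fraction = min_idle_cycle_fraction(cpi)
    print(f"CPI {cpi}: at least {fraction:.0%} of cycles retire nothing")
```

Both ends of the measured range land slightly above three quarters, which matches the statement on the slide.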
An anecdote – continued
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up: that is 32 bytes every 2 ns
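The pin-bandwidth arithmetic above can be reproduced in a few lines. This is only a sketch of the slide's own numbers; the function name is an assumption introduced for illustration.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer: float, period_ns: float) -> float:
    # One byte per nanosecond equals 10^9 bytes per second, i.e. 1 GByte/s.
    return bytes_per_transfer / period_ns


today = bandwidth_gbytes_per_s(16, 8.0)        # 16 bytes every 8 ns
# Twice the clock rate, twice the issue width, two instruction streams.
future = today * 2 * 2 * 2
pin_requirement = bandwidth_gbytes_per_s(32, 2.0)  # 32 bytes every 2 ns

print(today, future, pin_requirement)
```

The three doubling factors multiply to 8, which is why 2 GBytes/s grows to 16 GBytes/s.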
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~32–128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
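The impact of the hierarchy can be illustrated with the usual average-access-time estimate. The sketch below plugs in the figures quoted on the preceding slides (L1: 2 cycles, 95% hit rate; L2: 8 cycles, catching 50% of L1 misses; memory: 100 cycles). The simple additive model and the choice of values within the quoted ranges are assumptions.

```python
def average_access_cycles(l1_hit_cycles: float,
                          l1_miss_rate: float,
                          l2_hit_cycles: float,
                          l2_local_miss_rate: float,
                          memory_cycles: float) -> float:
    """Average cycles per memory access for a two-level cache hierarchy.

    Assumed model: every access pays the L1 time, every L1 miss also pays
    the L2 time, and every L2 miss also pays the main-memory time.
    """
    l1_miss_penalty = l2_hit_cycles + l2_local_miss_rate * memory_cycles
    return l1_hit_cycles + l1_miss_rate * l1_miss_penalty


# Two levels of cache, values picked from the ranges on the slides.
with_l2 = average_access_cycles(2, 0.05, 8, 0.5, 100)
# Same L1 but no L2: every L1 miss goes straight to main memory.
without_l2 = average_access_cycles(2, 0.05, 0, 1.0, 100)

print(f"with L2: {with_l2:.1f} cycles, without L2: {without_l2:.1f} cycles")
```

Under these assumptions the second-level cache cuts the average access from about 7 cycles to about 4.9, even though it catches only half of the first-level misses.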
Conclusions concerning memory problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems: caches, buses, bandwidth, and latency
Conclusions concerning memory problems – continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture-level and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM) – a new type of memory
Magnetic RAM (MRAM) or Nanotech RAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
MRAM is nonvolatile and has the fast read-and-write performance of static RAM (SRAM)
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter on a silicon wafer
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH – MRAM
[Figure: design rule [nm] (100, 90, 70, and 55 nm) and density [Gb] (log scale, 0.1 to 100) of DRAM and FLASH (SLC, MLC 2 bits/cell, MLC 3 bits/cell) for 2003, 2005, and 2010; labelled densities range from 512 Mb to 16 Gb]
Surpassing the Prediction from Moore's Law – DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year
[Figure: density [Gb] (log scale, 0.1 to 100) versus year (2000 to 2010) for FLASH (twofold density per year; MLC 2 bits/cell and MLC 3 bits/cell) and DRAM, compared with the Moore's Law trend; labelled densities range from 512 Mb to 12 Gb]
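To see how quickly the two growth models diverge, the sketch below compounds a density that doubles every 1.5 years against one that doubles every year. The 1 Gb starting density and the six-year horizon are hypothetical values chosen for illustration, not data read from the figure.

```python
def density_after(years: float, start_gb: float, doubling_period_years: float) -> float:
    # Exponential growth: one doubling per doubling period.
    return start_gb * 2 ** (years / doubling_period_years)


START_GB = 1.0   # hypothetical starting density
HORIZON = 6      # hypothetical number of years

moore = density_after(HORIZON, START_GB, 1.5)   # doubling every 1.5 years
flash = density_after(HORIZON, START_GB, 1.0)   # doubling every year

print(f"after {HORIZON} years: {moore:.0f} Gb (1.5-year doubling) "
      f"vs {flash:.0f} Gb (yearly doubling)")
```

After six years the yearly-doubling curve is already four times ahead (64 Gb against 16 Gb under these assumptions).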
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will keep leading overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap – continued
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of total energy production
• By 2015, one expects the energy consumption of all PCs to be 15% greater than in 1995, or 69 × 10^6 MWh
Power consumption
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
Typical Low-Power Applications
“CMOS circuits dissipate little power by nature. So believed circuit designers.” (Kuroda-Sakurai 95)
“By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced.”
Power dissipation in time
[Figure: power (W), log scale from 0.01 to 100, versus year (1980 to 1995); trend of ×4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage [V] (0 to 2.5), power per chip [W] (0 to 200), and VDD current [A] (0 to 500) versus year (1998 to 2014); curves for voltage, power, and current]
International Technology Roadmap for Semiconductors 1999 update sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA) Electronic Industries Association of Japan (EIAJ) Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits (plus a minor static term, P4) are
$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$
where
P1 – capacitive switching power (dynamic, dominant)
P2 – short-circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
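The power equation can be evaluated term by term to see which component dominates. The sketch below is only an illustration: the class name, field names, and all numeric values are hypothetical and were chosen to give round results, not taken from any real chip.

```python
from dataclasses import dataclass


@dataclass
class CmosPowerParams:
    p_t: float        # switching activity factor (dimensionless)
    c_l: float        # switched load capacitance [F]
    v_dd: float       # supply voltage [V]
    f_clk: float      # clock frequency [Hz]
    i_sc: float       # average short-circuit current [A]
    i_leakage: float  # leakage current [A]
    p_4: float = 0.0  # remaining static power [W]


def power_breakdown(p: CmosPowerParams) -> dict:
    """Return P1..P4 and their sum, following P_avg = P1 + P2 + P3 + P4."""
    p1 = p.p_t * p.c_l * p.v_dd ** 2 * p.f_clk   # capacitive switching
    p2 = p.i_sc * p.v_dd                         # short circuit
    p3 = p.i_leakage * p.v_dd                    # leakage
    parts = {"P1": p1, "P2": p2, "P3": p3, "P4": p.p_4}
    parts["P_avg"] = sum(parts.values())
    return parts


# Hypothetical operating point.
example = CmosPowerParams(p_t=0.1, c_l=10e-9, v_dd=1.8, f_clk=500e6,
                          i_sc=0.2, i_leakage=0.05)
for name, watts in power_breakdown(example).items():
    print(f"{name}: {watts:.2f} W")
```

With these made-up values the switching term P1 accounts for most of the total, which is the situation the slide describes as "dominant".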
Research Efforts in Low-Power Design
$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V] (0.6 to 5.0)]
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Needs for Low-Power
Low-Power Design Techniques
The basic idea is
decreasing the activity of some parts within the VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps:
a) identifying idle or low-active conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-active components
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way to reduce power, since power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product pt·CL is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit, or technology level
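The trade-off in point b) can be made concrete with a small sketch. Switching power scales with the square of Vdd, as stated above. For the delay, the sketch assumes the widely used alpha-power-law approximation, delay ∝ Vdd / (Vdd − Vt)^α; that model, the threshold voltage Vt = 0.5 V, and α = 2 are assumptions introduced here and do not come from the slides.

```python
def relative_switching_power(v_dd: float, v_ref: float) -> float:
    # Switching power is proportional to Vdd^2 at a fixed clock frequency.
    return (v_dd / v_ref) ** 2


def relative_gate_delay(v_dd: float, v_ref: float,
                        v_t: float = 0.5, alpha: float = 2.0) -> float:
    """Gate delay at v_dd relative to v_ref under the assumed delay model."""
    if v_dd <= v_t or v_ref <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")

    def delay(v: float) -> float:
        return v / (v - v_t) ** alpha

    return delay(v_dd) / delay(v_ref)


# Hypothetical example: halving the supply from 3.0 V to 1.5 V.
power_ratio = relative_switching_power(1.5, 3.0)
delay_ratio = relative_gate_delay(1.5, 3.0)
print(f"power x{power_ratio:.2f}, delay x{delay_ratio:.3f}")
```

Under these assumptions, halving the supply cuts switching power to a quarter while gates become roughly three times slower, which is why voltage reduction is usually paired with parallelism or pipelining to recover throughput.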
General Approaches to Reduce Power
Low Power and Low Energy System Design
[Figure: design levels, with higher impact and more options towards the upper levels]
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology
Multiple Frequency on the Chip as a Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a main PLL generating fCLK feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), which produce the local frequencies f11, f12, f21, f31, f32, and f41]
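Energy Minimization Using Multiple Frequency
[Figure, PLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO, and divider by N; the loop supplies a regulated voltage and, through clock distribution and a frequency multiplier, the clock for the logic of the digital system]
[Figure, DLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, and a voltage-controlled delay line (VCDL, delay TVCDL) built from delay cells DC1, DC2, ..., DCn between in and out; a control block produces f1, 2f1, ..., nf1 for the digital system]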
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure, clock gating: a latch driven by the enable and clock signals produces the gated clock for the target D flip-flops (inputs D and C, output Q); waveforms of clock, enable, and gated clock show activated and deactivated intervals]
[Figure, clock distribution: a PLL clock generator distributes Clk to blocks A, B, C, and D, gated by Enable_A, Enable_B, and Enable_C]
– The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
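A small behavioural sketch may help show what a latch-based gated clock does. The model below assumes the common arrangement in which the latch is transparent while the clock is low and the gated clock is the AND of the clock and the latched enable; that polarity and the sample waveforms are assumptions, since the slide only names the latch, enable, clock, and gated-clock signals.

```python
def gated_clock_trace(clock: list, enable: list) -> list:
    """Sample-by-sample model of a latch-based clock gate.

    Assumed behaviour: the latch follows `enable` while the clock is low
    and holds its value while the clock is high, so a change of `enable`
    during the high phase cannot chop or create a clock pulse.
    """
    latched_enable = 0
    gated = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched_enable = en
        gated.append(clk & latched_enable)
    return gated


def count_rising_edges(signal: list) -> int:
    previous = [0] + signal[:-1]
    return sum(1 for prev, cur in zip(previous, signal) if prev == 0 and cur == 1)


clock = [0, 1, 0, 1, 0, 1, 0, 1]
enable = [1, 1, 0, 1, 0, 0, 1, 0]
gated = gated_clock_trace(clock, enable)

print(gated)
print(count_rising_edges(clock), "clock pulses,",
      count_rising_edges(gated), "reach the gated module")
```

Every suppressed pulse is a cycle in which the flip-flops of the gated module do not toggle their clock inputs, which is where the energy saving comes from.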
Energy Minimization Using Multiple Supply Voltages
• Using multiple supply voltages on the chip is a less aggressive approach that is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
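[Figure, power state machine: states RUN (P = 400 mW), IDLE (P = 50 mW), and SLEEP (P = 0.16 mW); transitions labelled ~10 µs, ~90 µs, and 160 ms; edge labels: wait for interrupt, wait for wake-up event]
[Figure, power manager: an observer and a controller; the power manager receives workload information and observations from the system and sends commands to it]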
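Whether it pays to enter SLEEP instead of staying in IDLE depends on how long the idle period lasts. The sketch below estimates the break-even idle time from the state powers in the figure. Reading the short transition labels as about 90 microseconds to shut down and 160 ms to wake up, and charging RUN power during both transitions, are assumptions made for this illustration.

```python
P_RUN = 0.400        # W
P_IDLE = 0.050       # W
P_SLEEP = 0.00016    # W
T_SHUTDOWN = 90e-6   # s, assumed transition time into SLEEP
T_WAKEUP = 0.160     # s, assumed transition time back to RUN
T_TRANSITIONS = T_SHUTDOWN + T_WAKEUP


def energy_staying_idle(idle_time: float) -> float:
    return P_IDLE * idle_time


def energy_sleeping(idle_time: float) -> float:
    """Energy if the whole idle period is used for shutdown, sleep, wake-up.

    Assumption: the transitions draw RUN power and fit inside the period.
    """
    if idle_time < T_TRANSITIONS:
        raise ValueError("idle period is too short to enter and leave SLEEP")
    return P_RUN * T_TRANSITIONS + P_SLEEP * (idle_time - T_TRANSITIONS)


def break_even_idle_time() -> float:
    # Solve energy_staying_idle(T) == energy_sleeping(T) for T.
    return (P_RUN - P_SLEEP) * T_TRANSITIONS / (P_IDLE - P_SLEEP)


t_be = break_even_idle_time()
print(f"sleeping saves energy for idle periods longer than about {t_be:.2f} s")
```

A power manager that can predict idle periods longer than this threshold should choose SLEEP; for shorter periods the transition cost outweighs the saving and IDLE is the better state.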
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure, power breakdown: pie chart with slices for clock, memory, control, I/O, and datapath; slice values 12, 16, 25, and 43]
[Figure, dynamic instruction statistics: pie chart with slices for arithmetic, logical, compare, data move, control flow, and other operations]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Microprocessors' Generations
• First Generation, 1971-78
– Behind the power curve
• Second Generation, 1979-85
– Becoming “real” computers
• Third Generation, 1985-89
– Challenging the “establishment”
• Fourth Generation, 1990-
– Architectural and performance leadership
The microprocessor today
When we say “microprocessor” today, we generally mean the shaded area of the figure
The First Generation, 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
• Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
– 3500 transistors
– First microprocessor-based computer (Micral)
  Targeted at laboratory instrumentation
  Mostly sold in Europe
Intel 8080
• 8-bit architecture with 16-bit addressing
– Delivered in 1974
– 4800 transistors
– Performance < 0.2 MIPS
• Used in the Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
  $297, or $395 with case
  256 bytes of memory, expandable to 64K
  Keyboard and floppy
  100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
  Homebrew Computer Club
Intel 8086
• Introduced in 1978
– Performance < 0.5 MIPS
• New 16-bit architecture
– “Assembly language” compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
– Based on the 8088, an 8-bit bus version of the 8086
Second Generation, 1979-85
• Becoming “real” computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
– First 32-bit architecture
  initial 16-bit implementation
– First flat 32-bit address
  Support for paging
– General-purpose register architecture
  Loosely based on PDP-11
• First implementation in 1979
– 68,000 transistors
– < 1 MIPS
• Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation, 1985-89
• Challenging the “establishment”
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
• Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation, 1990-
• Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
– True from 1985 to present
• Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to high clock rates
– More transistors: architectural ideas turn transistors into performance
  Responsible for about half the yearly performance growth
• Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism
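A growth rate of 1.6x per year compounds quickly. The sketch below shows the implied doubling time and the cumulative gain over a decade; the function names are illustrative, and the ten-year horizon is an arbitrary choice.

```python
import math

ANNUAL_GROWTH = 1.6  # performance multiplier per year, as quoted above


def cumulative_speedup(years: float, annual_growth: float = ANNUAL_GROWTH) -> float:
    return annual_growth ** years


def doubling_time_years(annual_growth: float = ANNUAL_GROWTH) -> float:
    # Solve annual_growth ** t == 2 for t.
    return math.log(2) / math.log(annual_growth)


print(f"doubling time: {doubling_time_years():.2f} years")
print(f"speedup after 10 years: {cumulative_speedup(10):.0f}x")
```

At this rate performance doubles roughly every year and a half and grows by about two orders of magnitude in ten years.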
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
– CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching
  First multilevel cache: 1986
  Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
  early restart after cache miss (1992)
  nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos (computers, compilers, code writers): prefetching instructions bring data into the cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
  superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out-of-order
– speculative execution depending on branch prediction
• “Off-the-shelf” ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
– On-chip
– Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
• First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
• Implemented in 1991
– 1.3M transistors
– 50 MIPS
• Used primarily as attached processor (e.g., graphics)
MIPS R10000
• First speculative processor
– Instructions scheduled and executed out-of-order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in-flight)
– Maintain precise state by completing instructions in order
• Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
• Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques
– Static scheduling
– Static issue (i.e., VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
  Lower hardware complexity; more longer-range analysis; more machine dependence
• Hardware techniques
– Dynamic scheduling
– Dynamic issue (i.e., superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
  More stable performance; higher complexity; potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture – ILP and Memory Systems
[Figure: two “mountains” of techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers, and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year (1970 to 2005) for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000; successive eras are labelled bit-level parallelism, instruction-level parallelism, and thread-level parallelism]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4)
Support for 2-4 threads, but expect to get only a 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC)
The performance of small-scale CMP scales close to linear with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model – continued
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
• 4–8 general-purpose processing engines on chip, used to execute:
– independent programs
– explicitly parallel programs (when possible)
– speculatively parallel threads
• Special-purpose processing units (e.g., DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures

Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8 x 4096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education – not just the undergraduate curriculum – organized to support today's graduate for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change – so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
But, as we said earlier, engineering is changing
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. They too are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
Web-based teaching, distance learning, electronic books and interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Performance of memory and CPU
Memory in a computer system is hierarchically organized
In 1980 microprocessors were often designed without caches
Nowadays microprocessors often come with two levels of caches
Memory Hierarchy
Processor-DRAM Gap
Microprocessor performance has improved 55% per year since 1987, and 35% per year until 1986
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed
Relative processor/memory speed
Types of Memories
MOS memories
• RAMs (volatile: contents are lost at power-off): DRAM, SRAM
• ROMs (non-volatile: contents are kept at power-off): ROM, EPROM, EEPROM, FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 – 4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up – 32 bytes every 2 ns
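The pin-bandwidth arithmetic in this anecdote can be checked with a few lines of Python. This is only an illustrative sketch: the function and variable names are our own, and it assumes 1 GByte = 10^9 bytes, so that bytes per nanosecond equal GBytes per second.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer, period_ns):
    """Bandwidth in GBytes/s; one byte per nanosecond is 10^9 bytes per second."""
    return bytes_per_transfer / period_ns

# Today's chip: 16 bytes every 8 ns.
today = bandwidth_gbytes_per_s(16, 8)

# Future chip: twice the clock rate, twice the issue width, two instruction streams.
scaling_factors = {"clock rate": 2, "issue width": 2, "instruction streams": 2}
future = today
for factor in scaling_factors.values():
    future *= factor

# The same demand expressed at the pins: 32 bytes every 2 ns.
pins = bandwidth_gbytes_per_s(32, 2)

print(today, future, pins)
```

The three factors of two multiply to a factor of eight, which is why the 2 GBytes/s requirement grows to 16 GBytes/s.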
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger, but slower and cheaper cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~ 32 – 128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
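To see what these hit rates mean for the average access time, the following Python sketch evaluates a simple two-level model. It is an assumption-laden illustration, not a measured result: it picks 2 cycles for the first level and 8 cycles for the second level (values inside the ranges quoted above), uses the roughly 100-cycle main memory access mentioned earlier, and assumes each level is searched only after the previous one misses.

```python
def average_access_cycles(l1_hit_rate, l1_cycles, l2_hit_rate, l2_cycles, memory_cycles):
    """Average cycles per access when L1, then L2, then main memory are tried in turn."""
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * memory_cycles)

# Assumed operating point: 95% L1 hits, 50% of L1 misses caught by L2.
two_levels = average_access_cycles(0.95, 2, 0.50, 8, 100)

# Same L1, but every L1 miss goes straight to main memory (no second level).
one_level = average_access_cycles(0.95, 2, 0.0, 0, 100)

print(two_levels, one_level)
```

Under these assumptions the average access costs about 4.9 cycles with both levels and about 7 cycles with only the first level, compared with about 100 cycles for an uncached access.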
Memory Hierarchy Impact on Performance
Off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
As conclusion concerning memory - problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems – caches, busses, bandwidth and latency
As conclusion concerning memory – problems continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One can expect that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM) – a new type of memory
Magnetic RAM – MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
MRAM is nonvolatile and has the fast read-and-write performance of static RAM (SRAM)
The 10 Gbit devices consist of 10 billion carbon nanotubes that are 1 nm – just a few thousand atoms – in diameter on a silicon wafer
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM)
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] (scale 0 – 120) and density [Gb] (logarithmic scale 0.1 – 100) versus year (2003, 2005, 2010) for FLASH and DRAM. Labelled points include design rules of 100 nm, 90 nm, 70 nm and 55 nm, densities from 512 Mb to 16 Gb, and SLC, MLC (2 bits/cell) and MLC (3 bits/cell) FLASH.]
Surpassing the Prediction from Moore's Law – DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year
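[Figure: memory density [Gb] (logarithmic scale 0.1 – 100) versus year (2000, 2005, 2010). Curves: FLASH with two-fold density growth per year, including MLC (2 bits/cell) and MLC (3 bits/cell) devices; DRAM; and the Moore's Law trend. Labelled densities include 512 Mb, 1 Gb, 2 Gb, 4 Gb and 12 Gb.]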
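The difference between a density doubling every 1.5 years and a doubling every year compounds quickly. The short Python sketch below compares the two growth models over an arbitrarily chosen six-year span; the function name and the six-year horizon are illustrative assumptions, not values from the slides.

```python
def growth_factor(years, doubling_period_years):
    """Density multiplier after `years` when density doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)

years = 6  # assumed horizon, chosen only for illustration

moore_trend = growth_factor(years, 1.5)  # doubling every 1.5 years
flash_trend = growth_factor(years, 1.0)  # doubling every year

print(moore_trend, flash_trend, flash_trend / moore_trend)
```

Over six years the yearly-doubling curve ends up four times above the 1.5-year-doubling curve (64x versus 16x), which is the kind of divergence the figure illustrates.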
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap - continued
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point?
Power consumption
• During 1995 the energy consumption of all PC machines installed in the USA was 60 × 10^6 MWh
• During 2000 the energy consumption of all PC machines installed in the USA was 10% of the total energy production
• For 2015 one expects that the energy consumption of all PC machines will be 15% greater than in 1995, or 69 × 10^6 MWh
Typical Low-Power Applications
• battery operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers" (Kuroda-Sakurai 95)
"By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced"
Power dissipation in time
[Figure: power (W), logarithmic scale from 0.01 to 100, versus year (80, 85, 90, 95); the trend is x4 every 3 years.]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage [V] (scale 0 – 2.5), power per chip [W] and VDD current [A] (scales running from 0 to 200 and from 0 to 500) versus year (1998, 2002, 2006, 2010, 2014), with curves for Current, Power and Voltage.]
International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA) and the Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: Your car starter]
Power Consumption – New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 – capacitive switching power (dynamic, dominant)
P2 – short circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
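The quadratic effect of the supply voltage follows directly from the switching-power term pt · CL · Vdd² · fCLK. The Python sketch below evaluates that term for one operating point and for the same circuit at half the supply voltage and at half the load capacitance. All numeric values (switching activity 0.2, 1 nF of switched capacitance, 3.3 V, 100 MHz) are hypothetical and serve only to illustrate the scaling.

```python
def switching_power_w(p_t, c_load_f, v_dd_v, f_clk_hz):
    """Capacitive switching power: p_t * C_L * Vdd^2 * f_CLK, in watts."""
    return p_t * c_load_f * v_dd_v ** 2 * f_clk_hz

# Hypothetical operating point, chosen only for illustration.
P_T = 0.2          # switching activity factor
C_LOAD_F = 1e-9    # total load capacitance in farads
V_DD_V = 3.3       # supply voltage in volts
F_CLK_HZ = 100e6   # clock frequency in hertz

baseline = switching_power_w(P_T, C_LOAD_F, V_DD_V, F_CLK_HZ)
half_voltage = switching_power_w(P_T, C_LOAD_F, V_DD_V / 2, F_CLK_HZ)
half_capacitance = switching_power_w(P_T, C_LOAD_F / 2, V_DD_V, F_CLK_HZ)

print(half_voltage / baseline)       # quadratic dependence on Vdd
print(half_capacitance / baseline)   # linear dependence on C_L
```

Halving Vdd leaves one quarter of the switching power, while halving the load capacitance leaves one half, which is why supply-voltage reduction is singled out as the most effective knob.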
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V]; the supply voltage axis is marked at 0.6, 3.0 and 5.0.]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is
Decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
a) Reduction in fCLK is an option that is acceptable when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product ptCL is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit or technology level
General Approaches to Reduce Power
Low Power and Low Energy System Design
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
(Figure annotation: higher impact, more options)
The design of low power circuits can be tackled at different levels, from system to technology
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach, which attracts more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four clock domains, each with its own PLL/DLL pair (PLL 1/DLL 1 to PLL 4/DLL 4) and frequency pair (f11/f12, f21/f21, f31/f32, f41/f41).]
Energy Minimization Using Multiple Frequency
[Figure: two clock-generation schemes. PLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO with regulated voltage, divider by N, and clock distribution & frequency multiplier feeding the digital system. DLL based: phase detector, current pump, loop filter, and a voltage-controlled delay line (VCDL, delay TVCDL) with delay cells DC1 ... DCn, control and logic producing f1, 2f1, ..., nf1 for the digital system.]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating and clock distribution. Clock gating: a latch combines enable and clock into a gated clock that drives the target flip-flops (DFF with D, C, Q), which are thereby activated or deactivated. Clock distribution: the Clk output of a PLL (clock generator) is distributed to blocks A, B, C and D, with Enable_A, Enable_B and Enable_C gating individual branches.]
- The use of a gated clock is the most common approach to reduce energy. Unused modules are turned off by suppressing the clock to the module
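The effect of a gated clock can be illustrated with a small behavioural model. The Python sketch below is only an assumption-based illustration of a latch-based clock gate (the enable is sampled while the clock is low and then ANDed with the clock); the waveforms, function names and the choice of a latch-based gate are ours, not taken from the figure in detail.

```python
def gated_clock(clock, enable):
    """Latch-based clock gate: enable is latched while clock is low, then ANDed with clock."""
    latched_enable = 0
    gated = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched_enable = en  # the latch is transparent during the low phase
        gated.append(clk & latched_enable)
    return gated

def rising_edges(signal):
    """Count 0 -> 1 transitions, assuming the signal starts from 0."""
    previous = [0] + signal[:-1]
    return sum(1 for prev, cur in zip(previous, signal) if prev == 0 and cur == 1)

# Eight clock cycles; the module is needed only in the first two and last two cycles.
clock = [0, 1] * 8
enable = [1] * 4 + [0] * 8 + [1] * 4

gated = gated_clock(clock, enable)
print(rising_edges(clock), rising_edges(gated))
```

In this toy waveform the flip-flops behind the gate see only four of the eight clock edges, so their clock-related switching activity is halved while the module is idle.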
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
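A rough feel for the saving can be obtained from the C · V² energy per switching event. The Python sketch below assigns the high voltage to a module on the critical path and a lower voltage to the others. The module names, capacitances and the two voltage levels are hypothetical, and the sketch ignores level-shifter overhead and the extra delay of the low-voltage modules.

```python
V_HIGH_V = 3.3  # assumed supply for critical-path modules
V_LOW_V = 2.0   # assumed supply for noncritical modules

# (name, switched capacitance in farads, on critical path?) - hypothetical modules
modules = [
    ("alu", 10e-12, True),
    ("control", 5e-12, False),
    ("io", 5e-12, False),
]

def switching_energy_j(capacitance_f, v_dd_v):
    """Energy per switching event, proportional to C * Vdd^2."""
    return capacitance_f * v_dd_v ** 2

single_supply = sum(switching_energy_j(c, V_HIGH_V) for _, c, _ in modules)
multi_supply = sum(
    switching_energy_j(c, V_HIGH_V if critical else V_LOW_V)
    for _, c, critical in modules
)
saving = 1.0 - multi_supply / single_supply

print(f"energy saving with two supplies: {saving:.1%}")
```

Only the noncritical half of the capacitance moves to the lower supply in this example, yet the total switching energy still drops by roughly a third.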
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
[Figure, Power Manager: an OBSERVER collects observations and workload information from the SYSTEM, and a CONTROLLER issues commands to it.]
[Figure, Power State Machine: states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt) and SLEEP (P = 0.16 mW, wait for wake-up event); transition times of ~10 µs, ~90 µs and 160 ms.]
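A power manager of this kind has to decide whether an idle period is long enough to justify entering SLEEP. The Python sketch below computes a break-even idle time from the power levels of the state machine. Several points are assumptions made only for illustration: that entering SLEEP takes about 90 µs, that waking up takes 160 ms, and that the system draws RUN power during both transitions.

```python
# Power levels from the power state machine (watts).
P_RUN_W = 0.400
P_IDLE_W = 0.050
P_SLEEP_W = 0.00016

# Assumed mapping of the transition labels (seconds).
T_ENTER_SLEEP_S = 90e-6  # assumption: time to enter SLEEP
T_WAKE_UP_S = 160e-3     # assumption: time to return from SLEEP to RUN

def transition_time_s():
    """Total time spent entering and leaving SLEEP."""
    return T_ENTER_SLEEP_S + T_WAKE_UP_S

def energy_if_idle_j(idle_period_s):
    """Energy spent when the system simply waits in IDLE."""
    return P_IDLE_W * idle_period_s

def energy_if_sleep_j(idle_period_s):
    """Energy spent when sleeping; transitions are charged at RUN power (assumption)."""
    t_tr = transition_time_s()
    if idle_period_s < t_tr:
        raise ValueError("idle period is shorter than the sleep round trip")
    return P_RUN_W * t_tr + P_SLEEP_W * (idle_period_s - t_tr)

def break_even_s():
    """Idle period for which IDLE and SLEEP cost the same energy."""
    return transition_time_s() * (P_RUN_W - P_SLEEP_W) / (P_IDLE_W - P_SLEEP_W)

def choose_state(predicted_idle_s):
    """Policy sketch: sleep only when the predicted idle period exceeds break-even."""
    return "SLEEP" if predicted_idle_s > break_even_s() else "IDLE"

print(f"break-even idle period: {break_even_s():.2f} s")
```

Under these assumptions the break-even point is a little over one second, so the controller should send the system to SLEEP only when the observer predicts a longer idle period.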
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown: Clock, Memory, Control, I/O and Datapath, with shares of 12, 16, 25 and 43 shown. Dynamic instruction statistics: Arithmetic op, Control Flow, Data Move, Compare op, Logical op and Others, with shares of 15, 13, 5, 1, 23 and 43 shown.]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline – Microprocessors' Generations
• First generation: 1971-78 – Behind the power curve
• Second generation: 1979-85 – Becoming "real" computers
• Third generation: 1985-89 – Challenging the "establishment"
• Fourth generation: 1990- – Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  – Narrow datapaths (= slow performance)
  – Awkward architectures
  – Assembly language + some BASIC
• Processors
  – Intel 4004, 8008, 8080, 8086
  – Zilog Z-80
  – Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  – 3500 transistors
  – First microprocessor-based computer (Micral): targeted at laboratory instrumentation, mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
  – Delivered in 1974
  – 4800 transistors
  – Performance < 0.2 MIPS
• Used in Altair 8800 system
  – Kit form (advertised in Popular Electronics) in 1975: $297, or $395 with case
  – 256 bytes of memory, expandable to 64K
  – Keyboard and floppy
  – 100-line bus becomes S-100, the first microcomputer bus
  – Gates & Allen write BASIC
  – Wozniak builds one: Homebrew Computer Club
Intel 8086
• Introduced in 1978
  – Performance < 0.5 MIPS
• New 16-bit architecture
  – "Assembly language" compatible with 8080
  – 29,000 transistors
  – Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
  – Based on the 8088, an 8-bit bus version of the 8086
Second Generation 1979-85
• Becoming "real" computers
  – First 32-bit architecture (68000)
  – First virtual memory support
  – Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  – Motorola 68000, 68020
  – Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  – First 32-bit architecture, initial 16-bit implementation
  – First flat 32-bit address; support for paging
  – General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
  – 68,000 transistors
  – < 1 MIPS
• Used in
  – Apple Mac
  – Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the "establishment"
  – Microprocessors surpass minicomputers in performance, rival mainframes
  – Implementation technology of choice: all new architectures are microprocessors
  – RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  – MIPS R2000, R3000
  – Sun SPARC
  – HP PA-RISC
MIPS R2000
• Several firsts
  – First RISC microprocessor
  – First microprocessor to provide integrated support for instruction & data cache
  – First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  – 125,000 transistors
  – 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
  – First 64-bit architecture
  – First multiple-issue machine
  – First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  – Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5 – same basic approach, but faster clock rates & wider issue
  – Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  – True from 1985-present
• Combination of technology and architectural enhancements
  – Technology provides faster transistors and more of them
  – Faster transistors lead to high clock rates
  – More transistors: architectural ideas turn transistors into performance, responsible for about half the yearly performance growth
• Two key architectural directions
  – Sophisticated memory hierarchies
  – Exploiting instruction level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
  – CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  – On-chip: from 128 bytes (1984) to 100K+ bytes
  – Multilevel caches add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  – Reduce or hide cache miss latencies
    early restart after cache miss (1992)
    nonblocking caches: continue during a cache miss (1994)
  – Cache aware combos: computers, compilers, code writers; prefetching instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  – Overlapping execution in a pipeline
  – Issuing multiple instructions per clock
    superscalar: uses dynamic issue decision (HW driven)
    VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple issue microprocessors
• 1995: sophisticated dynamic schemes
  – determine parallelism dynamically
  – execute instructions out-of-order
  – speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  – On-chip
  – Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  – Deep pipeline
  – 1.4M transistors
  – Initially 100 MHz
  – > 50 MIPS
Intel i860
• First multiple issue microprocessor
  – 2 instructions/clock
  – Dual issue mode
  – Novel push pipeline
  – Novel cache bypass
• Implemented in 1991
  – 1.3M transistors
  – 50 MIPS
• Used primarily as attached processor (e.g., graphics)
MIPS R10000
• First speculative processor
  – Instructions scheduled and executed out-of-order
  – Up to 4 instructions can complete per clock
  – Window of 32 instructions (up to 32 in-flight)
  – Maintain precise state by completing instructions in order
• Implemented in 1996
  – 6.8M transistors
  – 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  – Use compiler centric approach while avoiding disadvantages
  – Parallelism demarcated by the compiler
  – Many special instructions & features for exploiting ILP in the compiler
• Itanium
  – First implementation (2001)
  – 25M transistors
  – 800 MHz
  – 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques
  – Static scheduling
  – Static issue (i.e., VLIW)
  – Static branch prediction
  – Alias/pointer analysis
  – Static speculation
  Lower hardware complexity; more longer range analysis; more machine dependence
• Hardware techniques
  – Dynamic scheduling
  – Dynamic issue (i.e., superscalar)
  – Dynamic branch prediction
  – Dynamic disambiguation
  – Dynamic speculation
  More stable performance; higher complexity; potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear cut winners at the present
Big Picture--ILP and Memory Systems
• ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation
• Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (logarithmic scale 0.1 – 100) versus year (1965 – 1995) for Supercomputers, Mainframes, Minicomputers and Microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistor count (logarithmic scale 1,000 – 100,000,000) versus year (1970 – 2005) for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000, with eras of bit-level, instruction-level and thread-level parallelism.]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most pronounced are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a (single) processor without much effort?
Simultaneous multithreading (SMT) or Hyperthreading is a solution
Parallel Thread Sequencing ModelParallel Thread Sequencing Model
Principles of SMTPrinciples of SMT
Multithreading in todayrsquos Multithreading in todayrsquos processorsprocessors
Today many high-end microprocessors are multithreaded (eg Intel Pentium 4)
Support for 2-4 threads but expect to get only 13X improvement in throughput
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa-2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute:
- independent programs
- explicitly parallel programs (when possible)
- speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view on the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education - not just the undergraduate curriculum - organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is changing rapidly and that engineering education is not keeping up.
Our experience is primarily in information technology (in both academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change - so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These, too, are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the currently cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching,
• distance learning,
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Memory Hierarchy
Processor-DRAM gap: microprocessor performance improved by 35% per year until 1986 and has improved by 55% per year since 1987.
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed.
Relative processor/memory speed
Types of Memories
MOS memories
- RAMs: DRAM, SRAM. Volatile: when power is off, contents are lost.
- ROMs: ROM, EPROM, EEPROM, FLASH. Non-volatile: when power is off, contents are kept.
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2-4.5 CPU cycles per instruction retired.
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed.
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse.
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width and two instruction streams.
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy.
It is not clear whether pin bandwidth can keep up - 32 bytes every 2 ns?
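The arithmetic behind the anecdote can be checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and the figures are read as 2 GBytes/s and 16 GBytes/s (bytes per second), which is an interpretation of the slide text.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer: float, period_ns: float) -> float:
    """Bytes per nanosecond is numerically equal to GBytes/s (10^9 bytes per second)."""
    return bytes_per_transfer / period_ns


# Today's chip, as quoted on the slide: 16 bytes every 8 ns.
today = bandwidth_gbytes_per_s(16, 8)

# Future chip: twice the clock rate, twice the issue width, two instruction streams.
scaling = 2 * 2 * 2
future = today * scaling

# The slide's equivalent formulation: 32 bytes every 2 ns.
per_transfer_view = bandwidth_gbytes_per_s(32, 2)

print(f"today: {today} GBytes/s, future: {future} GBytes/s, "
      f"32 bytes / 2 ns: {per_transfer_view} GBytes/s")
```

The three doublings multiply to a factor of eight, which is how 2 GBytes/s becomes 16 GBytes/s.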
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance.
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components.
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data.
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory.
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip.
The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses.
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles.
A cache miss that eventually has to go to main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
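To see how much the two cache levels help, the figures quoted above can be combined into an average memory access time. The sketch below is an illustration, not a measurement. It assumes that the hit rates are percentages (95% and 50%), it picks single values from the quoted ranges (2 cycles for the first level, 8 cycles for the second), and it treats "over 50%" as exactly 50%. The function name is hypothetical.

```python
def average_access_cycles(l1_hit_cycles: float, l1_hit_rate: float,
                          l2_hit_cycles: float, l2_local_hit_rate: float,
                          memory_cycles: float) -> float:
    """Average memory access time for a two-level cache hierarchy, in cycles."""
    # Cost of an L1 miss: look in L2, and on an L2 miss go to main memory.
    l1_miss_penalty = l2_hit_cycles + (1.0 - l2_local_hit_rate) * memory_cycles
    return l1_hit_cycles + (1.0 - l1_hit_rate) * l1_miss_penalty


# Assumed values, chosen from the ranges quoted on the slides.
amat = average_access_cycles(
    l1_hit_cycles=2, l1_hit_rate=0.95,
    l2_hit_cycles=8, l2_local_hit_rate=0.50,
    memory_cycles=100,
)
print(f"average access time: {amat:.1f} cycles (versus 100 cycles without caches)")
```

Under these assumptions the average access costs roughly 5 cycles instead of 100, which is why the structure of the hierarchy dominates performance.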
Conclusion concerning memory - problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems: caches, buses, bandwidth and latency.
Conclusion concerning memory - problems (continued)
If the memory research community were to follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change.
b) Burst width, which refers to data access granularity, can effect a 15% performance change.
c) Magnetic RAM (MRAM) - a new type of memory
Magnetic RAM (MRAM) or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes.
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM).
The 10 Gbit devices consist of 10 billion carbon nanotubes that are 1 nm - just a few thousand atoms - in diameter, on a silicon wafer.
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] (axis 0-120; points at 100 nm, 90 nm, 70 nm and 55 nm) and density [Gb] (log scale, 0.1 to 100) versus year (2003, 2005, 2010) for FLASH and DRAM; density labels 512 Mb, 1 Gb, 2 Gb, 4 Gb, 8 Gb and 16 Gb, with SLC, MLC (2 bits/cell) and MLC (3 bits/cell) flash]
Surpassing the Prediction from Moore's Law - DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
[Figure: density [Gb] (log scale, 0.1 to 100) versus year (2000, 2005, 2010); series: FLASH (two-fold density per year, with MLC 2 bits/cell and MLC 3 bits/cell), DRAM and the Moore's Law line; density labels 512 Mb, 1 Gb, 2 Gb, 4 Gb and 12 Gb]
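The difference between the two growth models compounds quickly. The sketch below compares them, reading the Moore's Law doubling period on the slide as 1.5 years (an assumption about the original figure) against the one-year doubling quoted for NAND Flash. The function name is illustrative only.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Density multiplier after `years` when density doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


for years in (3, 6):
    moore = growth_factor(years, doubling_period_years=1.5)  # assumed Moore's Law pace
    flash = growth_factor(years, doubling_period_years=1.0)  # pace quoted for NAND Flash
    print(f"after {years} years: Moore's Law x{moore:.0f}, NAND Flash x{flash:.0f}")
```

After six years the yearly-doubling curve is already four times ahead of the 1.5-year curve.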
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will continue to lead overall memory technology and should reach 8 Gb density within ten years.
High-density memory growth will surpass the prediction from Moore's Law.
Overall memory prediction roadmap - continued
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point?
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh.
• During 2000, the energy consumption of all PCs installed in the USA was 10% of total energy production.
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh.
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai '95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: power (W), log scale from 0.01 to 100, versus year (1980-1995); trend line: x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage [V] (axis 0 to 2.5), power per chip [W] and VDD current [A] versus year, 1998-2014; curves: voltage, power, current]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA) and the Taiwan Semiconductor Industry Association (TSIA).
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
Your car starter
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are summarized by
$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement.
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed.
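The quadratic effect of the supply voltage follows directly from the switching-power term of the formula, the product of switching activity, load capacitance, the square of Vdd and the clock frequency. The sketch below evaluates that term with hypothetical component values (none of them come from the slides) to show that halving Vdd cuts switching power by four, while halving the load capacitance cuts it by two.

```python
def switching_power_w(p_t: float, c_load_f: float, v_dd: float, f_clk_hz: float) -> float:
    """Capacitive switching power: p_t * C_L * Vdd^2 * f_CLK, in watts."""
    return p_t * c_load_f * v_dd ** 2 * f_clk_hz


# Hypothetical operating point, chosen only for illustration.
P_T = 0.2        # switching activity (probability of a power-consuming transition)
C_LOAD = 1e-9    # total switched load capacitance, farads
V_DD = 3.3       # supply voltage, volts
F_CLK = 100e6    # clock frequency, hertz

base = switching_power_w(P_T, C_LOAD, V_DD, F_CLK)
half_vdd = switching_power_w(P_T, C_LOAD, V_DD / 2, F_CLK)
half_cap = switching_power_w(P_T, C_LOAD / 2, V_DD, F_CLK)

print(f"halving Vdd reduces switching power by x{base / half_vdd:.1f}")
print(f"halving C_L reduces switching power by x{base / half_cap:.1f}")
```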
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], with supply voltages of 0.6, 3.0 and 5.0 V marked]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is:
decreasing the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
General Approaches to Reduce Power
a) Reduction in fCLK is an option that is acceptable when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product ptCL is called the average switched capacitance per cycle; this capacitance can be reduced at the system, architectural, RTL, circuit or technology level.
Low Power and Low Energy System Design
(arrow labels: higher impact, more options)
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology.
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention.
This technique is routinely used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4) that supply the on-chip frequencies f11, f12, f21, f31, f32 and f41]
Energy Minimization Using Multiple Frequency
[Figure, PLL based: phase detector (inputs CLKREF and CLKFB, outputs up and down), current pump, loop filter, VCO and divider by N, providing a regulated voltage and clock to the digital system]
[Figure, DLL based: phase detector (inputs CLKREF and CLKFB, outputs up and down), current pump, loop filter and control driving a VCDL built from delay cells DC1, DC2, ..., DCn (in, out; delay TVCDL), with clock distribution & frequency multiplier logic producing f1, 2f1, ..., nf1 for the digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure, clock gating: a latch captures the enable signal and gates the clock; the resulting gated clock drives the target flip-flops (DFF with D, C and Q), which are activated or deactivated accordingly]
[Figure, clock distribution: a PLL (Clk generator) distributes Clk to blocks A, B, C and D, gated by Enable_A, Enable_B and Enable_C]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
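A small behavioral model makes the effect of clock gating concrete. The sketch below assumes the common latch-based arrangement, in which the enable is captured while the clock is low and held while it is high, and the gated clock is the AND of the clock and the latched enable. The slides name the latch, enable, clock and gated-clock signals but do not spell out this behavior, so treat the model and its function names as assumptions.

```python
def gated_clock(clock: list[int], enable: list[int]) -> list[int]:
    """Latch-based clock gating: gated = clock AND (enable latched while clock is low)."""
    latched = 0
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en          # latch is transparent during the low phase
        out.append(clk & latched)  # during the high phase the latched value is held
    return out


def rising_edges(wave: list[int]) -> int:
    """Number of 0 -> 1 transitions, i.e. clock pulses delivered to the module."""
    return sum(1 for prev, cur in zip(wave, wave[1:]) if prev == 0 and cur == 1)


# Four clock cycles, sampled twice per cycle (low phase, then high phase).
clock = [0, 1, 0, 1, 0, 1, 0, 1]
enable = [1, 1, 0, 1, 0, 0, 1, 0]

gated = gated_clock(clock, enable)
print("gated clock:", gated)
print("pulses without gating:", rising_edges(clock))
print("pulses with gating:   ", rising_edges(gated))
```

Every suppressed pulse is a cycle in which the module's clock network and flip-flops do not switch, which is where the energy saving comes from.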
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip, as a less aggressive approach, is attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
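The voltage assignment described above can be sketched as a simple rule: a module gets the low supply only if it still meets its timing budget when slowed down. Everything numeric in this sketch is hypothetical, including the two supply levels, the delay multiplier at the low supply and the module data. Energy per operation is taken as switched capacitance times Vdd squared.

```python
from dataclasses import dataclass

V_HIGH = 3.3             # hypothetical high supply, volts
V_LOW = 2.0              # hypothetical low supply, volts
SLOWDOWN_AT_V_LOW = 1.8  # hypothetical delay multiplier when running at V_LOW


@dataclass
class Module:
    name: str
    delay_ns_at_v_high: float  # delay when powered from V_HIGH
    time_budget_ns: float      # time available to this module
    switched_cap_pf: float     # switched capacitance per operation


def pick_supply(module: Module) -> float:
    """Use V_LOW only when the slowed-down module still meets its time budget."""
    slowed = module.delay_ns_at_v_high * SLOWDOWN_AT_V_LOW
    return V_LOW if slowed <= module.time_budget_ns else V_HIGH


def energy_pj(module: Module, v_dd: float) -> float:
    """Energy per operation: C * Vdd^2 (pF * V^2 = pJ)."""
    return module.switched_cap_pf * v_dd ** 2


modules = [
    Module("critical_alu", delay_ns_at_v_high=10.0, time_budget_ns=10.0, switched_cap_pf=50.0),
    Module("noncritical_ctrl", delay_ns_at_v_high=4.0, time_budget_ns=10.0, switched_cap_pf=20.0),
]

single_supply = sum(energy_pj(m, V_HIGH) for m in modules)
dual_supply = sum(energy_pj(m, pick_supply(m)) for m in modules)
print(f"single supply: {single_supply:.1f} pJ, dual supply: {dual_supply:.1f} pJ")
```

The critical module keeps the high supply and its timing, while the noncritical module trades unused slack for lower energy.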
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure, power manager: an observer and a controller; the observer receives observations and workload information from the system, and the controller issues commands to the system]
[Figure, power state machine: RUN (P = 400 mW), IDLE (P = 50 mW) and SLEEP (P = 0.16 mW); IDLE waits for an interrupt and SLEEP waits for a wake-up event; transition times of ~10 µs, ~90 µs and 160 ms]
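The power state machine lends itself to a quick energy estimate. The sketch below uses the three power levels from the figure, read as 400 mW, 50 mW and 0.16 mW, and the 160 ms transition, assumed here to be the wake-up time from SLEEP. The policy threshold, the workload schedule and the function names are hypothetical, and transition energy is ignored.

```python
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}  # power levels from the figure
WAKEUP_FROM_SLEEP_S = 0.160     # assumed meaning of the 160 ms transition
SLEEP_THRESHOLD_FACTOR = 10.0   # hypothetical policy parameter


def choose_low_power_state(expected_idle_s: float) -> str:
    """Enter SLEEP only when the idle period dwarfs the wake-up latency."""
    if expected_idle_s >= SLEEP_THRESHOLD_FACTOR * WAKEUP_FROM_SLEEP_S:
        return "SLEEP"
    return "IDLE"


def energy_mj(schedule: list[tuple[str, float]]) -> float:
    """Energy in millijoules for a list of (state, seconds); transitions are ignored."""
    return sum(POWER_MW[state] * seconds for state, seconds in schedule)


# Ten seconds of wall-clock time with only two seconds of useful work (hypothetical).
always_run = energy_mj([("RUN", 10.0)])
managed = energy_mj([("RUN", 2.0), ("IDLE", 3.0), ("SLEEP", 5.0)])

print(f"always RUN: {always_run:.1f} mJ, power-managed: {managed:.1f} mJ")
print("0.5 s idle ->", choose_low_power_state(0.5))
print("5.0 s idle ->", choose_low_power_state(5.0))
```

A real power manager would also account for the energy and latency of each transition, which is what makes the choice between IDLE and SLEEP a policy decision rather than a free saving.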
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure, power breakdown: pie chart with the categories clock, memory, control, I/O and datapath]
[Figure, dynamic instruction statistics: pie chart with the categories arithmetic op, data move, control flow, compare op, logical op and others]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First generation, 1971-78: behind the power curve
• Second generation, 1979-85: becoming "real" computers
• Third generation, 1985-89: challenging the "establishment"
• Fourth generation, 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation: 1971-78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
- Narrow datapaths (= slow performance)
- Awkward architectures
- Assembly language + some BASIC
• Processors
- Intel 4004, 8008, 8080, 8086
- Zilog Z-80
- Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
- 3500 transistors
- First microprocessor-based computer (Micral): targeted at laboratory instrumentation, mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
- Delivered in 1974
- 4800 transistors
- Performance < 0.2 MIPS
• Used in the Altair 8800 system
- Kit form (advertised in Popular Electronics) in 1975: $297, or $395 with case
- 256 bytes of memory, expandable to 64K
- Keyboard and floppy
- 100-line bus becomes S-100, the first microcomputer bus
- Gates & Allen write BASIC
- Wozniak builds one
• Homebrew Computer Club
Intel 8086
• Introduced in 1978
- Performance < 0.5 MIPS
• New 16-bit architecture
- "Assembly language" compatible with 8080
- 29,000 transistors
- Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
- Based on the 8088, an 8-bit bus version of the 8086
Second Generation: 1979-85
Becoming "real" computers
- First 32-bit architecture (68000)
- First virtual memory support
- Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
- Motorola 68000, 68020
- Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
- First 32-bit architecture (initial 16-bit implementation)
- First flat 32-bit address; support for paging
- General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
- 68,000 transistors
- < 1 MIPS
• Used in
- Apple Mac
- Sun, Silicon Graphics & Apollo workstations
Third Generation: 1985-89
Challenging the "establishment"
- Microprocessors surpass minicomputers in performance, rival mainframes
- Implementation technology of choice: all new architectures are microprocessors
- RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
- MIPS R2000, R3000
- Sun SPARC
- HP PA-RISC
MIPS R2000
• Several firsts
- First RISC microprocessor
- First microprocessor to provide integrated support for instruction & data cache
- First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
- 125,000 transistors
- 5-8 MIPS
Fourth Generation: 1990-
Architectural and performance leadership
- First 64-bit architecture
- First multiple-issue machine
- First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
- Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
- Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
- True from 1985 to present
• Combination of technology and architectural enhancements
- Technology provides faster transistors and more of them
- Faster transistors lead to high clock rates
- More transistors: architectural ideas turn transistors into performance
- Responsible for about half the yearly performance growth
• Two key architectural directions
- Sophisticated memory hierarchies
- Exploiting instruction-level parallelism
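A yearly improvement factor compounds dramatically over a decade. The sketch below reads the rate on the slide as 1.6x per year (an assumption, since the decimal point appears to have been lost in extraction) and computes the cumulative gain and the implied doubling time. The function names are illustrative.

```python
import math


def relative_performance(years: float, annual_factor: float = 1.6) -> float:
    """Performance relative to the starting year after compounding `annual_factor`."""
    return annual_factor ** years


def doubling_time_years(annual_factor: float = 1.6) -> float:
    """Years needed for performance to double at the given yearly factor."""
    return math.log(2.0) / math.log(annual_factor)


for years in (5, 10, 15):
    print(f"after {years:2d} years: x{relative_performance(years):,.0f}")
print(f"doubling time: {doubling_time_years():.2f} years")
```

At this pace performance doubles roughly every year and a half, and ten years of compounding gives a gain of about two orders of magnitude.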
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
- CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
- On-chip: from 128 bytes (1984) to 100K+ bytes
- Multilevel caches add another level of caching
  First multilevel cache: 1986
  Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
- Reduce or hide cache miss latencies
  early restart after cache miss (1992)
  nonblocking caches: continue during a cache miss (1994)
- Cache-aware combos: computers, compilers, code writers
  prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
- Overlapping execution in a pipeline
- Issuing multiple instructions per clock
  superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
- determine parallelism dynamically
- execute instructions out of order
- speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded a 20-year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
- On-chip
- Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
- Deep pipeline
- 1.4M transistors
- Initially 100 MHz
- > 50 MIPS
Intel i860
• First multiple-issue microprocessor
- 2 instructions/clock
- Dual issue mode
- Novel push pipeline
- Novel cache bypass
• Implemented in 1991
- 1.3M transistors
- 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
- Instructions scheduled and executed out of order
- Up to 4 instructions can complete per clock
- Window of 32 instructions (up to 32 in flight)
- Maintains precise state by completing instructions in order
• Implemented in 1996
- 6.8M transistors
- 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
- Uses a compiler-centric approach while avoiding its disadvantages
- Parallelism demarcated by the compiler
- Many special instructions & features for exploiting ILP in the compiler
• Itanium
- First implementation (2001)
- 25M transistors
- 800 MHz
- 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software techniques
- Static scheduling
- Static issue (i.e. VLIW)
- Static branch prediction
- Alias/pointer analysis
- Static speculation
Hardware techniques
- Dynamic scheduling
- Dynamic issue (i.e. superscalar)
- Dynamic branch prediction
- Dynamic disambiguation
- Dynamic speculation
Wide variety of approaches, both hardware- and compiler-intensive
Software-intensive approaches: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware-intensive approaches: more stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present
Big Picture - ILP and Memory Systems
ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation
Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
Per
form
ance
01
1
10
100
1965 1970 1975 1980 1985 1990 1995
Supercomputers
Minicomputers
Mainframes
Microprocessors
Microprocessors today where Microprocessors today where they are and what can dothey are and what can do
Tran
sist
ors
1000
10000
100000
1000000
10000000
100000000
1970 1975 1980 1985 1990 1995 2000 2005
Bit-level parallelism Instruction-level Thread-level ()
i4004
i8008i8080
i8086
i80286
i80386
R2000
Pentium
R10000
R3000
Microprocessors where they goMicroprocessors where they go
Intel more TransistorIntel more Transistor
Intel Faster DevicesIntel Faster Devices
Number of Transistors in Number of Transistors in Intelrsquos processorsIntelrsquos processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyper-Threading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
They support 2-4 threads, but expect only about a 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
4 – 8 general-purpose processing engines on chip, used to execute
– independent programs
– explicitly parallel programs (when possible)
– speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures

Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view on the question: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education – not just the undergraduate curriculum – organized to support today's graduate for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experience is primarily in information technology (both in academia and in industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change – so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
But, as we said earlier, engineering is changing
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT
Discrete mathematics, not continuous mathematics, is the underpinning of IT
It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast
Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields
More knowledge of the full spectrum will be a fundamental
Engineering is global and is performed in a holistic business context
The engineer must design under constraints that include global cultural and business contexts, and so must understand them
These too are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he had never seen a process that could not be sped up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Processor-DRAM Gap
Microprocessor performance improved 35% per year until 1986 and 55% per year since 1987
Memory technology improvements aim primarily at increasing DRAM capacity, not DRAM speed
Relative processor/memory speed
Types of Memories
MOS memories
– RAMs (volatile: contents lost at power off): DRAM, SRAM
– ROMs (non-volatile: contents kept at power off): ROM, EPROM, EEPROM, FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 – 4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
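The "three out of every four cycles" remark can be checked with a short calculation. This is only a sketch: it assumes a productive cycle retires exactly one instruction, which is a simplification for these multiple-issue processors, and the function name is ours.

```python
def idle_cycle_fraction(cpi: float) -> float:
    """Fraction of cycles that retire no instruction.

    Assumption (not from the study): every productive cycle retires
    exactly one instruction, so 1/cpi of all cycles are productive.
    """
    if cpi < 1.0:
        raise ValueError("this simple model needs cpi >= 1")
    return 1.0 - 1.0 / cpi


# CPI range quoted for the TPC-C measurements.
for cpi in (4.2, 4.5):
    print(f"CPI {cpi}: {idle_cycle_fraction(cpi):.0%} of cycles retire nothing")
```

Under this assumption both ends of the measured range land slightly above three quarters, which agrees with the statement on the slide.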
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up – 32 bytes every 2 ns
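The pin-bandwidth arithmetic can be reproduced in a few lines. This is a minimal sketch; the helper name is ours, and 1 GByte is taken as 10^9 bytes.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer: float, period_ns: float) -> float:
    """Bytes per nanosecond equals GBytes per second (with 1 GByte = 1e9 bytes)."""
    return bytes_per_transfer / period_ns


today = bandwidth_gbytes_per_s(16, 8)   # 16 bytes every 8 ns
# Twice the clock rate, twice the issue width, two instruction streams.
future = today * 2 * 2 * 2

print(f"today:  {today:g} GBytes/s")
print(f"future: {future:g} GBytes/s")
print(f"32 bytes every 2 ns: {bandwidth_gbytes_per_s(32, 2):g} GBytes/s")
```

The three factors of two compound to a factor of eight, which is why the required bandwidth corresponds to 32 bytes every 2 ns.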
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~ 32 – 128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
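To see what the two cache levels buy, the figures quoted above can be put into the standard average-memory-access-time formula. This is a sketch under stated assumptions: the formula is the usual textbook one rather than something given on the slides, the hit rates are taken as exactly 95% and 50%, and the function name is ours.

```python
def average_access_cycles(l1_cycles, l1_hit_rate, l2_cycles, l2_hit_rate, memory_cycles):
    """Textbook AMAT: every access pays L1; L1 misses pay L2; L2 misses pay memory."""
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * memory_cycles)


# Fast end and slow end of the latency ranges quoted for L1 and L2.
best = average_access_cycles(2, 0.95, 6, 0.50, 100)
worst = average_access_cycles(3, 0.95, 10, 0.50, 100)

print(f"average access time: {best:.1f} to {worst:.1f} cycles")
print("without caches: about 100 cycles per access")
```

Even with only half of the first-level misses caught by the second level, the average access stays within a handful of cycles instead of about 100.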
Conclusions concerning memory problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems – caches, buses, bandwidth and latency
Conclusions concerning memory problems - continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM) – a new type of memory
Magnetic RAM (MRAM) or Nanotech RAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm – just a few thousand atoms – in diameter on a silicon wafer
MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM)
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM)
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] (scale 0-120; 100 nm, 90 nm, 70 nm, 55 nm) and density [Gb] (log scale, 0.1-100) of DRAM and FLASH for 2003, 2005 and 2010; density labels run from 512 Mb through 1, 2, 4 and 8 Gb to 16 Gb, with FLASH shown as SLC, MLC (2 bits/cell) and MLC (3 bits/cell)]
Surpassing the Prediction from Moore's Law – DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year
[Figure: density [Gb] (log scale, 0.1-100) versus year, 2000-2010, for FLASH (two-fold density per year; SLC, MLC 2 bits/cell, MLC 3 bits/cell) and DRAM, with the Moore's Law trend line; labelled points include 512 Mb, 1 Gb, 2 Gb, 4 Gb and 12 Gb]
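The difference between the two doubling periods compounds quickly. The sketch below compares them over a six-year span; the span is chosen only for illustration and the function name is ours.

```python
def density_growth(years: float, doubling_period_years: float) -> float:
    """Growth factor after `years` when density doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


years = 6  # illustrative span, not taken from the slides
moore = density_growth(years, 1.5)   # doubling every 1.5 years
flash = density_growth(years, 1.0)   # doubling every year

print(f"after {years} years: Moore's Law x{moore:g}, new growth model x{flash:g}")
```

Over six years the yearly-doubling model is already four times ahead of the 1.5-year doubling, and the gap itself keeps doubling every three years.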
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still lead overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction of Moore's Law
Overall memory prediction roadmap - continued
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point?
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production
• For 2015, one expects the energy consumption of all PCs to be 15% greater than in 1995, i.e. 69 × 10^6 MWh
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai '95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: power (W), log scale from 0.01 to 100, versus year (1980-1995); trend of x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage [V] (0-2.5), power per chip [W] (0-200) and VDD current [A] (0-500) versus year, 1998-2014]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA) and the Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits, together with a minor static term, are
$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 – capacitive switching power (dynamic, dominant)
P2 – short-circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
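The quadratic effect of the supply voltage follows directly from the switching-power term pt·CL·Vdd²·fCLK. The sketch below evaluates it for one operating point; the numerical values (activity 0.2, 1 nF switched load, 3.3 V, 100 MHz) are purely illustrative assumptions, not data from the slides.

```python
def switching_power(p_t: float, c_load: float, v_dd: float, f_clk: float) -> float:
    """Capacitive switching power P_sw = p_t * C_L * V_dd^2 * f_CLK (watts for SI inputs)."""
    return p_t * c_load * v_dd ** 2 * f_clk


# Hypothetical operating point, chosen only to make the ratios visible.
base = switching_power(p_t=0.2, c_load=1e-9, v_dd=3.3, f_clk=100e6)
half_voltage = switching_power(p_t=0.2, c_load=1e-9, v_dd=1.65, f_clk=100e6)
half_load = switching_power(p_t=0.2, c_load=0.5e-9, v_dd=3.3, f_clk=100e6)

print(f"half the supply voltage: {half_voltage / base:.2f} of the original power")
print(f"half the load capacitance: {half_load / base:.2f} of the original power")
```

Halving the voltage cuts the switching power to a quarter, while halving the capacitance or the switching activity only halves it.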
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], from 0.6 V to 5.0 V]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product pt·CL is called the average switched capacitance per cycle; this capacitance can be reduced at the system, architectural, RTL, circuit or technology level
Low Power and Low Energy System Design
The design of low power circuits can be tackled at different levels, from system to technology (the higher the level, the higher the impact and the more options):
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
Multiple frequencies on the chip, as a less aggressive approach, are attracting more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL generates fCLK and feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), which produce the local clock frequencies f11, f12, f21, f31, f32 and f41]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation (phase detector with up/down outputs, current pump, loop filter, VCO and divide-by-N feedback between CLKREF and CLKFB, with regulated voltage and a clock distribution and frequency multiplier stage feeding the digital system) and DLL-based clock generation (phase detector, current pump, loop filter and control of a voltage-controlled delay line, VCDL, with delay cells DC1 to DCn and logic producing f1, 2f1, ..., nf1 for the digital system)]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating (an enable signal held in a latch is combined with the clock to produce the gated clock that drives the target flip-flops; waveforms show activated and deactivated intervals) and clock distribution (a PLL clock generator distributes Clk to modules A, B, C and D, gated by Enable_A, Enable_B and Enable_C)]
The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
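A small behavioural model shows how gating suppresses clock edges for an idle module. This is a sketch only: it assumes the common latch-based scheme (enable captured while the clock is low, then ANDed with the clock), and the waveforms and function names are hypothetical.

```python
def gated_clock(clock, enable):
    """Latch-based clock gating on sampled 0/1 waveforms.

    The latch is transparent while the clock is low, so the enable can
    only change the gate between pulses; output = clock AND latched enable.
    """
    latched = 0
    gated = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en
        gated.append(clk & latched)
    return gated


def rising_edges(signal):
    """Count 0 -> 1 transitions, i.e. the edges that clock the target flip-flops."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)


# Hypothetical waveforms: 8 clock periods, module idle during the middle of the run.
clock = [0, 1] * 8
enable = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
gated = gated_clock(clock, enable)

print("free-running clock edges:", rising_edges(clock))
print("gated clock edges:       ", rising_edges(gated))
```

Only the edges that fall inside the enabled intervals reach the flip-flops, so the clock load of the idle module switches correspondingly less often.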
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
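The benefit of assigning a lower voltage to noncritical modules can be estimated from the C·V² dependence of switching energy. The following sketch uses a made-up three-module design with arbitrary capacitance units and two hypothetical supply levels; level-converter overhead is ignored.

```python
def switching_energy(modules, voltage_for):
    """Sum of C * V^2 over all modules; `voltage_for` picks each module's supply."""
    return sum(m["cap"] * voltage_for(m) ** 2 for m in modules)


# Hypothetical design: capacitances in arbitrary units.
modules = [
    {"name": "datapath", "cap": 4.0, "critical": True},
    {"name": "control", "cap": 2.0, "critical": False},
    {"name": "io", "cap": 2.0, "critical": False},
]
V_HIGH, V_LOW = 3.3, 2.0  # assumed supply levels

single = switching_energy(modules, lambda m: V_HIGH)
dual = switching_energy(modules, lambda m: V_HIGH if m["critical"] else V_LOW)

print(f"dual-voltage energy is {dual / single:.0%} of the single-voltage energy")
```

The critical datapath keeps its timing at the high voltage, and all of the saving comes from the modules that can tolerate the slower, lower supply.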
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
[Figure: power manager (an observer collects workload information as observations from the system; a controller issues commands to it) and power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt) and SLEEP (P = 0.16 mW, wait for wake-up event); transitions between RUN and IDLE take ~10 µs, entering SLEEP takes ~90 µs, and returning from SLEEP to RUN takes 160 ms]
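Using the three power levels quoted for the power state machine (400 mW, 50 mW and 0.16 mW), the energy of different power-management policies can be compared. This is a rough sketch: the 10-second workload is hypothetical, and transition times and transition energy are ignored.

```python
# Power per state in milliwatts, as quoted for the power state machine.
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def energy_mj(schedule):
    """Energy in millijoules for a list of (state, seconds) pairs.

    Assumption: state transitions are instantaneous and free.
    """
    return sum(POWER_MW[state] * seconds for state, seconds in schedule)


# Hypothetical workload: 1 s of useful work in a 10 s window.
always_on = energy_mj([("RUN", 10.0)])
idle_policy = energy_mj([("RUN", 1.0), ("IDLE", 9.0)])
sleep_policy = energy_mj([("RUN", 1.0), ("SLEEP", 9.0)])

for name, value in [("always RUN", always_on),
                    ("RUN + IDLE", idle_policy),
                    ("RUN + SLEEP", sleep_policy)]:
    print(f"{name:12s} {value:8.2f} mJ")
```

A real power manager also has to weigh the long wake-up time from SLEEP against the expected length of the idle period, which this simple comparison leaves out.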
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: power breakdown by clock, memory, control, I/O and datapath; dynamic instruction statistics by arithmetic, logical, compare, data move, control flow and other operations]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First generation: 1971-78 – Behind the power curve
• Second generation: 1979-85 – Becoming "real" computers
• Third generation: 1985-89 – Challenging the "establishment"
• Fourth generation: 1990- – Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation: 1971-78
Getting enough bits and transistors
Transistor counts < 50,000
Performance < 0.5 MIPS
Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
First general-purpose single-chip microprocessor
Shipped in 1971
8-bit architecture, 4-bit implementation
2,300 transistors
Performance < 0.1 MIPS
8008: 8-bit implementation in 1972
– 3,500 transistors
– First microprocessor-based computer (Micral)
   Targeted at laboratory instrumentation
   Mostly sold in Europe
Intel 8080
8-bit architecture with a 16-bit address bus
– Delivered in 1974
– 4,800 transistors
– Performance < 0.2 MIPS
Used in the Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
   $297, or $395 with case
   256 bytes of memory, expandable to 64K
   Keyboard and floppy
   100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
   Homebrew Computer Club
Intel 8086
Introduced in 1978
– Performance < 0.5 MIPS
New 16-bit architecture
– "Assembly language" compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
In 1981, IBM introduces the PC
– Based on the 8088, an 8-bit bus version of the 8086
Second Generation: 1979-85
Becoming "real" computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs and PCs based on microprocessors
Transistors > 50,000
Performance <= 1 MIPS
Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
Major architectural step in microprocessors
– First 32-bit architecture
   initial 16-bit implementation
– First flat 32-bit address
   Support for paging
– General-purpose register architecture
   Loosely based on PDP-11
First implementation in 1979
– 68,000 transistors
– < 1 MIPS
Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation: 1985-89
Challenging the "establishment"
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
Transistors < 500K
Performance > 5 MIPS
Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation: 1990-
Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
Transistors > 1M
Clock rates > 100 MHz
Performance > 50 MIPS
Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
Generation 4.5 – same basic approach, but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
Increase performance at 1.6x per year
– True from 1985 to the present
Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to higher clock rates
– More transistors: architectural ideas turn transistors into performance
   Responsible for about half the yearly performance growth
Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism
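A growth rate of 1.6x per year compounds to roughly two orders of magnitude per decade. The sketch below shows the arithmetic; splitting the yearly factor into two equal multiplicative parts for technology and architecture is our reading of "about half", stated here as an assumption.

```python
import math


def cumulative_speedup(rate_per_year: float, years: int) -> float:
    """Compound growth: performance multiplies by `rate_per_year` every year."""
    return rate_per_year ** years


decade = cumulative_speedup(1.6, 10)

# Assumption: "about half the yearly growth" means two equal multiplicative
# factors, one from technology and one from architecture.
per_source = math.sqrt(1.6)

print(f"1.6x per year over 10 years: about {decade:.0f}x")
print(f"equal split: about {per_source:.2f}x per year from each source")
```

With an equal split, each of the two sources contributes a bit more than a quarter of improvement per year, and their product gives back the 1.6x.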
Memory Hierarchies
Caches hide the latency of DRAM and increase bandwidth
– CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching
   First multilevel cache: 1986
   Secondary cache sizes today: 128 KB to 4-16 MB
Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
   early restart after cache miss (1992)
   nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
   prefetching: instruction to bring data into cache early
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
   superscalar: uses dynamic issue decision (HW driven)
   VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out of order
– speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded a 20-year path
MIPS R4000
First 64-bit architecture
Integrated caches
– On-chip
– Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
Implemented in 1991
– 1.3M transistors
– 50 MIPS
Used primarily as attached processor (e.g. graphics)
MIPS R10000
First speculative processor
– Instructions scheduled and executed out of order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in flight)
– Maintains precise state by completing instructions in order
Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
EPIC architecture
– Uses a compiler-centric approach while avoiding its disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Todayrsquos Uniprocessor ILP Todayrsquos Uniprocessor ILP MenuMenu
Software Techniquesndash Static schedulingndash Static issue (ie VLIW)ndash Static branch predictionndash Aliaspointer analysisndash Static speculation
Hardware Techniquesndash Dynamic schedulingndash Dynamic issue (ie superscalar)ndash Dynamic branch predictionndash Dynamic disambiguationndash Dynamic speculation
Wide variety of approaches both hardware and compiler intensive
Lower hardware complexity
More longer range analysis
More machine dependence
Lower hardware complexity
More longer range analysis
More machine dependence
More stable performance
Higher complexity
Potential clock rate impact
More stable performance
Higher complexity
Potential clock rate impact
No clear cut winners at the present
Big Picture: ILP and Memory Systems
[Figure: "ILP Mountain" with stages simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation; "Cache Mountain" with stages simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistor count (log scale, 1,000 to 100,000,000) versus year, 1970-2005, spanning the bit-level, instruction-level and thread-level (?) parallelism eras; labeled processors: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to achieve higher performance (throughput) at better energy efficiency
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a (single) processor without much effort?
Simultaneous multithreading (SMT), or Hyper-Threading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
Support for 2-4 threads, but expect to get only 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
4–8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible); speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view concerning the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education – not just the undergraduate curriculum – organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change – so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
But, as we said earlier, engineering is changing
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT
Discrete mathematics, not continuous math, is the underpinning of IT
It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast
Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields
More knowledge of the full spectrum will be a fundamental
Engineering is global and is performed in a holistic business context
The engineer must design under constraints that include global cultural and business contexts, and so must understand them
These too are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Relative processor/memory speed
Types of Memories
MOS memories
– RAMs: DRAM, SRAM (volatile: contents lost at power off)
– ROMs: ROM, EPROM, EEPROM, FLASH (non-volatile: contents kept at power off)
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 – 4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
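The step from "4.2 – 4.5 cycles per instruction" to "three out of every four cycles retired nothing" can be checked with a few lines of arithmetic. The sketch below is only an illustration; the function name is hypothetical, and the bound assumes nothing beyond the definition of CPI.

```python
def zero_retire_fraction(cpi: float) -> float:
    """Fraction of cycles in which no instruction retires.

    N instructions take N * cpi cycles, and at most N of those cycles
    can retire something. The result is exact if at most one instruction
    retires per cycle, and a lower bound otherwise.
    """
    if cpi < 1.0:
        raise ValueError("this simple model assumes CPI >= 1")
    return 1.0 - 1.0 / cpi


for cpi in (4.2, 4.5):
    share = zero_retire_fraction(cpi)
    print(f"CPI {cpi}: at least {share:.0%} of cycles retire nothing")
```

Both measured values give a share a little above 75%, which matches the "three out of four" statement.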
An anecdote (continued)
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns
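The pin-bandwidth figures in this anecdote follow from simple multiplication. A minimal sketch, with a hypothetical helper name, using only the numbers quoted on the slide:

```python
def bandwidth_gbytes_per_s(bytes_per_transfer: float, ns_per_transfer: float) -> float:
    """Bytes per nanosecond is numerically equal to GBytes/s (10^9 bytes/s)."""
    return bytes_per_transfer / ns_per_transfer


today = bandwidth_gbytes_per_s(16, 8)     # 16 bytes every 8 ns
# Twice the clock rate, twice the issue width, two instruction streams.
future = today * 2 * 2 * 2
pins_needed = bandwidth_gbytes_per_s(32, 2)  # 32 bytes every 2 ns

print(f"today:  {today} GBytes/s")
print(f"future: {future} GBytes/s (same as {pins_needed} GBytes/s at the pins)")
```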
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~32 – 128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
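To see how much the two cache levels matter, the figures quoted above can be combined into an average access time. This is a sketch under simplifying assumptions (each miss pays the full latency of the next level, no overlap between accesses); the function name is hypothetical and the inputs are the upper ends of the ranges given on the slides.

```python
def average_access_cycles(l1_cycles: float, l1_hit_rate: float,
                          l2_cycles: float, l2_hit_rate: float,
                          memory_cycles: float) -> float:
    """Average memory access time for a two-level cache hierarchy."""
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    # Every access pays L1; L1 misses pay L2; L2 misses pay main memory.
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * memory_cycles)


with_caches = average_access_cycles(
    l1_cycles=3, l1_hit_rate=0.95,    # "two to three cycles", "about 95%"
    l2_cycles=10, l2_hit_rate=0.50,   # "six to ten cycles", "over 50%"
    memory_cycles=100,                # "about 100 cycles"
)
print(f"average access: {with_caches:.1f} cycles, versus 100 cycles without caches")
```

Under these assumptions the average access costs about 6 cycles, which is why the structure of the hierarchy dominates performance.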
As conclusion concerning memory: problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems: caches, buses, bandwidth and latency
As conclusion concerning memory: problems (continued)
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One can expect that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System-level parameters most affect performance
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM): a new type of memory
Magnetic RAM (MRAM) or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm – just a few thousand atoms – in diameter, on a silicon wafer
MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM)
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM)
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule (nm, 0 to 120) and density (Gb, log scale 0.1 to 100) versus year (2003, 2005, 2010) for FLASH and DRAM; design rules 100 nm, 90 nm, 70 nm and 55 nm; densities up to 16 Gb; FLASH shown as SLC, MLC (2 bits/cell) and MLC (3 bits/cell)]
Surpassing the Prediction from Moore's Law – DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year
[Figure: density (Gb, log scale 0.1 to 100) versus year, 2000-2010, for FLASH (2-fold density per year; MLC 2 bits/cell and MLC 3 bits/cell), DRAM and the Moore's Law trend line]
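The gap between the two growth models compounds quickly. The sketch below compares a 1.5-year doubling period with a 1-year doubling period; the starting density of 1 Gb and the six-year horizon are illustrative assumptions, not values read from the chart.

```python
def density_after(start_gb: float, years: float, doubling_period_years: float) -> float:
    """Density after `years` of exponential growth with the given doubling period."""
    return start_gb * 2 ** (years / doubling_period_years)


START_GB = 1.0   # assumed starting density
YEARS = 6        # assumed horizon

moore = density_after(START_GB, YEARS, doubling_period_years=1.5)
flash = density_after(START_GB, YEARS, doubling_period_years=1.0)
print(f"after {YEARS} years: {moore:.0f} Gb (1.5-year doubling) "
      f"versus {flash:.0f} Gb (yearly doubling)")
```

With these assumptions the yearly-doubling curve ends up four times higher after six years.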
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap (continued)
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline: Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point?
• During 1995, the energy consumption of all PC machines installed in the USA was 60 x 10^6 MWh
• During 2000, the energy consumption of all PC machines installed in the USA was 10% of the total energy production
• During 2015, one expects that the energy consumption of all PC machines will be 15% greater than in 1995, or 69 x 10^6 MWh
Power consumption
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers" (Kuroda-Sakurai 95)
"By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced"
[Figure: power (W, log scale 0.01 to 100) versus year, 1980-1995; trend line: x4 every 3 years]
Power dissipation in time
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage (V), power per chip (W) and VDD current (A) versus year, 1998-2014]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA), Electronic Industries Association of Japan (EIAJ), Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 – capacitive switching power (dynamic, dominant)
P2 – short-circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
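The terms of this power equation can be evaluated directly. The sketch below is a minimal illustration; the function name is hypothetical and every component value in the example (activity factor, capacitance, currents) is an assumed number chosen only to show the relative size of the terms.

```python
def power_terms(p_t: float, c_load: float, v_dd: float, f_clk: float,
                i_sc: float, i_leak: float, p_static: float = 0.0):
    """Return (P1, P2, P3, P4) in watts for the CMOS power equation."""
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leak * v_dd                      # leakage power
    return p1, p2, p3, p_static             # P4 passed through unchanged


# Assumed example values: 10% activity, 1 nF switched load, 100 MHz clock.
at_2v = power_terms(p_t=0.1, c_load=1e-9, v_dd=2.0, f_clk=100e6,
                    i_sc=1e-3, i_leak=1e-4)
at_1v = power_terms(p_t=0.1, c_load=1e-9, v_dd=1.0, f_clk=100e6,
                    i_sc=1e-3, i_leak=1e-4)

print("P1..P4 at 2 V:", at_2v, "total:", sum(at_2v))
print("P1..P4 at 1 V:", at_1v, "total:", sum(at_1v))
```

Halving the supply voltage cuts P1 to one quarter while P2 and P3 only halve (with the currents held fixed, which is itself a simplification).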
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
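The "run it slower, use parallelism" idea combines with the quadratic voltage dependence as follows: N copies of a unit, each clocked at f/N, keep the throughput but can tolerate a lower supply voltage. The sketch below estimates the resulting power ratio from the switching-power formula alone. The function name, the voltages and the 15% capacitance overhead are assumptions for illustration; whether the lower voltage still meets timing at f/N depends on the delay-versus-voltage curve, which is not modeled here.

```python
def parallel_power_ratio(v_ref: float, v_low: float, n_units: int,
                         overhead: float = 0.0) -> float:
    """Switching power of an n-way parallel design relative to the reference.

    Reference: capacitance C, voltage v_ref, frequency f.
    Parallel:  capacitance n*C*(1 + overhead), voltage v_low, frequency f/n.
    """
    if n_units < 1 or v_ref <= 0 or v_low <= 0:
        raise ValueError("invalid parameters")
    capacitance_scale = n_units * (1.0 + overhead)
    frequency_scale = 1.0 / n_units
    voltage_scale = (v_low / v_ref) ** 2
    return capacitance_scale * voltage_scale * frequency_scale


ratio = parallel_power_ratio(v_ref=3.0, v_low=1.8, n_units=2, overhead=0.15)
print(f"2-way parallel design uses about {ratio:.0%} of the reference power")
```

Without a voltage reduction the ratio stays at 1 or above, so parallelism saves power only because it makes the lower supply voltage usable.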
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V), 0.6 V to 5.0 V]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low-activity conditions for various parts of the circuit and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
a) Reduction in f_CLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in V_dd is the most effective way to reduce power, since the power is proportional to the square of V_dd. The problem with reducing V_dd is that it leads to an increase in circuit delay
c) The product p_t C_L is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit or technology level
General Approaches to Reduce Power
Low Power and Low Energy System Design
[Figure: design levels, with higher impact and more options towards the top]
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology
Multiple Frequency on the Chip as Technique to Reduce Power
Multiple frequencies on the chip is a less aggressive approach which attracts more attention
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL generating fCLK and feeding four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4) that produce the local frequencies f11, f12, f21, f31, f32, f41]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation (phase detector, current pump, loop filter, VCO, divider by N; CLKREF and CLKFB inputs; clock distribution & frequency multiplier driving the digital system at a regulated voltage) and DLL-based clock generation (phase detector, current pump, loop filter, control logic, voltage-controlled delay line VCDL with delay cells DC1 ... DCn producing f1, 2f1, ..., nf1)]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating, where an enable signal captured by a latch (or D flip-flop) is combined with the clock to produce a gated clock for the target flip-flops; clock distribution, where a PLL clock generator drives modules A, B, C and D, with Enable_A, Enable_B and Enable_C activating or deactivating their clocks]
The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
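A latch-based clock gate can be modeled in a few lines. The sketch below is a behavioral illustration only, assuming the common arrangement in which the enable is captured by a latch that is transparent while the clock is low and the gated clock is the AND of the clock and the latched enable; the function names and the sample waveforms are hypothetical.

```python
def gated_clock(clk, enable):
    """Return the gated clock for sampled clk/enable waveforms (one sample per half cycle)."""
    latched = False
    out = []
    for c, en in zip(clk, enable):
        if not c:
            # Latch is transparent only while the clock is low, so enable
            # changes during the high phase cannot glitch the output.
            latched = bool(en)
        out.append(bool(c) and latched)
    return out


def rising_edges(signal):
    """Count low-to-high transitions, assuming the signal starts low."""
    previous = [False] + list(signal[:-1])
    return sum(1 for prev, cur in zip(previous, signal) if cur and not prev)


clk    = [0, 1, 0, 1, 0, 1, 0, 1]
enable = [1, 1, 0, 0, 0, 1, 1, 0]   # module needed in cycles 1 and 4 only

gclk = gated_clock(clk, enable)
print("clock edges without gating:", rising_edges(clk))
print("clock edges with gating:   ", rising_edges(gclk))
```

In this example the module sees 2 clock edges instead of 4, so the flip-flops behind the gate switch half as often.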
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
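[Figure: power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW) and SLEEP (P = 0.16 mW); RUN enters IDLE on "wait for interrupt" and SLEEP on "wait for wake-up event"; transition times ~10 µs between RUN and IDLE, ~90 µs into SLEEP, 160 ms from SLEEP back to RUN]
[Figure: power manager consisting of an OBSERVER and a CONTROLLER; the observer collects workload information (observations) from the SYSTEM and the controller issues commands to it]
Power Manager; Power State Machine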
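The benefit of moving between power states can be estimated from the state powers shown in the figure (400 mW, 50 mW, 0.16 mW). The sketch below is a rough illustration: the schedules are invented, the function name is hypothetical, and transition time and transition energy are ignored.

```python
# State powers in milliwatts, as given for the power state machine.
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def energy_mj(schedule):
    """Energy in millijoules for a list of (state, seconds) intervals.

    Simplification: transitions between states are treated as free.
    """
    return sum(POWER_MW[state] * seconds for state, seconds in schedule)


# Hypothetical workload: 2 s of useful work in every 10 s window.
always_run = energy_mj([("RUN", 10)])
run_then_idle = energy_mj([("RUN", 2), ("IDLE", 8)])
run_then_sleep = energy_mj([("RUN", 2), ("SLEEP", 8)])

print(f"always RUN:   {always_run:.2f} mJ")
print(f"RUN + IDLE:   {run_then_idle:.2f} mJ")
print(f"RUN + SLEEP:  {run_then_sleep:.2f} mJ")
```

A real power manager also has to weigh the wake-up latency of the deeper state against the expected idle time, which this sketch leaves out.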
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown (segments of 12, 16, 25 and 43) among clock, memory, control, I/O and datapath; dynamic instruction statistics by compare, logical, data move, control flow, arithmetic and other operations]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline: Microprocessors' Generations
• First Generation: 1971-78
  – Behind the power curve
• Second Generation: 1979-85
  – Becoming "real" computers
• Third Generation: 1985-89
  – Challenging the "establishment"
• Fourth Generation: 1990-
  – Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation: 1971-78
Getting enough bits and transistors
Transistor counts < 50,000
Performance < 0.5 MIPS
Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
First general-purpose single-chip microprocessor
Shipped in 1971
8-bit architecture, 4-bit implementation
2300 transistors
Performance < 0.1 MIPS
8008: 8-bit implementation in 1972
– 3500 transistors
– First microprocessor-based computer (Micral)
  Targeted at laboratory instrumentation
  Mostly sold in Europe
Intel 8080
Intel's first 16-bit architecture
– Delivered in 1974
– 4800 transistors
– Performance < 0.2 MIPS
Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
  $297, or $395 with case
  256 bytes of memory, expandable to 64K
  Keyboard and floppy
  100-line bus becomes S-100, first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
  Homebrew Computer Club
Intel 8086
Introduced in 1978
– Performance < 0.5 MIPS
New 16-bit architecture
– "Assembly language" compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
In 1981, IBM introduces PC
– Based on 8088: 8-bit bus version of 8086
Second Generation: 1979-85
Becoming "real" computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs and PCs based on microprocessors
Transistors > 50,000
Performance <= 1 MIPS
Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
Major architectural step in microprocessors
– First 32-bit architecture: initial 16-bit implementation
– First flat 32-bit address: support for paging
– General-purpose register architecture: loosely based on PDP-11
First implementation in 1979
– 68,000 transistors
– < 1 MIPS
Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation: 1985-89
Challenging the "establishment"
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
Transistors < 500K
Performance > 5 MIPS
Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation: 1990-
Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
Transistors > 1M
Clock rates > 100 MHz
Performance > 50 MIPS
Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
Generation 4.5: same basic approach, but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
Increase performance at 1.6x per year
– True from 1985-present
Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to high clock rates
– More transistors: architectural ideas turn transistors into performance
  Responsible for about half the yearly performance growth
Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction level parallelism
Memory Hierarchies
Caches hide latency of DRAM and increase BW
– CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches: add another level of caching
  First multilevel cache: 1986
  Secondary cache sizes today: 128 KB to 4-16 MB
Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
  early restart after cache miss (1992)
  nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
  prefetching: instruction to bring data into cache early
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
  superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple issue microprocessors
1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out-of-order
– speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
First 64-bit architecture
Integrated caches
– On-chip
– Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
First multiple issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
Implemented in 1991
– 1.3M transistors
– 50 MIPS
Used primarily as attached processor (e.g. graphics)
MIPS R10000
First speculative processor
– Instructions scheduled and executed out-of-order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in-flight)
– Maintain precise state by completing instructions in order
Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software Techniques
– Static scheduling
– Static issue (i.e. VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e. superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
Wide variety of approaches, both hardware and compiler intensive
Software techniques: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware techniques: more stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present
Big Picture - ILP and Memory Systems
[Figure: "ILP Mountain" (simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation) and "Cache Mountain" (simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching)]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) vs. year, 1965-1995, for supercomputers, minicomputers, mainframes and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistor count (log scale, 1,000 to 100,000,000) vs. year, 1970-2005, for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000; successive eras labeled bit-level parallelism, instruction-level and thread-level]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyper-Threading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
Support for 2-4 threads, but expect to get only a 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique to obtain more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute
  - independent programs
  - explicitly parallel programs (when possible)
  - speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education (not just the undergraduate curriculum) organized to support today's graduate for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
But, as we said earlier, engineering is changing
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT
Discrete mathematics, not continuous math, is the underpinning of IT
It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast
Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields
More knowledge of the full spectrum will be fundamental
Engineering is global and is performed in a holistic business context
The engineer must design under constraints that include global, cultural and business contexts, and so must understand them
These, too, are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that Web-based teaching, distance learning, electronic books and interactive learning environments will play increasingly significant roles in shaping what we teach, how we teach and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Types of Memories
MOS memories
  - RAMs (volatile: contents lost at power-off): DRAM, SRAM
  - ROMs (non-volatile: contents kept at power-off): ROM, EPROM, EEPROM, FLASH
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2-4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns
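The arithmetic behind this anecdote can be checked with a short sketch. It assumes the garbled figures read as 2 GBytes/s today (16 bytes every 8 ns) and 16 GBytes/s in the future (32 bytes every 2 ns); the function name is illustrative only.

```python
def pin_bandwidth_gbytes_per_s(bytes_per_transfer: float, interval_ns: float) -> float:
    """Bytes per nanosecond is numerically equal to GBytes per second."""
    return bytes_per_transfer / interval_ns


# Today's chip in the anecdote: 16 bytes every 8 ns.
today = pin_bandwidth_gbytes_per_s(16, 8)

# Future chip: twice the clock rate, twice the issue width, two instruction streams.
scale = 2 * 2 * 2
future = today * scale

# The same demand expressed as a transfer pattern: 32 bytes every 2 ns.
required = pin_bandwidth_gbytes_per_s(32, 2)

print(f"today: {today} GBytes/s, future: {future} GBytes/s, 32 B / 2 ns: {required} GBytes/s")
```

The three factors of two multiply to a factor of eight, which is why the future demand matches 32 bytes every 2 ns.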
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
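To see how much a two-level hierarchy helps, the figures above can be combined into an average memory access time. The sketch below is only an illustration: it picks 2 cycles for the first level, 8 cycles for the second level, and reads the hit rates as 95% and 50%; those specific choices and the function name are assumptions, not values fixed by the slides.

```python
def average_access_time(l1_cycles, l1_hit_rate, l2_cycles, l2_hit_rate, memory_cycles):
    """Average access time in cycles for a two-level cache in front of main memory.

    Every access pays the L1 latency; L1 misses also pay the L2 latency;
    L2 misses additionally pay the main-memory latency.
    """
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * memory_cycles)


# Assumed operating point taken from the ranges quoted above.
with_caches = average_access_time(
    l1_cycles=2, l1_hit_rate=0.95,
    l2_cycles=8, l2_hit_rate=0.50,
    memory_cycles=100,
)
without_caches = 100  # every access goes off-chip

print(f"with caches: {with_caches:.1f} cycles, without: {without_caches} cycles")
```

Under these assumptions the average access costs about 4.9 cycles instead of 100, which is why the structure of the hierarchy dominates performance.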
Conclusion concerning memory - problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems: caches, busses, bandwidth and latency
Conclusion concerning memory - problems, continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System-level parameters most affect performance
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM), a new type of memory
Magnetic RAM - MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter on a silicon wafer
MRAM is nonvolatile; it has the fast read-and-write performance of static RAM (SRAM)
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM)
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] (100, 90, 70 and 55 nm; axis 0 to 120) and density [Gb] (log axis, 0.1 to 100) for DRAM and FLASH, 2003-2010; labeled densities run from 512 Mb to 16 Gb, with SLC and MLC (2 and 3 bits/cell) FLASH variants]
Surpassing the Prediction from Moore's Law - DRAM vs MRAM
The famous Moore's Law predicts that memory density doubles every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year
[Figure: density [Gb] (log axis, 0.1 to 100) vs. year, 2000-2010, for DRAM and FLASH against the Moore's Law trend; FLASH, including MLC (2 and 3 bits/cell), shows a 2-fold density increase per year]
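The gap between the two growth models compounds quickly. The sketch below assumes the Moore's Law figure reads as a doubling every 1.5 years and the NAND Flash figure as a doubling every year; the 1 Gb starting density and the 6-year horizon are hypothetical values chosen only for illustration.

```python
def density_after(initial_gb: float, years: float, doubling_period_years: float) -> float:
    """Density after `years` of exponential growth with the given doubling period."""
    return initial_gb * 2 ** (years / doubling_period_years)


start_gb = 1.0   # hypothetical starting density
horizon = 6.0    # hypothetical number of years

moore = density_after(start_gb, horizon, doubling_period_years=1.5)
flash = density_after(start_gb, horizon, doubling_period_years=1.0)

print(f"after {horizon:.0f} years: {moore:.0f} Gb at 1.5-year doubling, {flash:.0f} Gb at yearly doubling")
```

Over six years the yearly-doubling curve ends up four times denser than the 1.5-year curve (64 Gb versus 16 Gb from the same starting point).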
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap - continued
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
Power consumption
• During 1995, the energy consumption of all PC machines installed in the USA was 60·10^6 MWh
• During 2000, the energy consumption of all PC machines installed in the USA was 10% of the total energy production
• During 2015, one expects that the energy consumption of all PC machines will be 15% greater than in 1995, or 69·10^6 MWh
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers" (Kuroda-Sakurai '95)
"By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced"
Power dissipation in time
[Figure: power (W, log scale, 0.01 to 100) vs. year, 1980-1995; trend of x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage [V], power per chip [W] and VDD current [A] vs. year, 1998-2014; voltage axis 0 to 2.5, remaining axes 0 to 200 and 0 to 500]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA) and the Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 - capacitive switching power (dynamic - dominant)
P2 - short circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
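A small numeric sketch of these terms may help: P1 is taken as the switching term (activity factor times load capacitance times the square of the supply voltage times the clock frequency), P2 and P3 as the short-circuit and leakage currents multiplied by the supply voltage, and P4 as a given constant. All parameter values below are hypothetical and chosen only to show the relative behaviour.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p_static=0.0):
    """Return the four power components (in watts) and their total.

    p_t    : switching activity factor (dimensionless)
    c_load : load capacitance in farads
    v_dd   : supply voltage in volts
    f_clk  : clock frequency in hertz
    i_sc   : short-circuit current in amperes
    i_leak : leakage current in amperes
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leak * v_dd                      # leakage power
    return {"P1": p1, "P2": p2, "P3": p3, "P4": p_static,
            "total": p1 + p2 + p3 + p_static}


# Hypothetical operating points: same circuit at 2.0 V and at 1.0 V.
high = average_power(p_t=0.5, c_load=1e-9, v_dd=2.0, f_clk=1e8, i_sc=1e-3, i_leak=1e-4)
low = average_power(p_t=0.5, c_load=1e-9, v_dd=1.0, f_clk=1e8, i_sc=1e-3, i_leak=1e-4)

print(f"switching power: {high['P1']:.3f} W at 2.0 V, {low['P1']:.3f} W at 1.0 V")
```

Halving the supply voltage in this sketch cuts the switching term to a quarter, while the current-driven terms only halve.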
Research Efforts in Low-Power Design
$P_{sw} = p_t C_L V_{dd}^2 f_{clk}$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flop
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
  - supply voltage
  - load capacitance
  - switching activity
Reducing the supply voltage brings a quadratic improvement
Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] vs. supply voltage [V], 0.6 to 5.0 V]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is
decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low-activity conditions for various parts of the circuit and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product pt·CL is called the average switched capacitance per cycle; this capacitance can be reduced at the system, architectural, RTL, circuit or technology level
General Approaches to Reduce Power
Low Power and Low Energy System Design
higher impact, more options
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology
Multiple Frequency on the Chip as Technique to Reduce Power
Multiple frequencies on the chip is a less aggressive approach that is attracting more attention
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a main PLL generates fCLK; four PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4) derive the local frequencies f11, f12, f21, f31, f32 and f41]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based scheme (phase detector, current pump, loop filter, VCO, divider by N, clock distribution & frequency multiplier, regulated voltage, digital system) and DLL-based scheme (phase detector, current pump, loop filter, VCDL with delay cells DC1 to DCn producing f1, 2f1, ..., nf1, control logic, digital system); signals CLKREF, CLKFB, up, down]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating: a latch captures the enable signal and gates the clock to the target flip-flops (DFF with D, C, Q), shown activated and deactivated; clock distribution: a PLL (clock generator) drives Clk to modules A, B, C and D, gated by Enable_A, Enable_B and Enable_C]
The use of a gated clock is the most common approach to reduce energy. Unused modules are turned off by suppressing the clock to the module
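The effect of clock gating can be illustrated with a small behavioural model. This is only a sketch under assumed conventions: signals are lists of 0/1 samples (two samples per clock cycle), the enable is captured by a latch that is transparent while the clock is low, and the function names are hypothetical.

```python
def gated_clock(clock, enable):
    """Latch-based clock gate: capture `enable` while the clock is low, then AND with the clock."""
    latched = 0
    gated = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en            # latch is transparent only while the clock is low
        gated.append(clk & latched)
    return gated


def rising_edges(signal):
    """Count 0 -> 1 transitions, assuming the signal is 0 before the first sample."""
    previous = [0] + signal[:-1]
    return sum(1 for prev, cur in zip(previous, signal) if prev == 0 and cur == 1)


# Eight clock cycles; the module is needed only during the first three.
clock = [0, 1] * 8
enable = [1] * 6 + [0] * 10

gated = gated_clock(clock, enable)
print(f"clock edges: {rising_edges(clock)}, edges seen by the gated module: {rising_edges(gated)}")
```

In this trace the gated module's flip-flops see three clock edges instead of eight, so their clock-related switching activity drops accordingly.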
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management as another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
[Figure: power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt) and SLEEP (P = 0.16 mW, wait for wake-up event); transition times of about 10 µs, about 90 µs and 160 ms. Power manager: an observer passes observations and workload information to a controller, which issues commands to the system]
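The benefit of moving between power states can be estimated with a minimal energy model. The sketch assumes the power levels in the state machine read as 400 mW (RUN), 50 mW (IDLE) and 0.16 mW (SLEEP); the workload traces are hypothetical, and transition delays and transition energy are deliberately ignored.

```python
# Power per state in milliwatts, as read from the power state machine.
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def energy_mj(trace):
    """Energy in millijoules for a trace of (state, seconds) pairs (mW * s = mJ)."""
    return sum(POWER_MW[state] * seconds for state, seconds in trace)


# Hypothetical 10-second window: no power management vs. a managed schedule.
always_running = [("RUN", 10.0)]
managed = [("RUN", 2.0), ("IDLE", 3.0), ("SLEEP", 5.0)]

unmanaged_energy = energy_mj(always_running)
managed_energy = energy_mj(managed)

print(f"unmanaged: {unmanaged_energy:.1f} mJ, managed: {managed_energy:.1f} mJ")
```

A real power manager also has to weigh the wake-up latency of each state against the expected idle time, which this sketch leaves out.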
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: pie charts of the power breakdown (clock, memory, control, I/O, datapath) and of the dynamic instruction statistics (arithmetic op, logical op, compare op, data move, control flow, others)]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline - Microprocessors' Generations
• First generation 1971-78
  - Behind the power curve
• Second generation 1979-85
  - Becoming "real" computers
• Third generation 1985-89
  - Challenging the "establishment"
• Fourth generation 1990-
  - Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation 1971-78
Getting enough bits and transistors
Transistor counts < 50,000
Performance < 0.5 MIPS
Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
First general-purpose single-chip microprocessor
Shipped in 1971
8-bit architecture, 4-bit implementation
2300 transistors
Performance < 0.1 MIPS
8008: 8-bit implementation in 1972
  - 3500 transistors
  - First microprocessor-based computer (Micral)
      Targeted at laboratory instrumentation
      Mostly sold in Europe
Intel 8080
Intel's first 16-bit architecture
  - Delivered in 1974
  - 4800 transistors
  - Performance < 0.2 MIPS
Used in Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
      $297, or $395 with case
      256 bytes of memory, expandable to 64K
      Keyboard and floppy
      100-line bus becomes S-100, first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
      Homebrew Computer Club
Intel 8086
Introduced in 1978
  - Performance < 0.5 MIPS
New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
In 1981 IBM introduces the PC
  - Based on 8088: 8-bit bus version of 8086
Second Generation 1979-85
Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs and PCs based on microprocessors
Transistors > 50,000
Performance <= 1 MIPS
Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
Major architectural step in microprocessors
  - First 32-bit architecture
      initial 16-bit implementation
  - First flat 32-bit address
      Support for paging
  - General-purpose register architecture
      Loosely based on PDP-11
First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
Transistors < 500K
Performance > 5 MIPS
Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation 1990-
Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
Transistors > 1M
Clock rates > 100 MHz
Performance > 50 MIPS
Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
Generation 4.5: same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
Increase performance at 1.6x per year
  - True from 1985 to present
Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors
      Architectural ideas turn transistors into performance
  - Responsible for about half the yearly performance growth
Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction-level parallelism
Memory Hierarchies
Caches hide latency of DRAM and increase BW
  - CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches add another level of caching
      First multilevel cache: 1986
      Secondary cache sizes today: 128 KB to 4-16 MB
Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
      early restart after cache miss (1992)
      nonblocking caches: continue during a cache miss (1994)
  - Cache-aware combos: computers, compilers, code writers
      prefetching: instruction to bring data into cache early
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock
      superscalar: uses dynamic issue decision (HW driven)
      VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
  - determine parallelism dynamically
  - execute instructions out-of-order
  - speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded a 20-year path
MIPS R4000MIPS R4000 First 64-bit architecture Integrated caches
ndash On-chipndash Support for off-chip secondary
cache Integrated floating point Implemented in 1991
ndash Deep pipelinendash 14M transistorsndash Initially 100MHzndash gt 50 MIPS
Intel i860Intel i860
First multiple issue microprocessor ndash 2 instructionsclockndash Dual issue mode ndash Novel push pipelinendash Novel cache bypass
Implemented in 1991ndash 13M transistorsndash 50 mips
Used primarily as attached processor (eg graphics)
MIPS R10000MIPS R10000
First speculative processorndash Instruction scheduled and
executed out-of-orderndash Up to 4 instructions can
complete per clockndash Window of 32 instructions
(up to 32 in-flight)ndash Maintain precise state by
completing instructions in order
Implemented in 1996ndash 68M transistorsndash 200 MHz
Intel IA-64 and Itanium
EPIC architecture
- Uses a compiler-centric approach while avoiding its disadvantages
- Parallelism demarcated by the compiler
- Many special instructions & features for exploiting ILP in the compiler
Itanium
- First implementation (2001)
- 25M transistors
- 800 MHz
- 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software Techniques
- Static scheduling
- Static issue (i.e., VLIW)
- Static branch prediction
- Alias/pointer analysis
- Static speculation
Hardware Techniques
- Dynamic scheduling
- Dynamic issue (i.e., superscalar)
- Dynamic branch prediction
- Dynamic disambiguation
- Dynamic speculation
Wide variety of approaches, both hardware and compiler intensive
Software techniques: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware techniques: more stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers, and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistor count (log scale, 1,000 to 100,000,000) versus year, 1970-2005, for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000, with eras of bit-level, instruction-level, and thread-level (?) parallelism.]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4)
Support for 2-4 threads, but expect only a 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa-2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute
- independent programs
- explicitly parallel programs (when possible)
- speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view on the question: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education - not just the undergraduate curriculum - organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change - so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT
Discrete mathematics, not continuous math, is the underpinning of IT
It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast
Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields
More knowledge of the full spectrum will be a fundamental
Engineering is global and is performed in a holistic business context
The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them
These too are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
Web-based teaching, distance learning, electronic books, and interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Percentage of Usage
Typical Applications of DRAM
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2-4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns
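The pin-bandwidth arithmetic in this anecdote can be checked in a few lines. This is only a sketch of the calculation; the function name is made up for illustration, and 1 GByte is taken as 10^9 bytes.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer: float, period_ns: float) -> float:
    """Bytes per nanosecond is numerically equal to GBytes/s (1 GByte = 1e9 bytes)."""
    return bytes_per_transfer / period_ns


# Today's chip in the anecdote: 16 bytes every 8 ns.
today = bandwidth_gbytes_per_s(16, 8)

# Future chip: twice the clock rate, twice the issue width, two instruction streams.
future = today * 2 * 2 * 2

# The same demand expressed as 32 bytes every 2 ns.
as_pin_rate = bandwidth_gbytes_per_s(32, 2)

print(today, future, as_pin_rate)
```

The three factors of two multiply, which is why the required bandwidth grows eightfold rather than sixfold.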
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
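The effect of such a two-level hierarchy can be estimated with the usual average-access-time formula. The sketch below plugs in the figures quoted on these slides (a 2-3 cycle first level catching about 95% of accesses, a 6-10 cycle second level catching about 50% of first-level misses, about 100 cycles to main memory); treating each level's access time as added on a miss is a simplifying assumption.

```python
def average_access_cycles(
    l1_cycles: float,
    l1_hit_rate: float,
    l2_cycles: float,
    l2_hit_rate: float,
    memory_cycles: float,
) -> float:
    """Average memory access time: every access pays the L1 time, L1 misses
    also pay the L2 time, and L2 misses also pay the main-memory time."""
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * memory_cycles)


best_case = average_access_cycles(2, 0.95, 6, 0.50, 100)
worst_case = average_access_cycles(3, 0.95, 10, 0.50, 100)
no_caches = 100.0  # every access goes to main memory

print(best_case, worst_case, no_caches)
```

Under these assumptions the hierarchy brings the average access from about 100 cycles down to roughly 5-6 cycles, which is why the structure of the hierarchy dominates performance.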
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
As conclusion concerning memory - problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems: caches, buses, bandwidth, and latency
As conclusion concerning memory - problems, continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width (the data access granularity) can effect a 15% performance change
c) Magnetic RAM (MRAM) - a new type of memory
Magnetic RAM - MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm - just a few thousand atoms - in diameter on a silicon wafer
MRAM is nonvolatile; it has the fast read-and-write performance of static RAM (SRAM)
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM)
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] (100 nm, 90 nm, 70 nm, 55 nm; axis 0 to 120) and density [Gb] (log scale, 0.1 to 100; points from 512 Mb to 16 Gb) versus year, 2003-2010, for FLASH (SLC, MLC 2 bits/cell, MLC 3 bits/cell) and DRAM.]
Surpassing the Prediction from Moore's Law - DRAM vs MRAM
The famous Moore's Law predicts that memory density doubles every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year
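The difference between the two doubling periods compounds quickly. The sketch below compares a 1.5-year doubling period (the usual reading of Moore's Law) with the one-year doubling claimed for NAND Flash; the 1 Gb starting density and the six-year horizon are illustrative assumptions, not values read from the chart.

```python
def density_after(start_gb: float, years: float, doubling_period_years: float) -> float:
    """Density after `years` of exponential growth with the given doubling period."""
    return start_gb * 2 ** (years / doubling_period_years)


START_GB = 1.0   # assumed starting density, for illustration only
HORIZON = 6.0    # assumed number of years, for illustration only

moore = density_after(START_GB, HORIZON, 1.5)   # doubling every 1.5 years
flash = density_after(START_GB, HORIZON, 1.0)   # doubling every year

print(moore, flash, flash / moore)
```

Over the same six years, one-year doubling yields four times the density of 1.5-year doubling, and the gap keeps widening with time.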
[Figure: density [Gb] (log scale, 0.1 to 100) versus year, 2000-2010, comparing FLASH (two-fold density per year; SLC, MLC 2 bits/cell, MLC 3 bits/cell), DRAM, and the Moore's Law trend line, with points from 512 Mb up to 12 Gb.]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap - continued
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
• During 1995, the energy consumption of all PCs installed in the USA was 60 x 10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production
• During 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 x 10^6 MWh
Power consumption
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai 95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: power dissipation (W, log scale, 0.01 to 100) versus year, 1980-1995; the trend is about x4 every 3 years.]
Gloom and Doom predictions
Power density will increase
VDD, Power, and Current Trend
[Figure: supply voltage [V] (0 to 2.5), power per chip [W], and VDD current [A] versus year, 1998-2014.]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA), and the Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
$P_1$ - capacitive switching power (dynamic, dominant)
$P_2$ - short-circuit power (dynamic)
$P_3$ - leakage current power (static)
$P_4$ - static power dissipation (minor)
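The power equation on this slide, P_avg = p_t C_L V_dd^2 f_clk + I_sc V_dd + I_leakage V_dd + P4, can be evaluated term by term. The sketch below does so; every numeric value is a made-up illustration rather than data from the slides, and the function and parameter names are hypothetical.

```python
from typing import Dict


def average_power(
    p_t: float,        # switching activity (probability of a transition per cycle)
    c_load: float,     # load capacitance C_L in farads
    v_dd: float,       # supply voltage in volts
    f_clk: float,      # clock frequency in hertz
    i_sc: float,       # average short-circuit current in amperes
    i_leak: float,     # leakage current in amperes
    p_static: float = 0.0,  # other static dissipation (P4) in watts
) -> Dict[str, float]:
    """Return the individual power terms and their sum, all in watts."""
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leak * v_dd                      # leakage power
    return {"P1": p1, "P2": p2, "P3": p3, "P4": p_static,
            "total": p1 + p2 + p3 + p_static}


# Illustrative operating points: same circuit at 2.5 V and at 1.25 V.
high = average_power(0.2, 1e-9, 2.5, 100e6, 1e-3, 1e-5)
low = average_power(0.2, 1e-9, 1.25, 100e6, 1e-3, 1e-5)

print(high["P1"], low["P1"], low["P1"] / high["P1"])
```

Halving the supply voltage cuts the switching term to a quarter, while the short-circuit and leakage terms in this simple model fall only linearly because their currents are held fixed.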
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], from 0.6 V to 5.0 V.]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is
Decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low-active conditions for various parts of the circuit and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-active components
a) Reduction in $f_{CLK}$ is an option, acceptable when some components may be idle or low-active during operation
b) Reduction in $V_{dd}$ is the most effective way to reduce power, since the power is proportional to the square of $V_{dd}$. The problem with reducing $V_{dd}$ is that it leads to an increase in circuit delay
c) The product $p_t C_L$ is called the average switched capacitance per cycle; this capacitance can be reduced at the system, architectural, RTL, circuit, or technology level
General Approaches to Reduce Power
Low Power and Low Energy System Design
Design levels, listed from higher impact and more options downward:
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology
Multiple Frequency on the Chip as Technique to Reduce Power
A less aggressive approach, which attracts more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), which generate the local clock frequencies f11, f12, f21, f31, f32, and f41.]
Energy Minimization Using Multiple Frequency
[Figure, PLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO with regulated voltage, and divider by N, followed by clock distribution & frequency multiplier logic driving the digital system.]
[Figure, DLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, and a voltage-controlled delay line (VCDL, delay TVCDL) built from delay cells DC1 to DCn, with control producing f1, 2f1, ..., nf1 for the digital system.]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure, clock gating: the enable signal passes through a latch and is combined with the clock to produce the gated clock that drives the target flip-flops (DFF with D, C, Q), which are activated or deactivated accordingly.]
[Figure, clock distribution: a PLL (clock generator) distributes Clk to blocks A, B, C, and D, gated by Enable_A, Enable_B, and Enable_C.]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
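A small behavioral sketch of latch-based clock gating follows, assuming the common arrangement in which the enable is captured by a latch that is transparent while the clock is low and then ANDed with the clock. The sampled waveforms are invented for illustration; this is a model of the idea, not a description of any particular circuit on the slide.

```python
from typing import List


def gated_clock(clock: List[int], enable: List[int]) -> List[int]:
    """Model a latch + AND clock gate over sampled waveforms.

    The latch follows `enable` while the clock is low and holds its value
    while the clock is high, so enable changes during the high phase
    cannot produce glitches on the gated clock.
    """
    latched = 0
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en          # latch is transparent during the low phase
        out.append(clk & latched)
    return out


def rising_edges(signal: List[int]) -> int:
    """Count 0 -> 1 transitions, assuming the signal starts low."""
    return sum(1 for prev, cur in zip([0] + signal, signal) if prev == 0 and cur == 1)


# Illustrative waveforms: four clock cycles, module enabled in two of them.
CLOCK = [0, 1, 0, 1, 0, 1, 0, 1]
ENABLE = [1, 1, 0, 1, 0, 0, 1, 1]   # the 1 at index 3 arrives while the clock is high

GATED = gated_clock(CLOCK, ENABLE)
print(GATED, rising_edges(CLOCK), rising_edges(GATED))
```

Every suppressed rising edge is a cycle in which the target flip-flops and their clock wiring do not switch, which is where the energy saving comes from.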
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
[Figure, power state machine: RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt), and SLEEP (P = 0.16 mW, wait for wake-up event), with transition times of about 10 µs, about 90 µs, and 160 ms.]
[Figure, power manager: an observer receives observations and workload information from the system, and a controller sends commands to it.]
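A power manager has to decide whether an idle period is long enough to justify the slow wake-up from SLEEP. The sketch below uses the state powers shown in the power state machine (400 mW, 50 mW, 0.16 mW) and reads the 160 ms figure as the SLEEP-to-RUN wake-up time; charging the wake-up at full RUN power and ignoring the short transitions are simplifying assumptions, and all function names are hypothetical.

```python
P_RUN_MW = 400.0      # power in RUN
P_IDLE_MW = 50.0      # power in IDLE
P_SLEEP_MW = 0.16     # power in SLEEP
WAKEUP_S = 0.160      # assumed SLEEP -> RUN wake-up time


def idle_energy_mj(interval_s: float) -> float:
    """Energy (mW * s = mJ) if the system waits in IDLE for the whole interval."""
    return P_IDLE_MW * interval_s


def sleep_energy_mj(interval_s: float) -> float:
    """Energy if the system sleeps for the interval and then pays for a
    wake-up, assumed to draw RUN power for WAKEUP_S seconds."""
    return P_SLEEP_MW * interval_s + P_RUN_MW * WAKEUP_S


def choose_state(interval_s: float) -> str:
    """Pick the lower-energy state for an idle period of known length."""
    return "SLEEP" if sleep_energy_mj(interval_s) < idle_energy_mj(interval_s) else "IDLE"


def break_even_s() -> float:
    """Idle length at which both choices cost the same energy."""
    return P_RUN_MW * WAKEUP_S / (P_IDLE_MW - P_SLEEP_MW)


print(choose_state(0.5), choose_state(5.0), break_even_s())
```

In practice the length of an idle period is not known in advance, so the manager's observer has to predict it from workload information before the controller issues a command.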
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown: clock, memory, control, I/O, datapath. Dynamic instruction statistics: arithmetic, logical, compare, data move, control flow, and other operations.]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First generation: 1971-78
  - Behind the power curve
• Second generation: 1979-85
  - Becoming "real" computers
• Third generation: 1985-89
  - Challenging the "establishment"
• Fourth generation: 1990-
  - Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation: 1971-78
Getting enough bits and transistors
Transistor counts < 50,000
Performance < 0.5 MIPS
Architecture: 8-16 bits
- Narrow datapaths (= slow performance)
- Awkward architectures
- Assembly language + some BASIC
Processors
- Intel 4004, 8008, 8080, 8086
- Zilog Z-80
- Motorola 6800, MOS Technology 6502
Intel 4004
First general-purpose single-chip microprocessor
Shipped in 1971
8-bit architecture, 4-bit implementation
2,300 transistors
Performance < 0.1 MIPS
8008: 8-bit implementation in 1972
- 3,500 transistors
- First microprocessor-based computer (Micral)
  Targeted at laboratory instrumentation
  Mostly sold in Europe
Intel 8080
Intel's first 16-bit architecture
- Delivered in 1974
- 4,800 transistors
- Performance < 0.2 MIPS
Used in the Altair 8800 system
- Kit form (advertised in Popular Electronics) in 1975
  $297, or $395 with case
  256 bytes of memory, expandable to 64K
  Keyboard and floppy
  100-line bus becomes S-100, the first microcomputer bus
- Gates & Allen write BASIC
- Wozniak builds one
  Homebrew Computer Club
Intel 8086
Introduced in 1978
- Performance < 0.5 MIPS
New 16-bit architecture
- "Assembly language" compatible with 8080
- 29,000 transistors
- Includes memory protection, support for FP coprocessor
In 1981, IBM introduces the PC
- Based on the 8088, an 8-bit bus version of the 8086
Second Generation: 1979-85
Becoming "real" computers
- First 32-bit architecture (68000)
- First virtual memory support
- Workstations, Macs, and PCs based on microprocessors
Transistors > 50,000
Performance <= 1 MIPS
Processors
- Motorola 68000, 68020
- Intel 80286, 80386
Motorola 68000
Major architectural step in microprocessors
- First 32-bit architecture (initial 16-bit implementation)
- First flat 32-bit address
  Support for paging
- General-purpose register architecture
  Loosely based on PDP-11
First implementation in 1979
- 68,000 transistors
- < 1 MIPS
Used in
- Apple Mac
- Sun, Silicon Graphics & Apollo workstations
Third Generation: 1985-89
Challenging the "establishment"
- Microprocessors surpass minicomputers in performance, rival mainframes
- Implementation technology of choice: all new architectures are microprocessors
- RISC architecture techniques take hold
Transistors < 500K
Performance > 5 MIPS
Processors
- MIPS R2000, R3000
- Sun SPARC
- HP PA-RISC
MIPS R2000
Several firsts
- First RISC microprocessor
- First microprocessor to provide integrated support for instruction & data caches
- First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
- 125,000 transistors
- 5-8 MIPS
Fourth Generation: 1990-
Architectural and performance leadership
- First 64-bit architecture
- First multiple-issue machine
- First multilevel caches
Transistors > 1M
Clock rates > 100 MHz
Performance > 50 MIPS
Processors
- Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
Generation 4.5: same basic approach, but faster clock rates & wider issue
- Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
Increase performance at 1.6x per year
- True from 1985 to present
Combination of technology and architectural enhancements
- Technology provides faster transistors and more of them
- Faster transistors lead to higher clock rates
- More transistors: architectural ideas turn transistors into performance
  Responsible for about half the yearly performance growth
Two key architectural directions
- Sophisticated memory hierarchies
- Exploiting instruction-level parallelism
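Reading the growth figure on this slide as 1.6x per year, a short sketch shows how quickly it compounds. The 15-year horizon is an illustrative choice, and splitting the yearly factor evenly between technology and architecture in the multiplicative sense is just one possible interpretation of "about half the yearly performance growth".

```python
def cumulative_speedup(yearly_factor: float, years: int) -> float:
    """Total performance gain after compounding a yearly factor."""
    return yearly_factor ** years


YEARLY_FACTOR = 1.6   # performance growth per year, as read from the slide
YEARS = 15            # assumed horizon, e.g. 1985 to 2000

total = cumulative_speedup(YEARLY_FACTOR, YEARS)

# Assumed even multiplicative split: technology and architecture each
# contribute the square root of the yearly factor.
per_source_factor = YEARLY_FACTOR ** 0.5
architecture_only = cumulative_speedup(per_source_factor, YEARS)

print(total, per_source_factor, architecture_only)
```

Under this reading, architecture alone accounts for a gain of roughly the square root of the total, which shows how much of the overall improvement depends on having both sources at once.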
Memory Hierarchies
Caches hide latency of DRAM and increase BW
- CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
- On-chip: from 128 bytes (1984) to 100K+ bytes
- Multilevel caches add another level of caching
  First multilevel cache: 1986
  Secondary cache sizes today: 128 KB to 4-16 MB
Trend 2: Advances in caching techniques
- Reduce or hide cache miss latencies
  Early restart after cache miss (1992)
  Nonblocking caches continue during a cache miss (1994)
- Cache-aware combos: computers, compilers, code writers
  Prefetching instructions bring data into the cache early
Exploiting ILP
ILP is the implicit parallelism among instructions. It is exploited by
- Overlapping execution in a pipeline
- Issuing multiple instructions per clock
  Superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
- determine parallelism dynamically
- execute instructions out of order
- speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded a 20-year path
MIPS R4000
First 64-bit architecture
Integrated caches
- On-chip
- Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
- Deep pipeline
- 1.4M transistors
- Initially 100 MHz
- > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as an attached processor (e.g., graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out of order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in flight)
  - Maintains precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Use a compiler-centric approach while avoiding its disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between compiler and runtime hardware

Today's Uniprocessor ILP Menu
• Software techniques
  - Static scheduling
  - Static issue (i.e., VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
  - Trade-offs: lower hardware complexity, more longer-range analysis, more machine dependence
• Hardware techniques
  - Dynamic scheduling
  - Dynamic issue (i.e., superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
  - Trade-offs: more stable performance, higher complexity, potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: "ILP Mountain" (simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation) and "Cache Mountain" (simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching)]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: Performance (log scale, 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers, and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: Transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970 to 2005, for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000, across the bit-level, instruction-level, and thread-level parallelism eras]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) the simultaneous multithreaded (SMT) processor, and
b) the chip multiprocessor (CMP)
Multithreading
• A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
• It is hard to achieve this level of parallelism from a single program
• Can we run multiple programs (threads) on a single processor without much effort?
• Simultaneous multithreading (SMT), or Hyper-Threading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
• Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4)
• They support 2-4 threads, but expect only a 1.3x improvement in throughput
Chip Multiprocessor
• Several processor cores on one die
• Shared L2 caches
• Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
• CMP is a simple yet very powerful technique to obtain more performance in a power-efficient manner
• The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC)
• The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model (continued)
• CMP is an attractive option to use when moving to a new process technology such as SoC
• Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
• MPSoCs are usually implemented as heterogeneous systems
• CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
• 4-8 general-purpose processing engines on chip, used to execute independent programs
• Explicitly parallel programs (when possible)
• Speculatively parallel threads
• Special-purpose processing units (e.g., DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures

Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges: Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
• It has often been said that "where you stand depends on where you sit"
• In this context, starting from our positions and experience, this is our view concerning the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
• The engineers we are training today will still be practicing 40 years from now
• Are we preparing them for what they will be doing then?
• Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
• We think not, on both counts
Our view & our experience
• Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
• Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
• It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications
• But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals
• The undergraduate curriculum should teach (only) fundamentals
• Everyone agrees with that
• But what are fundamentals?
• Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
• But, as we said earlier, engineering is changing
What kinds of fundamentals we need now: some examples
• Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental
• Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
• Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental
• Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. They too are new fundamentals
How to add these new fundamentals
• The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
• We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books, and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn
A sort of challenge we should accept
• During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time
• That is the sort of challenge we should accept for improving engineering education
Typical Applications of DRAM
An anecdote
• In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 to 4.5 CPU cycles per instruction retired
• In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
• Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
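The "three out of every four cycles" remark can be checked with a short sketch. It reads the measured figure as 4.2 to 4.5 cycles per instruction and assumes, as a simplification that is not stated on the slide, that at most one instruction retires in any cycle; the function name is illustrative only.

```python
def idle_cycle_fraction(cpi: float) -> float:
    """Fraction of cycles that retire no instruction.

    Assumption (not from the slides): at most one instruction retires
    per cycle, so out of `cpi` cycles exactly one does useful retirement.
    """
    if cpi < 1.0:
        raise ValueError("this simple model assumes CPI >= 1")
    return 1.0 - 1.0 / cpi


for cpi in (4.2, 4.5):
    share = idle_cycle_fraction(cpi)
    print(f"CPI {cpi}: {share:.0%} of cycles retire no instruction")
```

Under this model both measurements land slightly above three quarters, which matches the slide's rounding.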
An anecdote (continued)
• If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams
• All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
• It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns
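The pin-bandwidth arithmetic in this anecdote can be reproduced in a few lines. This is only a sketch of the slide's own numbers, reading the rates as GBytes per second; the helper name is hypothetical.

```python
def bandwidth_gb_per_s(bytes_per_transfer: float, period_ns: float) -> float:
    """Bytes per nanosecond is numerically equal to GBytes per second (10^9 bytes)."""
    return bytes_per_transfer / period_ns


today = bandwidth_gb_per_s(16, 8)   # 16 bytes every 8 ns
scale = 2 * 2 * 2                   # 2x clock rate, 2x issue width, 2 instruction streams
future = today * scale

print(f"today:  {today:.0f} GBytes/s")
print(f"future: {future:.0f} GBytes/s")
print(f"32 bytes every 2 ns: {bandwidth_gb_per_s(32, 2):.0f} GBytes/s")
```

The three factors multiply to 8, so the future requirement is eight times today's rate, which is the same as moving 32 bytes every 2 ns.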
Memory system
• In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance
• To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
• A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
• A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
• Most advanced microprocessors today employ two levels of caches on chip
• The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
• The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
Memory Hierarchy Impact on Performance
• An off-chip memory access may take about 100 cycles
• A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
• Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
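To see how the two cache levels combine, here is a minimal sketch of the standard average-memory-access-time formula applied to the figures quoted above. It reads the hit rates as 95% for the first level and 50% of first-level misses for the second, and picks mid-range latencies (2 and 8 cycles); those picks and the function name are assumptions, not values from the slides.

```python
def average_access_cycles(l1_cycles: float, l1_hit_rate: float,
                          l2_cycles: float, l2_hit_rate: float,
                          memory_cycles: float) -> float:
    """Average cycles per access for a two-level cache in front of main memory.

    Every access pays the L1 latency; L1 misses also pay the L2 latency;
    L2 misses additionally pay the main-memory latency.
    """
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * memory_cycles)


with_caches = average_access_cycles(2, 0.95, 8, 0.50, 100)
print(f"average access time with two cache levels: {with_caches:.1f} cycles")
print("average access time with no caches:         100 cycles")
```

Even with half of the first-level misses going all the way to main memory, the average stays within a few cycles, which is why the structure of the hierarchy matters so much.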
Conclusions concerning memory problems
• Today's chips are largely able to execute code faster than we can feed them with instructions and data
• There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
• The real design action is in memory subsystems: caches, buses, bandwidth, and latency
Conclusions concerning memory problems (continued)
• If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture-level and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
• One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
• Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
• System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM): a new type of memory
Magnetic RAM (MRAM) or Nanotech RAM (NRAM)
• Based on nanoscale semiconductor technology
• A nanotechnology RAM device consists of tiny carbon nanotubes
• Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
• The 10 Gbit device consists of 10 billion carbon nanotubes on a silicon wafer, each 1 nm (just a few thousand atoms) in diameter
• MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM)
MRAM as Universal Memory
• MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM)
• Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: Design rule [nm] (0 to 120) and density [Gb] (log scale, 0.1 to 100) for FLASH and DRAM, 2003 to 2010. Design rules step through 100 nm, 90 nm, 70 nm, and 55 nm; labelled densities include 512 Mb, 1 Gb, 2 Gb, 4 Gb, 8 Gb, and 16 Gb, with SLC, MLC (2 bits/cell), and MLC (3 bits/cell) FLASH variants]
Surpassing the Prediction from Moore's Law: DRAM vs MRAM
Moore's Law predicts that memory density doubles every 1.5 years, while the new growth model clearly indicates that NAND Flash memory density doubles every year.
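The difference between the two doubling periods compounds quickly. The sketch below reads the Moore's Law doubling period as 1.5 years and the flash growth model as doubling every year; the six-year horizon and the function name are arbitrary choices made for illustration.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Density multiplier after `years` when density doubles every `doubling_period_years`."""
    return 2 ** (years / doubling_period_years)


years = 6
moore = growth_factor(years, 1.5)   # doubling every 1.5 years
flash = growth_factor(years, 1.0)   # doubling every year

print(f"after {years} years: Moore's Law x{moore:.0f}, yearly doubling x{flash:.0f}")
```

Over the same six years, yearly doubling yields four times the density that 1.5-year doubling does.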
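[Figure: Density [Gb] (log scale, 0.1 to 100) versus year, 2000 to 2010, for FLASH (2-fold density per year, with MLC 2 bits/cell and MLC 3 bits/cell variants) and DRAM, compared with the Moore's Law trend. Labelled densities: 512 Mb, 1 Gb, 2 Gb, 4 Gb, and 12 Gb]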
Overall memory prediction roadmap
• Even though the density growth of DRAM will slow down, DRAM will keep leading overall memory technology and will be able to reach 8 Gb density in ten years
• High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap (continued)
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges: Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60·10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of total energy production
• By 2015, one expects the energy consumption of all PCs to be 15% greater than in 1995, or 69·10^6 MWh
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai, '95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: Power (W, log scale, 0.01 to 100) versus year, 1980 to 1995; the trend is x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: Voltage [V] (0 to 2.5), power per chip [W] (0 to 200), and VDD current [A] (0 to 500) versus year, 1998 to 2014]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), Electronic Industries Association of Japan (EIAJ), Korea Semiconductor Industry Association (KSIA), and Taiwan Semiconductor Industry Association (TSIA) (taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
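The first three terms of the power equation can be evaluated directly. The sketch below is illustrative only: the function name and every numeric operating point (activity factor, load capacitance, supply voltage, clock, and currents) are hypothetical values chosen to make the arithmetic visible, not data from the slides.

```python
def cmos_power_terms(p_t: float, c_load: float, v_dd: float, f_clk: float,
                     i_sc: float, i_leakage: float) -> tuple[float, float, float]:
    """Return (P1, P2, P3) in watts.

    P1 = p_t * C_L * Vdd^2 * f_clk   capacitive switching power
    P2 = I_sc * Vdd                  short-circuit power
    P3 = I_leakage * Vdd             leakage current power
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk
    p2 = i_sc * v_dd
    p3 = i_leakage * v_dd
    return p1, p2, p3


# Hypothetical operating point: 20% switching activity, 1 nF total switched
# load, 2.5 V supply, 100 MHz clock, 5 mA short-circuit and 0.1 mA leakage current.
p1, p2, p3 = cmos_power_terms(0.2, 1e-9, 2.5, 100e6, 5e-3, 1e-4)
total = p1 + p2 + p3
print(f"P1 = {p1:.4f} W, P2 = {p2:.4f} W, P3 = {p3:.5f} W, total = {total:.4f} W")
print(f"switching share of total: {p1 / total:.0%}")
```

With numbers of this kind the switching term dominates, which is consistent with P1 being labelled the dominant component.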
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
• Reduce switching activity
  - Conditional clock
  - Conditional precharge
  - Switching off inactive blocks
  - Conditional execution
• Run it slower
  - Use parallelism
  - Fewer pipeline stages
  - Use double-edge flip-flops
• Technology scaling
  - The highest win
  - Thresholds should scale
  - Leakage starts to bite
  - Dynamic voltage scaling
• Reduce the active load
  - Minimize the circuits
  - Use more efficient design
  - Charge recycling
  - More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
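The quadratic effect of the supply voltage is easy to quantify. In the sketch below the voltage pairs are hypothetical examples, and the comparison assumes the load capacitance, switching activity, and clock frequency are left unchanged.

```python
def switching_power_ratio(v_new: float, v_old: float) -> float:
    """Ratio of switching power after a supply change, all else held equal.

    Switching power is proportional to Vdd squared, so the ratio is (v_new / v_old)^2.
    """
    return (v_new / v_old) ** 2


for v_old, v_new in ((5.0, 3.3), (5.0, 2.5)):
    ratio = switching_power_ratio(v_new, v_old)
    print(f"{v_old} V -> {v_new} V: switching power falls to {ratio:.0%} of the original")
```

Halving the supply cuts switching power to a quarter, whereas halving the load capacitance or the switching activity only halves it.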
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: Normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], with marked supply voltages of 0.6, 3.0, and 5.0 V]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is decreasing the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product pt·CL is called the average switched capacitance per cycle; this capacitance can be reduced at the system, architectural, RTL, circuit, or technology level
Low Power and Low Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology. Higher levels offer higher impact and more options.
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
• Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention
• This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: A central PLL driven by fCLK feeds four PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), producing local clock frequencies f11, f12, f21, f31, f32, and f41]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation (phase detector, current pump, loop filter, VCO, and divider by N, with CLKREF and CLKFB inputs and up/down signals; clock distribution & frequency multiplier driving the digital system, with regulated voltage and logic) and DLL-based clock generation (phase detector, current pump, loop filter, control, and a VCDL with delay cells DC1 to DCn producing f1, 2f1, ..., nf1 for the digital system)]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: Clock gating: a latch on the enable signal and the clock produce the gated clock for the target flip-flops (DFF with D, C, Q), which are activated or deactivated. Clock distribution: a PLL clock generator distributes Clk to modules A, B, C, and D, with Enable_A, Enable_B, and Enable_C gating]
- The use of gated clocks is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
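The effect of a gated clock can be illustrated with a small behavioural model. This is a sketch under stated assumptions, not the circuit from the slide: it assumes a latch that is transparent while the clock is low, followed by an AND gate, and the sample waveforms and function names are made up for the example.

```python
def gated_clock(clock: list[int], enable: list[int]) -> list[int]:
    """Latch-based clock gating on sampled 0/1 waveforms.

    The enable is captured only while the clock is low, so the gated clock
    cannot glitch if the enable changes while the clock is high.
    """
    latched = 0
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:              # latch is transparent during the low phase
            latched = en
        out.append(clk & latched)  # AND gate produces the gated clock
    return out


def rising_edges(signal: list[int]) -> int:
    """Count 0 -> 1 transitions, i.e. the edges that would clock the flip-flops."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)


clock = [0, 1] * 6                              # six clock cycles
enable = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]   # module idle for two cycles
gated = gated_clock(clock, enable)

print("gated clock:", gated)
print("edges seen by the module:", rising_edges(gated), "of", rising_edges(clock))
```

The idle module sees no clock edges for the two cycles in which its enable is low, so its flip-flops and the clock wiring behind the gate do not switch during that time.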
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip is a less aggressive approach that is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure: Power Manager: an observer gathers workload information (observations) from the system, and a controller issues commands to it]
[Figure: Power State Machine: RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt), and SLEEP (P = 0.16 mW, wait for wake-up event), with transition times of about 10 µs, about 90 µs, and 160 ms]
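The benefit of moving between power states can be estimated with a simple energy tally. The sketch reads the state powers from the power state machine figure as 400 mW (RUN), 50 mW (IDLE), and 0.16 mW (SLEEP); the workload schedules are hypothetical, and transition time and transition energy are ignored, which is a simplification.

```python
STATE_POWER_W = {"RUN": 0.400, "IDLE": 0.050, "SLEEP": 0.00016}


def schedule_energy_joules(schedule: list[tuple[str, float]]) -> float:
    """Total energy of a list of (state, seconds) segments.

    Assumption: transitions between states are free and instantaneous.
    """
    return sum(STATE_POWER_W[state] * seconds for state, seconds in schedule)


always_on = [("RUN", 10.0)]
idle_when_free = [("RUN", 1.0), ("IDLE", 9.0)]
sleep_when_free = [("RUN", 1.0), ("SLEEP", 9.0)]

for name, schedule in (("always running", always_on),
                       ("idle when free", idle_when_free),
                       ("sleep when free", sleep_when_free)):
    print(f"{name}: {schedule_energy_joules(schedule):.3f} J")
```

A real power manager also has to weigh the wake-up delay: returning from the sleep state takes far longer than returning from idle, so sleeping only pays off when the expected inactive period is long enough.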
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: Two pie charts. Power breakdown: clock, memory, control, I/O, datapath. Dynamic instruction statistics: arithmetic op, control flow, data move, compare op, logical op, others]
Architecture Trade-offs: Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges: Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Microprocessors' Generations
• First Generation, 1971-78
  - Behind the power curve
• Second Generation, 1979-85
  - Becoming "real" computers
• Third Generation, 1985-89
  - Challenging the "establishment"
• Fourth Generation, 1990-
  - Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation, 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3,500 transistors
  - First microprocessor-based computer (Micral)
  - Targeted at laboratory instrumentation
  - Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
  - Delivered in 1974
  - 4,800 transistors
  - Performance < 0.2 MIPS
• Used in the Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
  - $297, or $395 with case
  - 256 bytes of memory, expandable to 64K
  - Keyboard and floppy
  - 100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
• Homebrew Computer Club
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981, IBM introduces the PC
  - Based on the 8088, an 8-bit bus version of the 8086
Second Generation, 1979-85
• Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture, initial 16-bit implementation
  - First flat 32-bit address; support for paging
  - General-purpose register architecture, loosely based on the PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation, 1985-89
• Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation, 1990-
• Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural TrendsKey Architectural Trends Increase performance at 16x per year
ndash True from 1985-present Combination of technology and architectural
enhancementsndash Technology provides faster transistors and more of themndash Faster transistors leads to high clock ratesndash More transistors
Architectural ideas turn transistors into performancendash Responsible for about half the yearly performance growth
Two key architectural directionsndash Sophisticated memory hierarchiesndash Exploiting instruction level parallelism
Memory HierarchiesMemory Hierarchies Caches hide latency of DRAM and increase BW
ndash CPU-DRAM access gap has grown by a factor of 30-50 Trend 1 Increasingly large caches
ndash On-chip from 128 bytes (1984) to 100K+ bytesndash Multilevel caches add another level of caching
First multilevel cache1986 Secondary cache sizes today 128KB to 4-16 MB
Trend 2 Advances in caching techniquesndash Reduce or hide cache miss latencies
early restart after cache miss (1992) nonblocking caches continue during a cache miss (1994)
ndash Cache aware combos computers compilers code writers prefetching instruction to bring data into cache early
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
  superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out-of-order
– speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
First 64-bit architecture
Integrated caches
– On-chip
– Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
Implemented in 1991
– 1.3M transistors
– 50 MIPS
Used primarily as attached processor (e.g. graphics)
MIPS R10000
First speculative processor
– Instructions scheduled and executed out-of-order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in-flight)
– Maintain precise state by completing instructions in order
Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
Itanium
– First implementation (2001)
– 25 M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software Techniques
– Static scheduling
– Static issue (i.e. VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e. superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
Wide variety of approaches, both hardware and compiler intensive
Software techniques: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware techniques: more stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two "mountains" of techniques]
ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation
Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1000 to 100000000) versus year, 1970-2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000; eras marked as bit-level parallelism, instruction-level and thread-level]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
Support for 2-4 threads, but expect to get only 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
4 – 8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures

Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8 x 4096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view concerning the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education – not just the undergraduate curriculum – organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change – so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
But, as we said earlier, engineering is changing
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT
Discrete mathematics, not continuous math, is the underpinning of IT
It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast
Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields
More knowledge of the full spectrum will be a fundamental
Engineering is global and is performed in a holistic business context
The engineer must design under constraints that include global, cultural and business contexts, and so must understand them
These too are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
An anecdote
In a recent database benchmark study using TPC-C, both 200 MHz Pentium Pro and 21164 Alpha systems were measured at 4.2 – 4.5 CPU cycles per instruction retired
In other words, three out of every four CPU cycles retired zero instructions; most were spent waiting for memory. Processor speed has seriously outstripped memory speed
Increasing the width of instruction issue and increasing the number of simultaneous instruction streams only makes the memory bottleneck worse
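The step from the measured cycles per instruction to "three out of every four cycles retired nothing" can be checked with a small sketch. It assumes the measured figures are 4.2 to 4.5 cycles per instruction and that each retiring cycle retires at least one instruction, so the result is a lower bound; the function name is invented for this illustration.

```python
def min_zero_retire_fraction(cpi: float) -> float:
    """Lower bound on the fraction of cycles that retire no instruction.

    With N instructions retired in N * cpi cycles, at most N cycles can
    retire something (fewer if several retire together), so at least
    1 - 1/cpi of all cycles retire nothing.
    """
    if cpi <= 0:
        raise ValueError("cpi must be positive")
    return max(0.0, 1.0 - 1.0 / cpi)


if __name__ == "__main__":
    for cpi in (4.2, 4.5):
        print(f"CPI {cpi}: at least {min_zero_retire_fraction(cpi):.0%} idle cycles")
```

Both measured values give a bound slightly above 75%, which matches the "three out of every four" statement.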
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width and two instruction streams
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy
It is not clear whether pin bandwidth can keep up – 32 bytes every 2 ns
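The bandwidth arithmetic of this anecdote can be reproduced in a few lines. The sketch below takes 1 GByte as 10^9 bytes and uses helper names chosen only for this example.

```python
def bandwidth_gbytes_per_s(bytes_per_transfer: float, period_ns: float) -> float:
    """Pin bandwidth in GBytes/s; bytes per nanosecond equals 10^9 bytes per second."""
    return bytes_per_transfer / period_ns


def scaled_bandwidth(base_gbytes_per_s: float, *factors: float) -> float:
    """Multiply a baseline bandwidth by independent scaling factors."""
    result = base_gbytes_per_s
    for factor in factors:
        result *= factor
    return result


if __name__ == "__main__":
    today = bandwidth_gbytes_per_s(16, 8)          # 16 bytes every 8 ns
    # twice the clock rate, twice the issue width, two instruction streams
    future = scaled_bandwidth(today, 2, 2, 2)
    print(today, future, bandwidth_gbytes_per_s(32, 2))
```

The three doubling factors multiply to 8, so 2 GBytes/s grows to 16 GBytes/s, which is the same as 32 bytes every 2 ns.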
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact the overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is ~ 32 – 128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to the main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
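The effect of the two cache levels on the average access time can be estimated with the usual average-memory-access-time formula. The sketch below plugs in the low end of the ranges quoted on these slides, reading the hit figures as 95% and 50%; the function name and the simple additive model (a miss at one level pays that level's access time before going on to the next) are assumptions made for illustration.

```python
def average_access_cycles(l1_cycles: float, l1_hit_rate: float,
                          l2_cycles: float, l2_hit_rate: float,
                          memory_cycles: float) -> float:
    """Average cycles per memory access for a two-level cache hierarchy."""
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * memory_cycles)


if __name__ == "__main__":
    with_l2 = average_access_cycles(2, 0.95, 6, 0.50, 100)
    # Same model without a useful second level: every L1 miss goes to memory.
    without_l2 = average_access_cycles(2, 0.95, 0, 0.0, 100)
    print(f"with L2: {with_l2:.1f} cycles, without L2: {without_l2:.1f} cycles")
```

With these numbers the average access costs about 4.8 cycles instead of 7, even though a single trip to main memory costs 100.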
As conclusion concerning memory - problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems – caches, buses, bandwidth and latency
As conclusion concerning memory – problems continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System-level parameters that most affect performance
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM) – a new type of memory
Magnetic RAM – MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
The 10 Gbit devices consist of 10 billion carbon nanotubes that are 1 nm – just a few thousand atoms – in diameter, on a silicon wafer
MRAM is nonvolatile; it has the fast read-and-write performance of static RAM (SRAM)
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM)
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule (nm, 0 to 120) and density (Gb, log scale 0.1 to 100) versus year (2003, 2005, 2010) for DRAM and FLASH; design rules of 100 nm, 90 nm, 70 nm and 55 nm; FLASH shown as SLC and as MLC with 2 bits/cell and 3 bits/cell]
Surpassing the Prediction from Moore's Law – DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year
[Figure: density (Gb, log scale 0.1 to 100) versus year, 2000-2010, for DRAM and FLASH; the FLASH curve shows a twofold density increase per year, with MLC at 2 bits/cell and 3 bits/cell, against the Moore's Law line]
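The difference between the two doubling periods compared on this slide (every 1.5 years versus every year) compounds quickly. A minimal sketch, with a hypothetical starting density of 1 Gb and a function name chosen for this example:

```python
def density_gb(start_gb: float, years: float, doubling_period_years: float) -> float:
    """Density after `years` of exponential growth with the given doubling period."""
    return start_gb * 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    start = 1.0  # hypothetical starting density in Gb
    for years in (3, 6):
        moore = density_gb(start, years, 1.5)   # doubling every 1.5 years
        yearly = density_gb(start, years, 1.0)  # doubling every year
        print(f"after {years} years: {moore:.0f} Gb vs {yearly:.0f} Gb")
```

After six years the yearly-doubling curve is four times ahead of the 1.5-year curve.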
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap - cont
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
Power consumption
• During 1995, the energy consumption of all PC machines installed in the USA was 60·10^6 MWh
• During 2000, the energy consumption of all PC machines installed in the USA was 10% of the total energy production
• During 2015, one expects that the energy consumption of all PC machines will be 15% greater than in 1995, or 69·10^6 MWh
Typical Low-Power Applications
• battery operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers" (Kuroda-Sakurai '95)
"By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced"
Power dissipation in time
[Figure: power (W, log scale 0.01 to 100) versus year, 1980-1995; trend line x4 / 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage (V, 0 to 2.5), power per chip (W, 0 to 200) and VDD current (A, 0 to 500) versus year, 1998-2014]
International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA), Electronic Industries Association of Japan (EIAJ), Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: Your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 – capacitive switching power (dynamic, dominant)
P2 – short circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
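As a rough numerical illustration of the terms above, the sketch below evaluates the average power for a made-up set of parameters and shows how the switching term reacts to a lower supply voltage. All parameter values and the function name are hypothetical; only the structure of the formula comes from the slide.

```python
def average_power(p_t: float, c_load: float, v_dd: float, f_clk: float,
                  i_sc: float = 0.0, i_leak: float = 0.0,
                  p_static: float = 0.0) -> float:
    """P_avg = p_t*C_L*Vdd^2*f_clk + I_sc*Vdd + I_leakage*Vdd + P4 (in watts)."""
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short circuit power
    p3 = i_leak * v_dd                      # leakage current power
    return p1 + p2 + p3 + p_static


if __name__ == "__main__":
    # Hypothetical values: activity 0.5, 1 nF switched capacitance, 100 MHz clock.
    full = average_power(0.5, 1e-9, 3.3, 100e6)
    half = average_power(0.5, 1e-9, 1.65, 100e6)
    print(f"3.3 V: {full:.4f} W, 1.65 V: {half:.4f} W, ratio {half / full:.2f}")
```

Halving the supply voltage cuts the switching term to one quarter, which is why the supply voltage is the most effective lever when the switching term dominates.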
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V), with 0.6, 3.0 and 5.0 V marked on the axis]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is
decreasing the activity of some parts within the VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
General Approaches to Reduce Power
a) Reduction in fCLK is an option acceptable when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way for power reduction, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product ptCL is called the average switched capacitance per cycle, and the main directions for reducing this capacitance are at the system, architectural, RTL, circuit or technology level
Low Power and Low Energy System Design
[Figure: design levels, with higher impact and more options towards the higher levels]
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
The design of low power circuits can be tackled at different levels, from system to technology
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that attracts more attention
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4) that generate the frequencies f11, f12, f21, f31, f32 and f41]
Energy Minimization Using Multiple Frequency
[Figure: PLL based. Phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO and divider by N; regulated voltage and clock distribution & frequency multiplier logic feed the digital system]
[Figure: DLL based. Phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter and a VCDL with delay cells DC1, DC2, ..., DCn under a common control, producing f1, 2f1, ..., nf1 for the digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating. A latch samples the enable signal and the resulting gated clock drives the target D flip-flops, shown in the activated and deactivated state]
[Figure: clock distribution. A PLL (clock generator) distributes Clk to blocks A, B, C and D; Enable_A, Enable_B and Enable_C gate the clocks of their blocks]
– The use of a gated clock is the most common approach to reduce energy. Unused modules are turned off by suppressing the clock to the module
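A behavioural sketch of latch-based clock gating is given below. It is a simplified model written for this note, not a description of any particular cell library: signals are sampled as lists of 0/1 levels, the latch is assumed transparent while the clock is low, and the function names are invented.

```python
def gate_clock(clock: list, enable: list) -> list:
    """Latch-based clock gating: gated = clock AND latched enable.

    The latch follows `enable` only while the clock is low, so a change of
    `enable` during the high phase cannot create a glitch on the gated clock.
    """
    latched = 0
    gated = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en          # latch transparent during the low phase
        gated.append(clk & latched)
    return gated


def rising_edges(signal: list) -> int:
    """Count 0 -> 1 transitions, assuming the signal starts from 0."""
    previous = [0] + signal[:-1]
    return sum(1 for before, now in zip(previous, signal) if before == 0 and now == 1)


if __name__ == "__main__":
    clock = [0, 1] * 4                      # four clock cycles
    enable = [1, 1, 0, 0, 0, 1, 1, 1]       # module idle in the middle
    gated = gate_clock(clock, enable)
    print(gated, rising_edges(clock), rising_edges(gated))
```

In this trace the module sees two clock edges instead of four, so the flip-flops behind the gate toggle only when the module is enabled.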
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints), while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
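The idea of keeping critical-path modules at the high voltage and lowering the rest can be illustrated with a toy energy estimate. Everything in this sketch is hypothetical: the module list, the two voltage levels, and the assumption that switching energy scales as C·V² with no cost for level converters.

```python
def switching_energy(capacitance: float, v_dd: float) -> float:
    """Relative switching energy of a module, proportional to C * Vdd^2."""
    return capacitance * v_dd ** 2


def relative_energy(modules: list, v_high: float, v_low: float) -> float:
    """Energy with dual supply voltages divided by energy with v_high everywhere.

    `modules` holds (name, capacitance, on_critical_path) tuples; critical
    modules keep v_high, all others are moved to v_low.
    """
    dual = sum(switching_energy(c, v_high if critical else v_low)
               for _, c, critical in modules)
    single = sum(switching_energy(c, v_high) for _, c, _ in modules)
    return dual / single


if __name__ == "__main__":
    modules = [("alu", 1.0, True), ("control", 1.0, False)]  # hypothetical design
    print(f"{relative_energy(modules, v_high=2.0, v_low=1.0):.3f}")
```

For this two-module example the dual-voltage design uses 62.5% of the single-voltage energy while the critical module still runs at the full voltage.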
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
[Figure: power manager. An OBSERVER collects workload information (observations) from the SYSTEM and a CONTROLLER issues commands to it]
[Figure: power state machine]
RUN: P = 400 mW
IDLE: P = 50 mW (wait for interrupt)
SLEEP: P = 0.16 mW (wait for wake-up event)
Transition times shown: ~10 µs, ~90 µs and 160 ms
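The benefit of moving a system between power states can be estimated by adding up power times residence time. The sketch below uses the three state powers from the power state machine figure, reading the sleep value as 0.16 mW; the residence times are hypothetical, and transition energy and latency are ignored.

```python
# State powers in milliwatts, as read from the power state machine figure.
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def energy_mj(seconds_in_state: dict) -> float:
    """Energy in millijoules for the given residence time (s) per state.

    Transition costs are ignored in this simplified model.
    """
    return sum(STATE_POWER_MW[state] * seconds
               for state, seconds in seconds_in_state.items())


if __name__ == "__main__":
    always_on = energy_mj({"RUN": 13.0})
    managed = energy_mj({"RUN": 1.0, "IDLE": 2.0, "SLEEP": 10.0})
    print(f"always running: {always_on:.1f} mJ, managed: {managed:.1f} mJ")
```

For this made-up 13-second workload the managed system uses about a tenth of the energy of one that stays in RUN throughout.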
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown by clock, memory, control, I/O and datapath; dynamic instruction statistics by arithmetic op, logical op, compare op, data move, control flow and others]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Microprocessors' Generations
• First Generation 1971-78
  – Behind the power curve
• Second Generation 1979-85
  – Becoming "real" computers
• Third Generation 1985-89
  – Challenging the "establishment"
• Fourth Generation 1990-
  – Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today we generally mean the shaded area of the figure
The First Generation 1971-78
Getting enough bits and transistors
Transistor counts < 50000
Performance < 0.5 MIPS
Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, 6502
Intel 4004
First general-purpose single-chip microprocessor
Shipped in 1971
8-bit architecture, 4-bit implementation
2300 transistors
Performance < 0.1 MIPS
8008: 8-bit implementation in 1972
– 3500 transistors
– First microprocessor-based computer (Micral)
  Targeted at laboratory instrumentation
  Mostly sold in Europe
Intel 8080
Intel's first 16-bit architecture
– Delivered in 1974
– 4800 transistors
– Performance < 0.2 MIPS
Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
  $297, or $395 with case
  256 bytes of memory, expandable to 64K
  Keyboard and floppy
  100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
  Homebrew Computer Club
Intel 8086
Introduced in 1978
– Performance < 0.5 MIPS
New 16-bit architecture
– "Assembly language" compatible with 8080
– 29000 transistors
– Includes memory protection, support for FP coprocessor
In 1981 IBM introduces PC
– Based on 8088: 8-bit bus version of 8086
Second Generation 1979-85
Becoming "real" computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs and PCs based on microprocessors
Transistors > 50000
Performance <= 1 MIPS
Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
Major architectural step in microprocessors
– First 32-bit architecture
  initial 16-bit implementation
– First flat 32-bit address
  Support for paging
– General-purpose register architecture
  Loosely based on PDP-11
First implementation in 1979
– 68000 transistors
– < 1 MIPS
Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the "establishment"
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
Transistors < 500K
Performance > 5 MIPS
Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
– 125000 transistors
– 5-8 MIPS
Fourth Generation 1990-Fourth Generation 1990- Architectural and performance leadership
ndash First 64-bit architecturendash First multiple-issue machinendash First multilevel caches
Transistors gt1M Clock ratesgt 100MHz Performance gt 50 MIPS Processors
ndash Intel i860 Pentium MIPS R4000 MIPS R1000 DEC Alpha Sun UltraSPARC HP PA-RISC PowerPC
Generation 45 ndash same basic approach but faster clock rates amp wider issuendash Alpha 21264 Pentium III amp 4 Intel Itanium
Key Architectural TrendsKey Architectural Trends Increase performance at 16x per year
ndash True from 1985-present Combination of technology and architectural
enhancementsndash Technology provides faster transistors and more of themndash Faster transistors leads to high clock ratesndash More transistors
Architectural ideas turn transistors into performancendash Responsible for about half the yearly performance growth
Two key architectural directionsndash Sophisticated memory hierarchiesndash Exploiting instruction level parallelism
Memory Hierarchies
Caches hide latency of DRAM and increase bandwidth
- CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
- On-chip: from 128 bytes (1984) to 100K+ bytes
- Multilevel caches add another level of caching
  First multilevel cache: 1986
  Secondary cache sizes today: 128 KB to 4-16 MB
Trend 2: Advances in caching techniques
- Reduce or hide cache miss latencies
  Early restart after cache miss (1992)
  Nonblocking caches: continue during a cache miss (1994)
- Cache-aware combos: computers, compilers, code writers
  Prefetching: instruction to bring data into cache early
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by:
- Overlapping execution in a pipeline
- Issuing multiple instructions per clock
  Superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
- determine parallelism dynamically
- execute instructions out-of-order
- speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded a 20-year path
MIPS R4000
First 64-bit architecture
Integrated caches
- On-chip
- Support for off-chip secondary cache
Integrated floating point
Implemented in 1991:
- Deep pipeline
- 1.4M transistors
- Initially 100 MHz
- > 50 MIPS
Intel i860
First multiple-issue microprocessor
- 2 instructions/clock
- Dual issue mode
- Novel push pipeline
- Novel cache bypass
Implemented in 1991:
- 1.3M transistors
- 50 MIPS
Used primarily as an attached processor (e.g., graphics)
MIPS R10000
First speculative processor
- Instructions scheduled and executed out-of-order
- Up to 4 instructions can complete per clock
- Window of 32 instructions (up to 32 in-flight)
- Maintain precise state by completing instructions in order
Implemented in 1996:
- 6.8M transistors
- 200 MHz
Intel IA-64 and Itanium
EPIC architecture
- Use compiler-centric approach while avoiding disadvantages
- Parallelism demarcated by the compiler
- Many special instructions & features for exploiting ILP in the compiler
Itanium
- First implementation (2001)
- 25M transistors
- 800 MHz
- 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software techniques:
- Static scheduling
- Static issue (i.e., VLIW)
- Static branch prediction
- Alias/pointer analysis
- Static speculation
Trade-offs: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware techniques:
- Dynamic scheduling
- Dynamic issue (i.e., superscalar)
- Dynamic branch prediction
- Dynamic disambiguation
- Dynamic speculation
Trade-offs: more stable performance, higher complexity, potential clock rate impact
Wide variety of approaches, both hardware and compiler intensive
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two "mountains" of techniques. Labels on the ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Labels on the Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view:
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year (1965-1995) for supercomputers, mainframes, minicomputers, and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistor count (log scale, 1,000 to 100,000,000) versus year (1970-2005), spanning the eras of bit-level, instruction-level, and thread-level parallelism. Data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000.]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyper-Threading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
They support 2-4 threads, but expect only about a 1.3x improvement in throughput.
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures

Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges: Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "where you stand depends on where you sit".
In this context, starting from our positions and experiences, this is our view concerning the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now: some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. These, too, are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
- Web-based teaching,
- distance learning,
- electronic books, and
- interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
An anecdote - continued
If a CPU chip today needs to move 2 GBytes/s (say, 16 bytes every 8 ns) across the pins to keep itself busy, imagine a chip in the foreseeable future with twice the clock rate, twice the issue width, and two instruction streams.
All these factors multiply together to require about 16 GBytes/s of pin bandwidth to keep this chip busy.
It is not clear whether pin bandwidth can keep up: 32 bytes every 2 ns.
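The arithmetic behind this anecdote can be checked with a short sketch. It assumes the figures mean 16 bytes every 8 ns today and three independent factors of two for the future chip; the function name is a hypothetical helper, not something from the talk.

```python
def pin_bandwidth_gbytes_per_s(bytes_per_transfer: float, period_ns: float) -> float:
    """Bytes per nanosecond is numerically equal to GBytes/s (10^9 bytes/s)."""
    return bytes_per_transfer / period_ns


# Today's chip: 16 bytes every 8 ns.
today = pin_bandwidth_gbytes_per_s(16, 8)

# Future chip: 2x clock rate, 2x issue width, 2 instruction streams.
scaling = 2 * 2 * 2
future = today * scaling

if __name__ == "__main__":
    print(f"Today:  {today} GBytes/s")
    print(f"Future: {future} GBytes/s")
    # Cross-check against the slide's own restatement: 32 bytes every 2 ns.
    print(f"32 bytes / 2 ns = {pin_bandwidth_gbytes_per_s(32, 2)} GBytes/s")
```

The three doublings multiply to a factor of eight, so the requirement grows from 2 to 16 GBytes/s, which matches 32 bytes every 2 ns.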
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance.
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components.
Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data.
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory.
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip.
The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses.
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles.
A cache miss that eventually has to go to main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
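To see how the two cache levels described above shape the average cost of a memory access, here is a minimal sketch using the standard average-memory-access-time formula. The specific values are assumptions picked from within the ranges on the slides (2-cycle first level with a 95% hit rate, 8-cycle second level catching 50% of first-level misses, 100-cycle main memory); the function names are illustrative.

```python
def amat_two_level(l1_time, l1_hit_rate, l2_time, l2_local_hit_rate, mem_time):
    """Average memory access time in cycles for a two-level cache hierarchy."""
    l1_miss = 1.0 - l1_hit_rate
    l2_miss = 1.0 - l2_local_hit_rate
    # Every access pays the L1 time; L1 misses pay the L2 time;
    # L2 misses additionally pay the main-memory time.
    return l1_time + l1_miss * (l2_time + l2_miss * mem_time)


def amat_one_level(l1_time, l1_hit_rate, mem_time):
    """Average memory access time in cycles with only a first-level cache."""
    return l1_time + (1.0 - l1_hit_rate) * mem_time


if __name__ == "__main__":
    with_l2 = amat_two_level(2, 0.95, 8, 0.50, 100)
    without_l2 = amat_one_level(2, 0.95, 100)
    print(f"L1 only: {without_l2:.1f} cycles per access")
    print(f"L1 + L2: {with_l2:.1f} cycles per access")
```

With these assumed numbers the second level cuts the average from about 7 cycles to about 4.9 cycles, even though it catches only half of the first-level misses.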
Conclusions concerning memory problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems: caches, buses, bandwidth, and latency.
Conclusions concerning memory problems - continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM): a new type of memory
Magnetic RAM (MRAM) or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
MRAM is nonvolatile; it has the fast read-and-write performance of static RAM (SRAM).
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter on a silicon wafer.
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] (0 to 120) and density [Gb] (log scale, 0.1 to 100) versus year (2003, 2005, 2010) for FLASH and DRAM. Design-rule labels: 100 nm, 90 nm, 70 nm, 55 nm. Density labels as extracted: 0512Mb, 1Gb, 2Gb, 4Gb, 8Gb, 16Gb, with SLC, MLC (2 bits/cell), and MLC (3 bits/cell) flash points.]
Surpassing the Prediction from Moore's Law: DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
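The gap between the two doubling periods compounds quickly. The sketch below compares them, assuming the Moore's Law figure is a doubling every 1.5 years and the flash growth model is a doubling every year; the function name and the six-year horizon are illustrative choices.

```python
def density_growth(years: float, doubling_period_years: float) -> float:
    """Density multiplier after `years` when density doubles every period."""
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    horizon = 6  # years, chosen only for illustration
    moore = density_growth(horizon, 1.5)
    flash = density_growth(horizon, 1.0)
    print(f"Moore's Law (1.5-year doubling): {moore:.0f}x after {horizon} years")
    print(f"NAND Flash (1-year doubling):    {flash:.0f}x after {horizon} years")
    print(f"Flash advantage: {flash / moore:.0f}x")
```

Over six years the yearly doubling yields four times the density that the 1.5-year doubling would, and the ratio keeps widening with time.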
[Figure: density [Gb] (log scale, 0.1 to 100) versus year (2000-2010) for FLASH (2-fold density increase per year) and DRAM, compared with the Moore's Law trend. Flash points include MLC (2 bits/cell) and MLC (3 bits/cell).]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore's Law.
Overall memory prediction roadmap - cont.
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
Power consumption
• During 1995, the energy consumption of all PC machines installed in the USA was 60·10^6 MWh
• During 2000, the energy consumption of all PC machines installed in the USA was 10% of the total energy production
• During 2015, one expects that the energy consumption of all PC machines will be 15% greater than in 1995, or 69·10^6 MWh
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai '95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: power (W, log scale, 0.01 to 100) versus year (80, 85, 90, 95); trend: x4 every 3 years.]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage [V] (0 to 2.5), power per chip [W], and VDD current [A] versus year (1998-2014); secondary axes run from 0 to 200 and from 0 to 500.]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA), and the Taiwan Semiconductor Industry Association (TSIA). (Taken from Sakurai's ISSCC 2001 presentation.)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
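The power components listed above can be evaluated numerically. The following is a minimal sketch, assuming the usual reading of the terms (switching power p_t·C_L·Vdd²·f_clk, short-circuit power I_sc·Vdd, leakage power I_leakage·Vdd, plus a small static term); the function name and all numeric values are hypothetical and chosen only to illustrate the quadratic dependence on supply voltage.

```python
def cmos_power(p_t, c_load, v_dd, f_clk, i_sc=0.0, i_leak=0.0, p_static=0.0):
    """Return the CMOS power components in watts as a dictionary.

    p_t      -- switching activity factor (probability of a transition)
    c_load   -- switched load capacitance in farads
    v_dd     -- supply voltage in volts
    f_clk    -- clock frequency in hertz
    i_sc     -- average short-circuit current in amperes
    i_leak   -- leakage current in amperes
    p_static -- other static dissipation in watts
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk  # capacitive switching power
    p2 = i_sc * v_dd                       # short-circuit power
    p3 = i_leak * v_dd                     # leakage power
    return {
        "switching": p1,
        "short_circuit": p2,
        "leakage": p3,
        "static": p_static,
        "total": p1 + p2 + p3 + p_static,
    }


if __name__ == "__main__":
    # Hypothetical operating point: 1 nF switched, 100 MHz, activity 0.5.
    full = cmos_power(0.5, 1e-9, 2.0, 1e8)
    half = cmos_power(0.5, 1e-9, 1.0, 1e8)
    print(f"Switching power at 2.0 V: {full['switching']:.3f} W")
    print(f"Switching power at 1.0 V: {half['switching']:.3f} W")
```

Halving the supply voltage in this example cuts the switching term by a factor of four, while the short-circuit and leakage terms would scale only linearly with the voltage.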
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
- supply voltage
- load capacitance
- switching activity
Reducing the supply voltage brings a quadratic improvement.
Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed.
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], with supply-voltage axis ticks at 0.6, 3.0, and 5.0.]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is: decreasing the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
General Approaches to Reduce Power
a) Reduction in fCLK is an option, acceptable when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product pt·CL is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit, or technology level.
Low Power and Low Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology; higher levels offer higher impact and more options:
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
A less aggressive approach, which attracts more attention.
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL generating fCLK feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), which produce the local clock frequencies f11, f12, f21, f31, f32, f41.]
Energy Minimization Using Multiple Frequency
[Figure, PLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO, and divider by N; regulated voltage, clock distribution & frequency multiplier, and logic of the digital system.]
[Figure, DLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, and a VCDL with control input and delay cells DC1, DC2, ..., DCn (in, out, delay TVCDL), supplying f1, 2f1, ..., nf1 to the digital system.]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure, clock gating: an enable signal, captured by a latch, gates the clock to produce the gated clock that drives the target flip-flops (DFF with D, C, Q).]
[Figure, clock distribution: a PLL (clock generator) drives Clk to blocks A, B, C, and D, which are activated or deactivated through Enable_A, Enable_B, and Enable_C.]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
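How suppressing the clock removes switching activity can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the circuit from the slides: it models a common latch-based clock gate (enable captured while the clock is low, then ANDed with the clock) on sampled 0/1 waveforms, and the function names are hypothetical.

```python
def gated_clock(clock, enable):
    """Behavioral model of a latch-based clock gate.

    The enable value is captured only while the clock is low, so the
    gated clock cannot change in the middle of a high phase.
    """
    latched = 0
    gated = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en          # latch is transparent while clock is low
        gated.append(clk & latched)
    return gated


def rising_edges(signal):
    """Count 0 -> 1 transitions, i.e. the clock edges a flip-flop would see."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)


if __name__ == "__main__":
    clock = [0, 1] * 8                 # eight clock cycles, two samples each
    enable = [1] * 6 + [0] * 10        # module is needed only in the first 3 cycles
    gated = gated_clock(clock, enable)
    print("Edges on free-running clock:", rising_edges(clock))
    print("Edges on gated clock:       ", rising_edges(gated))
```

In this example the gated module sees three clock edges instead of eight, so its flip-flops and clock wiring switch correspondingly less often while it is idle.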
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
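The saving from running noncritical modules at a lower voltage can be estimated with a short sketch. It assumes switching energy proportional to C·V² and ignores level-shifter and routing overheads; the module capacitances, voltages, and function names are hypothetical values chosen for illustration.

```python
def switching_energy(c_load: float, v_dd: float) -> float:
    """Energy per switching event, proportional to C * Vdd^2 (arbitrary units)."""
    return c_load * v_dd ** 2


def chip_energy(modules, v_high: float, v_low=None) -> float:
    """Total switching energy for a list of (capacitance, is_critical) modules.

    Critical modules always use v_high; noncritical modules use v_low
    when a second supply is available, otherwise v_high.
    """
    total = 0.0
    for c_load, is_critical in modules:
        v_dd = v_high if (is_critical or v_low is None) else v_low
        total += switching_energy(c_load, v_dd)
    return total


if __name__ == "__main__":
    # Hypothetical design: one critical module, three units of noncritical logic.
    modules = [(1.0, True), (3.0, False)]
    single = chip_energy(modules, v_high=2.0)
    dual = chip_energy(modules, v_high=2.0, v_low=1.0)
    print(f"Single supply: {single} units")
    print(f"Dual supply:   {dual} units")
```

In this example the dual-supply design uses 7 units instead of 16, and the critical path keeps its full voltage, so its timing is unaffected.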
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure, power manager: an observer and a controller; the observer receives observations from the system together with workload information, and the controller issues commands to the system.]
[Figure, power state machine: states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt), and SLEEP (P = 0.16 mW, wait for wake-up event); transition times labeled ~10 and ~90 (unit symbol lost in extraction) and 160 ms.]
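The benefit of moving between power states can be estimated with a simple energy model. The sketch below assumes the state powers read from the power state machine figure (RUN 400 mW, IDLE 50 mW, SLEEP 0.16 mW) and, as a simplification, ignores the time and energy spent in transitions; the schedule and function name are hypothetical.

```python
# State powers in milliwatts, as read from the power state machine figure.
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def energy_mj(schedule):
    """Energy in millijoules for a list of (state, seconds) intervals.

    Transition costs are ignored in this sketch (mW * s = mJ).
    """
    return sum(POWER_MW[state] * seconds for state, seconds in schedule)


if __name__ == "__main__":
    # Hypothetical 10-second workload with 2 seconds of real work.
    always_run = [("RUN", 10.0)]
    managed = [("RUN", 2.0), ("IDLE", 3.0), ("SLEEP", 5.0)]
    print(f"Always running: {energy_mj(always_run):.1f} mJ")
    print(f"Power managed:  {energy_mj(managed):.1f} mJ")
```

A real power manager also has to weigh the wake-up latency of each state, since entering a deep state pays off only when the idle period is long enough to amortize the transition.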
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown across clock, memory, control, I/O, and datapath (extracted values: 12, 16, 25, 43). Dynamic instruction statistics across arithmetic, logical, compare, data move, control flow, and other operations (extracted values: 1513, 51, 23, 43).]
Architecture Trade-offs: Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline: Microprocessors' Generations
• First Generation 1971-78: behind the power curve
• Second Generation 1979-85: becoming "real" computers
• Third Generation 1985-89: challenging the "establishment"
• Fourth Generation 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation 1971-78
Getting enough bits and transistors
Transistor counts: < 50,000. Performance: < 0.5 MIPS. Architecture: 8-16 bits
- Narrow datapaths (= slow performance)
- Awkward architectures
- Assembly language + some BASIC
Processors:
- Intel 4004, 8008, 8080, 8086
- Zilog Z-80
- Motorola 6800, MOS Technology 6502
Intel 4004
First general-purpose single-chip microprocessor
Shipped in 1971
8-bit architecture, 4-bit implementation
2300 transistors
Performance < 0.1 MIPS
8008: 8-bit implementation in 1972
- 3500 transistors
- First microprocessor-based computer (Micral)
  Targeted at laboratory instrumentation
  Mostly sold in Europe
Intel 8080
8-bit architecture with 16-bit addressing
- Delivered in 1974
- 4800 transistors
- Performance < 0.2 MIPS
Used in the Altair 8800 system
- Kit form (advertised in Popular Electronics) in 1975
  $297, or $395 with case
  256 bytes of memory, expandable to 64K
  Keyboard and floppy
  100-line bus becomes S-100, the first microcomputer bus
- Gates & Allen write BASIC
- Wozniak builds one (Homebrew Computer Club)
Intel 8086
Introduced in 1978
- Performance < 0.5 MIPS
New 16-bit architecture
- "Assembly language" compatible with 8080
- 29,000 transistors
- Includes memory protection, support for FP coprocessor
In 1981, IBM introduces the PC
- Based on the 8088, the 8-bit bus version of the 8086
Second Generation 1979-85
Becoming "real" computers
- First 32-bit architecture (68000)
- First virtual memory support
- Workstations, Macs, and PCs based on microprocessors
Transistors: > 50,000. Performance: <= 1 MIPS.
Processors:
- Motorola 68000, 68020
- Intel 80286, 80386
Motorola 68000
Major architectural step in microprocessors:
- First 32-bit architecture (initial 16-bit implementation)
- First flat 32-bit address; support for paging
- General-purpose register architecture, loosely based on PDP-11
First implementation in 1979:
- 68,000 transistors
- < 1 MIPS
Used in:
- Apple Mac
- Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the "establishment"
- Microprocessors surpass minicomputers in performance and rival mainframes
- Implementation technology of choice: all new architectures are microprocessors
- RISC architecture techniques take hold
Transistors: < 500K. Performance: > 5 MIPS.
Processors:
- MIPS R2000, R3000
- Sun SPARC
- HP PA-RISC
MIPS R2000MIPS R2000 Several firsts
ndash First RISC microprocessor
ndash First microprocessor to provide integrated support for instruction amp data cache
ndash First pipelined microprocessor (sustains 1 instructionclock)
Implemented in 1985ndash 125000 transistorsndash 5-8 MIPS
Fourth Generation 1990-Fourth Generation 1990- Architectural and performance leadership
ndash First 64-bit architecturendash First multiple-issue machinendash First multilevel caches
Transistors gt1M Clock ratesgt 100MHz Performance gt 50 MIPS Processors
ndash Intel i860 Pentium MIPS R4000 MIPS R1000 DEC Alpha Sun UltraSPARC HP PA-RISC PowerPC
Generation 45 ndash same basic approach but faster clock rates amp wider issuendash Alpha 21264 Pentium III amp 4 Intel Itanium
Key Architectural TrendsKey Architectural Trends Increase performance at 16x per year
ndash True from 1985-present Combination of technology and architectural
enhancementsndash Technology provides faster transistors and more of themndash Faster transistors leads to high clock ratesndash More transistors
Architectural ideas turn transistors into performancendash Responsible for about half the yearly performance growth
Two key architectural directionsndash Sophisticated memory hierarchiesndash Exploiting instruction level parallelism
Memory Hierarchies
Caches hide the latency of DRAM and increase bandwidth
– CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching
  – First multilevel cache: 1986
  – Secondary cache sizes today: 128 KB to 4-16 MB
Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
  – Early restart after cache miss (1992)
  – Nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
  – Prefetching: instruction to bring data into cache early
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
  – Superscalar: uses dynamic issue decision (HW driven)
  – VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
– Determine parallelism dynamically
– Execute instructions out-of-order
– Speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded 20 year path
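The notion of implicit parallelism among instructions can be illustrated with a small sketch. It is not taken from the slides: the instruction encoding (a destination register plus its source registers) and the function name `ilp_levels` are hypothetical. The sketch considers only true (read-after-write) dependences and assumes unit latency and unlimited issue width, so the figure it reports is an upper bound rather than what a real pipeline would achieve.

```python
def ilp_levels(instructions):
    """Return (levels, ilp) for a list of (dest, sources) tuples.

    Each instruction is placed in the earliest cycle after all of its
    producers have executed (read-after-write dependences only).
    """
    ready_at = {}   # register -> cycle in which its latest value is produced
    levels = []
    for dest, sources in instructions:
        cycle = 1 + max((ready_at.get(src, 0) for src in sources), default=0)
        levels.append(cycle)
        ready_at[dest] = cycle
    ilp = len(instructions) / max(levels) if levels else 0.0
    return levels, ilp


# Hypothetical four-instruction program (register names are illustrative).
program = [
    ("r1", ("r2", "r3")),   # r1 = r2 + r3
    ("r4", ("r5", "r6")),   # r4 = r5 * r6   (independent of the first)
    ("r7", ("r1", "r4")),   # r7 = r1 - r4   (needs both earlier results)
    ("r8", ("r9",)),        # r8 = r9 << 1   (independent)
]
levels, ilp = ilp_levels(program)
print(levels, ilp)
```

Three of the four instructions can issue in the first cycle, and only the third must wait, which is the kind of parallelism a superscalar (in hardware) or a VLIW compiler (in software) tries to find.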
MIPS R4000
First 64-bit architecture
Integrated caches
– On-chip
– Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
First multiple-issue microprocessor
– 2 instructions/clock
– Dual-issue mode
– Novel push pipeline
– Novel cache bypass
Implemented in 1991
– 1.3M transistors
– 50 MIPS
Used primarily as an attached processor (e.g., graphics)
MIPS R10000
First speculative processor
– Instructions scheduled and executed out-of-order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in-flight)
– Maintains precise state by completing instructions in order
Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software Techniques
– Static scheduling
– Static issue (i.e., VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
Trade-offs: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e., superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
Trade-offs: more stable performance, higher complexity, potential clock rate impact
Wide variety of approaches, both hardware- and compiler-intensive
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: "ILP Mountain" (simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation) and "Cache Mountain" (simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching)]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers, and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970-2005, across the eras of bit-level, instruction-level, and thread-level (?) parallelism; data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are
a) the simultaneous multithreaded (SMT) processor and
b) the chip multiprocessor (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4)
They support 2-4 threads, but expect only about a 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique for obtaining more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute
– independent programs
– explicitly parallel programs (when possible)
– speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8 x 4096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view on the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experience is primarily in information technology (both in academia and in industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we need now: some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT
Discrete mathematics, not continuous mathematics, is the underpinning of IT
It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast
Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields
Broader knowledge of the full spectrum will be fundamental
Engineering is global and is performed in a holistic business context
The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them
They, too, are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
– Web-based teaching
– distance learning
– electronic books and
– interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Memory system
In a 1 GHz microprocessor, accessing main memory can take about 100 cycles. Such an access may stall a pipelined microprocessor for many cycles and seriously impact overall performance
To reduce memory stalls at a reasonable cost, modern microprocessors take advantage of the locality of references in the program and use a hierarchy of memory components
Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip
The first level is about 32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
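To make the effect of these figures concrete, the following sketch computes the average number of cycles per memory access for a two-level hierarchy. It is an illustration rather than part of the original slides: the function name is hypothetical, and the specific values (2 cycles for L1, 8 cycles for L2, a 50% L2 hit rate, 100 cycles for main memory) are picked from within the ranges quoted above.

```python
def average_access_cycles(l1_cycles, l1_hit_rate, l2_cycles, l2_hit_rate, memory_cycles):
    """Average cycles per memory access for a two-level cache hierarchy."""
    l2_miss_penalty = (1.0 - l2_hit_rate) * memory_cycles   # paid only on an L2 miss
    l1_miss_penalty = l2_cycles + l2_miss_penalty           # paid only on an L1 miss
    return l1_cycles + (1.0 - l1_hit_rate) * l1_miss_penalty


# Values assumed from the ranges given in the text.
with_l2 = average_access_cycles(2, 0.95, 8, 0.50, 100)

# Same first-level cache, but every L1 miss goes straight to main memory.
without_l2 = average_access_cycles(2, 0.95, 0, 0.0, 100)

print(f"with L2: {with_l2:.1f} cycles, without L2: {without_l2:.1f} cycles")
```

Under these assumptions the second-level cache brings the average access from about 7 cycles down to about 4.9 cycles, even though it catches only half of the first-level misses.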
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles
A cache miss that eventually has to go to main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
Conclusions concerning memory problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
The real design action is in memory subsystems: caches, buses, bandwidth, and latency
Conclusions concerning memory problems (continued)
If the memory research community followed the microprocessor community's lead by leaning more heavily on architecture-level and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM): a new type of memory
Magnetic RAM – MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
The 10 Gbit devices consist of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter on a silicon wafer
MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM)
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM)
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule (nm, 0 to 120) and density (Gb, log scale 0.1 to 100) of DRAM and FLASH for 2003, 2005, and 2010; labeled design rules are 100 nm, 90 nm, 70 nm, and 55 nm; labeled densities run from 512 Mb to 16 Gb, with SLC, MLC (2 bits/cell), and MLC (3 bits/cell) flash]
Surpassing the Prediction from Moore's Law – DRAM vs MRAM
The famous Moore's Law predicts that memory density doubles every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year
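The gap between the two doubling periods compounds quickly. The short sketch below compares them; the function name and the starting density of 1 (an arbitrary unit) are assumptions made only for illustration.

```python
def density_after(years, doubling_period_years, start_density=1.0):
    """Density after `years` if it doubles every `doubling_period_years`."""
    return start_density * 2.0 ** (years / doubling_period_years)


for years in (3, 6):
    moore = density_after(years, 1.5)   # doubling every 1.5 years
    flash = density_after(years, 1.0)   # doubling every year
    print(f"after {years} years: x{moore:.0f} (1.5-year doubling) "
          f"vs x{flash:.0f} (1-year doubling)")
```

After six years the yearly doubling gives a 64-fold increase against a 16-fold increase for the 1.5-year doubling, a factor of four between the two curves.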
[Figure: density (Gb, log scale 0.1 to 100) versus year, 2000-2010, comparing the Moore's Law trend with DRAM and with FLASH at a 2-fold density increase per year; labeled points run from 512 Mb to 12 Gb, with MLC (2 bits/cell) and MLC (3 bits/cell) flash]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep leading overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap (continued)
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline - Low Power Design
• Power trends in VLSI
• View point on power
• Research efforts in low power design
• Is there an optimal design point?
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of total energy production
• During 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh
Power consumption
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature." So believed circuit designers (Kuroda-Sakurai 95)
"By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: power (W, log scale 0.01 to 100) versus year, 1980-1995; trend: x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage (V, 0 to 2.5), power per chip (W, up to 200), and VDD current (A, up to 500) versus year, 1998-2014]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA), and the Taiwan Semiconductor Industry Association (TSIA) (taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 – capacitive switching power (dynamic, dominant)
P2 – short-circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
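As a rough numerical illustration of these terms, the sketch below evaluates each component of the average power. The function name and every parameter value are hypothetical and chosen only to show the relative magnitudes; they are not measurements from the slides.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leakage, p_static=0.0):
    """Evaluate the components of average CMOS power, in watts.

    p_t       -- switching activity factor (probability of a transition)
    c_load    -- load capacitance in farads
    v_dd      -- supply voltage in volts
    f_clk     -- clock frequency in hertz
    i_sc      -- short-circuit current in amperes
    i_leakage -- leakage current in amperes
    p_static  -- other static dissipation in watts (the minor P4 term)
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leakage * v_dd                   # leakage power
    return {"P1": p1, "P2": p2, "P3": p3, "P4": p_static,
            "total": p1 + p2 + p3 + p_static}


# Hypothetical operating point: 1 nF switched load, 2.5 V, 100 MHz.
power = average_power(p_t=0.1, c_load=1e-9, v_dd=2.5, f_clk=100e6,
                      i_sc=5e-3, i_leakage=1e-4)
for name, watts in power.items():
    print(f"{name}: {watts * 1e3:.2f} mW")
```

With these assumed values the switching term P1 is several times larger than the short-circuit and leakage terms combined, consistent with its description as the dominant component.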
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V), with 0.6 V, 3.0 V, and 5.0 V marked on the voltage axis]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is to decrease the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
Each term of the switching power Psw = pt CL Vdd^2 fCLK offers an option:
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product ptCL is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit, or technology level
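The trade-off in option b) can be sketched numerically. The quadratic power relation comes from the expression above; the delay model, however, is an assumption that does not appear in the slides: the sketch uses the commonly cited form delay ∝ V / (V − Vt)^α, with an illustrative threshold voltage of 0.6 V and α = 1.5. Function names are hypothetical.

```python
def relative_switching_power(v_dd, v_ref):
    """Switching power at v_dd relative to v_ref (same p_t, C_L, f_CLK)."""
    return (v_dd / v_ref) ** 2


def relative_gate_delay(v_dd, v_ref, v_t=0.6, alpha=1.5):
    """Gate delay at v_dd relative to v_ref.

    Assumed model (not from the slides): delay ~ V / (V - v_t) ** alpha,
    with illustrative values for v_t and alpha.
    """
    def delay(v):
        return v / (v - v_t) ** alpha
    return delay(v_dd) / delay(v_ref)


# Lowering the supply from 5.0 V to 3.0 V.
power_ratio = relative_switching_power(3.0, 5.0)
delay_ratio = relative_gate_delay(3.0, 5.0)
print(f"power x{power_ratio:.2f}, delay x{delay_ratio:.2f}")
```

Under these assumptions the switching power drops to roughly a third while the gate delay grows by about half, which is why voltage reduction is usually paired with parallelism or pipelining to recover throughput.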
General Approaches to Reduce Power
Low Power and Low Energy System Design
(higher levels offer higher impact and more options)
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology
Multiple Frequencies on the Chip as a Technique to Reduce Power
This is a less aggressive approach that is attracting more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four local clock generators (PLL 1/DLL 1 to PLL 4/DLL 4), which produce the on-chip frequencies f11, f12, f21, f31, f32, and f41]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation (phase detector, current pump, loop filter, VCO, divider by N, regulated voltage, clock distribution & frequency multiplier, digital system) and DLL-based clock generation (phase detector, current pump, loop filter, control, VCDL with delay cells DC1 to DCn producing f1, 2f1, ..., nf1, digital system); signals: CLKREF, CLKFB, up, down]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating (a latch or DFF combines enable and clock into a gated clock that drives the target flip-flops) and clock distribution (a PLL clock generator drives modules A, B, C, and D, with Enable_A, Enable_B, and Enable_C signals marking modules as activated or deactivated)]
The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
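The saving from clock gating can be sketched at the behavioral level by counting how many clock cycles actually reach a module. This is only an illustration: the enable trace is invented, the function name is hypothetical, and the sketch assumes that the module's clock-related energy is proportional to the number of clock cycles it receives.

```python
def clock_cycles_delivered(enable_per_cycle, gated=True):
    """Count the clock cycles that reach a module.

    enable_per_cycle -- sequence of 0/1 values, one per clock cycle
    gated            -- if False, the clock always reaches the module
    """
    if not gated:
        return len(enable_per_cycle)
    return sum(1 for enabled in enable_per_cycle if enabled)


# Hypothetical activity: the module is needed in 3 of 10 cycles.
enable_trace = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]

ungated = clock_cycles_delivered(enable_trace, gated=False)
gated = clock_cycles_delivered(enable_trace, gated=True)
saving = 1.0 - gated / ungated
print(f"clock cycles: {ungated} ungated, {gated} gated ({saving:.0%} fewer)")
```

In hardware the enable signal is typically captured by a latch before it is combined with the clock, so that the gated clock stays free of glitches; the sketch ignores that detail and the small overhead of the gating logic itself.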
Energy Minimization Using Multiple Supply Voltages
• Using multiple supply voltages on the chip is a less aggressive approach that is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
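A simplified sketch of this idea follows. It treats every module as its own timing path, which real designs do not, and all of its numbers are hypothetical: the two voltage levels, the slowdown factor at the lower voltage, and the module delays and capacitances. Energy is taken as proportional to C·Vdd^2, in line with the switching power expression.

```python
V_HIGH = 3.3                 # hypothetical high supply voltage (V)
V_LOW = 2.5                  # hypothetical low supply voltage (V)
LOW_VOLTAGE_SLOWDOWN = 1.4   # assumed delay increase when running at V_LOW


def assign_voltages(modules, clock_period_ns):
    """Give each module V_LOW if it still meets timing, otherwise V_HIGH.

    modules -- dict: name -> (delay at V_HIGH in ns, switched capacitance)
    """
    assignment = {}
    for name, (delay_ns, _capacitance) in modules.items():
        fits = delay_ns * LOW_VOLTAGE_SLOWDOWN <= clock_period_ns
        assignment[name] = V_LOW if fits else V_HIGH
    return assignment


def relative_energy(modules, assignment):
    """Switching energy with the assignment, relative to all modules at V_HIGH."""
    baseline = sum(cap * V_HIGH ** 2 for _delay, cap in modules.values())
    mixed = sum(cap * assignment[name] ** 2
                for name, (_delay, cap) in modules.items())
    return mixed / baseline


# Hypothetical modules: (delay at V_HIGH in ns, relative switched capacitance).
modules = {"multiplier": (9.5, 4.0), "control": (4.0, 1.0), "io": (3.0, 1.0)}
assignment = assign_voltages(modules, clock_period_ns=10.0)
ratio = relative_energy(modules, assignment)
print(assignment, f"energy x{ratio:.2f}")
```

Here only the timing-critical multiplier stays at the high voltage, while the two modules with slack move to the low voltage and reduce the total switching energy.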
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
[Figure: power manager, in which an observer collects workload information (observations) from the system and a controller issues commands to it]
[Figure: power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW), and SLEEP (P = 0.16 mW); transitions are labeled "wait for interrupt" and "wait for wake-up event", with transition times of about 10 µs, about 90 µs, and 160 ms]
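The benefit of such a power state machine can be estimated with a small energy calculation. The state powers below are the ones shown in the figure; the schedules are invented for illustration, the function name is hypothetical, and the sketch ignores the time and energy spent in transitions between states.

```python
# State powers in milliwatts, as given for the power state machine.
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def energy_mj(schedule):
    """Energy in millijoules for a list of (state, seconds) pairs."""
    return sum(POWER_MW[state] * seconds for state, seconds in schedule)


# Hypothetical 10-second window with 2 seconds of useful work.
always_run = energy_mj([("RUN", 10.0)])
managed = energy_mj([("RUN", 2.0), ("IDLE", 3.0), ("SLEEP", 5.0)])

print(f"always running: {always_run:.1f} mJ, managed: {managed:.1f} mJ")
```

A real power manager also has to weigh the wake-up latency of the deeper states: entering SLEEP pays off only when the idle period is long compared with the time needed to return to RUN.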
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: pie chart of the power breakdown (clock, memory, control, I/O, datapath) and pie chart of dynamic instruction statistics (arithmetic op, logical op, compare op, data move, control flow, others)]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline: Microprocessors' Generations
• First Generation 1971-78
  – Behind the power curve
• Second Generation 1979-85
  – Becoming "real" computers
• Third Generation 1985-89
  – Challenging the "establishment"
• Fourth Generation 1990-
  – Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation 1971-78
Getting enough bits and transistors
Transistor counts: < 50,000
Performance: < 0.5 MIPS
Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
First general-purpose single-chip microprocessor
Shipped in 1971
8-bit architecture, 4-bit implementation
2,300 transistors
Performance: < 0.1 MIPS
8008: 8-bit implementation in 1972
– 3,500 transistors
– First microprocessor-based computer (Micral)
  – Targeted at laboratory instrumentation
  – Mostly sold in Europe
Intel 8080
8-bit architecture with 16-bit addressing
– Delivered in 1974
– 4,800 transistors
– Performance: < 0.2 MIPS
Used in the Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
  – $297, or $395 with case
  – 256 bytes of memory, expandable to 64K
  – Keyboard and floppy
  – 100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
– Homebrew Computer Club
Intel 8086
Introduced in 1978
– Performance: < 0.5 MIPS
New 16-bit architecture
– "Assembly language" compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
In 1981 IBM introduces the PC
– Based on the 8088, the 8-bit bus version of the 8086
Second Generation 1979-85
Becoming "real" computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs, and PCs based on microprocessors
Transistors: > 50,000
Performance: <= 1 MIPS
Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
Major architectural step in microprocessors
– First 32-bit architecture (initial 16-bit implementation)
– First flat 32-bit address space; support for paging
– General-purpose register architecture, loosely based on PDP-11
First implementation in 1979
– 68,000 transistors
– < 1 MIPS
Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89Third Generation 1985-89 Challenging the ldquoestablishmentrdquo
ndash Microprocessors surpass minicomputers in performance rival mainframes
ndash Implementation technology of choice all new architectures are microprocessors
ndash RISC architecture techniques take hold Transistors lt 500K Performance gt 5 MIPS Processors
ndash MIPS R2000 R3000ndash Sun SPARCndash HP PA-RISC
MIPS R2000MIPS R2000 Several firsts
ndash First RISC microprocessor
ndash First microprocessor to provide integrated support for instruction amp data cache
ndash First pipelined microprocessor (sustains 1 instructionclock)
Implemented in 1985ndash 125000 transistorsndash 5-8 MIPS
Fourth Generation 1990-Fourth Generation 1990- Architectural and performance leadership
ndash First 64-bit architecturendash First multiple-issue machinendash First multilevel caches
Transistors gt1M Clock ratesgt 100MHz Performance gt 50 MIPS Processors
ndash Intel i860 Pentium MIPS R4000 MIPS R1000 DEC Alpha Sun UltraSPARC HP PA-RISC PowerPC
Generation 45 ndash same basic approach but faster clock rates amp wider issuendash Alpha 21264 Pentium III amp 4 Intel Itanium
Key Architectural TrendsKey Architectural Trends Increase performance at 16x per year
ndash True from 1985-present Combination of technology and architectural
enhancementsndash Technology provides faster transistors and more of themndash Faster transistors leads to high clock ratesndash More transistors
Architectural ideas turn transistors into performancendash Responsible for about half the yearly performance growth
Two key architectural directionsndash Sophisticated memory hierarchiesndash Exploiting instruction level parallelism
Memory HierarchiesMemory Hierarchies Caches hide latency of DRAM and increase BW
ndash CPU-DRAM access gap has grown by a factor of 30-50 Trend 1 Increasingly large caches
ndash On-chip from 128 bytes (1984) to 100K+ bytesndash Multilevel caches add another level of caching
First multilevel cache1986 Secondary cache sizes today 128KB to 4-16 MB
Trend 2 Advances in caching techniquesndash Reduce or hide cache miss latencies
early restart after cache miss (1992) nonblocking caches continue during a cache miss (1994)
ndash Cache aware combos computers compilers code writers prefetching instruction to bring data into cache early
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
- Overlapping execution in a pipeline
- Issuing multiple instructions per clock
  superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
- determine parallelism dynamically
- execute instructions out of order
- speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded a 20-year path
MIPS R4000
First 64-bit architecture
Integrated caches
- On-chip
- Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
- Deep pipeline
- 1.4M transistors
- Initially 100 MHz
- > 50 MIPS
Intel i860
First multiple-issue microprocessor
- 2 instructions/clock
- Dual-issue mode
- Novel push pipeline
- Novel cache bypass
Implemented in 1991
- 1.3M transistors
- 50 MIPS
Used primarily as an attached processor (e.g., graphics)
MIPS R10000
First speculative processor
- Instructions scheduled and executed out of order
- Up to 4 instructions can complete per clock
- Window of 32 instructions (up to 32 in flight)
- Maintains precise state by completing instructions in order
Implemented in 1996
- 6.8M transistors
- 200 MHz
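The idea of executing out of order while completing in order can be sketched in a few lines. This is a simplified illustration, not the R10000's actual mechanism: the function name and the per-instruction finish times are hypothetical, and the 32-entry window limit is not modeled.

```python
def retire_in_order(finish_cycle, retire_width=4):
    """Return the cycle in which each instruction retires.

    finish_cycle[i] is the cycle in which instruction i finishes executing
    (possibly out of program order). Instructions retire strictly in program
    order, at most `retire_width` per cycle, so architectural state stays precise.
    """
    retire_cycle = [None] * len(finish_cycle)
    head = 0   # oldest instruction that has not retired yet
    cycle = 0
    while head < len(finish_cycle):
        retired = 0
        # Retire from the head only; a younger finished instruction must wait.
        while (head < len(finish_cycle) and retired < retire_width
               and finish_cycle[head] <= cycle):
            retire_cycle[head] = cycle
            head += 1
            retired += 1
        cycle += 1
    return retire_cycle


if __name__ == "__main__":
    # Instruction 0 is slow; instructions 1-3 finish early but wait behind it.
    print(retire_in_order([3, 1, 2, 1]))
```

Because younger instructions wait behind the oldest unfinished one, an exception raised by instruction 0 can be handled before any later instruction has changed architectural state.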
Intel IA-64 and Itanium
EPIC architecture
- Uses a compiler-centric approach while avoiding its disadvantages
- Parallelism demarcated by the compiler
- Many special instructions & features for exploiting ILP in the compiler
Itanium
- First implementation (2001)
- 25M transistors
- 800 MHz
- 130 W
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software techniques
- Static scheduling
- Static issue (i.e., VLIW)
- Static branch prediction
- Alias/pointer analysis
- Static speculation
Hardware techniques
- Dynamic scheduling
- Dynamic issue (i.e., superscalar)
- Dynamic branch prediction
- Dynamic disambiguation
- Dynamic speculation
Wide variety of approaches, both hardware- and compiler-intensive
- Compiler-intensive: lower hardware complexity, longer-range analysis, more machine dependence
- Hardware-intensive: more stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers, and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970-2005, with eras labelled bit-level, instruction-level, and thread-level parallelism. Data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000.]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyper-Threading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
They support 2-4 threads, but expect only about a 1.3x improvement in throughput.
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa-2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures

Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view concerning the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now: some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. These, too, are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
- Web-based teaching,
- distance learning,
- electronic books, and
- interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Expensive Memory Called a Cache
A small, fast, and expensive (in $/bit) memory called a cache is located on-die and holds frequently used data.
A somewhat bigger, but slower and cheaper, cache may be located between the microprocessor and the system bus, which connects the microprocessor to the main memory.
Two Levels of Caches
Most advanced microprocessors today employ two levels of caches on chip.
The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses.
The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level.
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles.
A cache miss that eventually has to go to main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
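To make the impact of the hierarchy concrete, here is a minimal sketch of the usual average-access-time calculation for a two-level cache. The function name is hypothetical, and the specific values (2-cycle L1, 8-cycle L2, 100-cycle memory, 95% and 50% hit rates) are picked from within the ranges quoted above purely for illustration.

```python
def average_access_time(l1_time, l1_hit_rate, l2_time, l2_hit_rate, memory_time):
    """Average memory access time in cycles for a two-level cache hierarchy.

    Every access pays the L1 time; L1 misses additionally pay the L2 time;
    L2 misses additionally pay the main-memory time.
    """
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    return l1_time + l1_miss_rate * (l2_time + l2_miss_rate * memory_time)


if __name__ == "__main__":
    # Illustrative values chosen from the ranges given on the slides.
    cycles = average_access_time(l1_time=2, l1_hit_rate=0.95,
                                 l2_time=8, l2_hit_rate=0.50,
                                 memory_time=100)
    print(f"average access time: {cycles:.1f} cycles")
```

With these assumed numbers the average access costs about 5 cycles rather than about 100, which is why cache organization dominates the performance discussion.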
In conclusion, concerning memory problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems: caches, buses, bandwidth, and latency.
In conclusion, concerning memory problems - continued
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change.
b) Burst width, which refers to data access granularity, can effect a 15% performance change.
c) Magnetic RAM (MRAM): a new type of memory.
Magnetic RAM (MRAM) or Nanotech RAM (NRAM)
Based on nanoscale semiconductor technology.
A nanotechnology RAM device consists of tiny carbon nanotubes.
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter on a silicon wafer.
MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM).
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] (100 nm, 90 nm, 70 nm, 55 nm; axis 0 to 120) and density [Gb] (log scale, 0.1 to 100) for 2003, 2005, and 2010. Curves for DRAM and FLASH (SLC, MLC 2 bits/cell, MLC 3 bits/cell), with labelled densities from 512 Mb up to 16 Gb.]
Surpassing the Prediction from Moore's Law - DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
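The difference between the two doubling periods compounds quickly. The sketch below compares them over a ten-year span; the function name and the ten-year span are assumptions made for illustration only.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Density multiplier after `years` when density doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    span = 10  # hypothetical comparison span in years
    moore = growth_factor(span, 1.5)  # doubling every 1.5 years
    flash = growth_factor(span, 1.0)  # doubling every year
    print(f"1.5-year doubling over {span} years: {moore:.0f}x")
    print(f"1-year doubling over {span} years:   {flash:.0f}x")
    print(f"ratio: {flash / moore:.1f}x")
```

Over a decade, yearly doubling yields about ten times the density of the 1.5-year trend, which is the gap the slide title refers to.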
[Figure: density [Gb] (log scale, 0.1 to 100) versus year, 2000-2010, comparing DRAM, FLASH (MLC 2 bits/cell and MLC 3 bits/cell, two-fold density growth per year), and the Moore's Law trend.]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will continue to lead overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore's Law.
Overall memory prediction roadmap - cont.
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: mainframe, supercomputer]
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point?
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh.
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production.
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh.
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai '95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: power (W), log scale from 0.01 to 100, versus year, 1980-1995; trend of x4 every 3 years.]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage [V] (axis 0 to 2.5), power per chip [W], and VDD current [A] versus year, 1998-2014; the two secondary axes run from 0 to 200 and from 0 to 500.]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA), Electronic Industries Association of Japan (EIAJ), Korea Semiconductor Industry Association (KSIA), and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are

$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$

where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
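The first three terms of the power equation can be evaluated directly. The sketch below is illustrative only: the class name, field names, and all numeric values are hypothetical and chosen for round numbers, and the minor static term P4 is left out.

```python
from dataclasses import dataclass


@dataclass
class CmosPowerModel:
    """First-order CMOS power model; all parameter values are supplied by the caller."""
    activity: float          # p_t, switching probability per clock cycle
    load_capacitance: float  # C_L in farads
    vdd: float               # supply voltage in volts
    f_clk: float             # clock frequency in hertz
    i_sc: float              # average short-circuit current in amperes
    i_leakage: float         # leakage current in amperes

    def switching_power(self) -> float:
        """P1 = p_t * C_L * Vdd^2 * f_clk."""
        return self.activity * self.load_capacitance * self.vdd ** 2 * self.f_clk

    def short_circuit_power(self) -> float:
        """P2 = I_sc * Vdd."""
        return self.i_sc * self.vdd

    def leakage_power(self) -> float:
        """P3 = I_leakage * Vdd."""
        return self.i_leakage * self.vdd

    def average_power(self) -> float:
        """P1 + P2 + P3 (the minor static term P4 is not modeled here)."""
        return self.switching_power() + self.short_circuit_power() + self.leakage_power()


if __name__ == "__main__":
    # Hypothetical example values, not taken from the slides.
    chip = CmosPowerModel(activity=0.1, load_capacitance=1e-9, vdd=2.5,
                          f_clk=100e6, i_sc=1e-3, i_leakage=1e-4)
    print(f"P1 = {chip.switching_power():.4f} W")
    print(f"P2 = {chip.short_circuit_power():.4f} W")
    print(f"P3 = {chip.leakage_power():.5f} W")
    print(f"total = {chip.average_power():.5f} W")
```

With these assumed values the switching term dominates, consistent with the "dominant" label on P1.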
Research Efforts in Low-Power Design

$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$

Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement.
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed.
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], over roughly 0.6 V to 5.0 V.]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is:
Decreasing the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product pt·CL is called the average switched capacitance per cycle, and the main directions for reducing this capacitance are at the system, architectural, RTL, circuit, or technology level.
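The trade-off in point b) can be sketched numerically. The quadratic power relation comes from the switching-power formula; the delay model below is the commonly used alpha-power approximation, delay ∝ Vdd / (Vdd - Vt)^alpha, which is an assumption introduced here and not something stated on the slides. The threshold voltage and alpha values are likewise hypothetical.

```python
def relative_switching_power(vdd: float, vdd_ref: float) -> float:
    """Switching power at `vdd` relative to `vdd_ref` (power scales with Vdd squared)."""
    return (vdd / vdd_ref) ** 2


def relative_gate_delay(vdd: float, vdd_ref: float,
                        v_threshold: float = 0.6, alpha: float = 1.5) -> float:
    """Gate delay at `vdd` relative to `vdd_ref` under an assumed alpha-power model."""
    def delay(v: float) -> float:
        if v <= v_threshold:
            raise ValueError("supply voltage must exceed the threshold voltage")
        return v / (v - v_threshold) ** alpha

    return delay(vdd) / delay(vdd_ref)


if __name__ == "__main__":
    reference = 5.0  # hypothetical reference supply voltage in volts
    for vdd in (5.0, 3.3, 2.5):
        power = relative_switching_power(vdd, reference)
        delay = relative_gate_delay(vdd, reference)
        print(f"Vdd = {vdd:.1f} V: power x{power:.2f}, delay x{delay:.2f}")
```

Under these assumptions, halving the supply voltage cuts switching power to a quarter while the gate delay grows, which is the tension that parallelism and pipelining are meant to relieve.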
Low Power and Low Energy System Design
Design levels, from higher impact and more options to lower:
- System level: design partitioning, power down
- Algorithm level: complexity, concurrency, locality, regularity, data representation
- Architecture level: voltage scaling, parallelism, instruction set, signal correlations
- Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
- Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology.
Multiple Frequency on the Chip as a Technique to Reduce Power
This less aggressive approach is attracting more attention.
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a master PLL derives fCLK and feeds four PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), each generating local clock frequencies (f11, f12; f21; f31, f32; f41).]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation (phase detector, current pump, loop filter, VCO, divider by N; CLKREF and CLKFB inputs; up/down signals) and DLL-based clock generation (phase detector, current pump, loop filter, control, VCDL with delay cells DC1 to DCn; outputs f1, 2f1, ..., nf1). In both cases a clock distribution & frequency multiplier block feeds the logic of a digital system running at a regulated voltage.]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating, where a latch and an enable signal produce a gated clock for the target flip-flops (activated or deactivated); clock distribution, where a PLL clock generator drives Clk to blocks A, B, C, and D under Enable_A, Enable_B, and Enable_C.]
- The use of gated clocks is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
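The effect of gating can be illustrated with a small behavioral simulation. This is only a sketch of a latch-based gate under simplifying assumptions: signals are sampled lists of 0/1 values, the function names are hypothetical, and the number of rising clock edges delivered to a module is used as a stand-in for its clock-related switching energy.

```python
def latch_based_gated_clock(clock, enable):
    """Return the gated clock for sampled `clock` and `enable` signals (lists of 0/1).

    The enable is captured by a latch that is transparent only while the clock
    is low, so changes of `enable` during the high phase cannot create glitches.
    """
    gated = []
    latched_enable = 0
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched_enable = en  # latch follows enable during the low phase
        gated.append(clk & latched_enable)
    return gated


def count_rising_edges(signal):
    """Number of 0 -> 1 transitions in a sampled signal."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)


if __name__ == "__main__":
    clock = [0, 1] * 8             # eight clock cycles
    enable = [1] * 8 + [0] * 8     # module is needed only for the first half
    gated = latch_based_gated_clock(clock, enable)
    print("edges without gating:", count_rising_edges(clock))
    print("edges with gating:   ", count_rising_edges(gated))
```

In this example the idle module receives half as many clock edges, so its flip-flops and clock wiring toggle correspondingly less often.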
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip, as a less aggressive approach, is attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure: power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt), and SLEEP (P = 0.16 mW, wait for wake-up event); transitions take ~10 µs between RUN and IDLE, ~90 µs into SLEEP, and 160 ms from SLEEP back to RUN.]
[Figure: power manager consisting of an observer and a controller; the observer collects observations and workload information from the system, and the controller issues commands to it.]
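A power manager's basic decision, whether an idle period is long enough to justify sleeping, can be sketched from the state powers in the figure. The code below is an illustrative simplification: the function names are hypothetical, the microsecond-scale transitions are ignored, and it assumes (as a labeled assumption, not a fact from the slides) that the 160 ms wake-up is spent at RUN-level power.

```python
# State powers in milliwatts, as read from the power state machine figure.
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}
SLEEP_WAKEUP_S = 0.160  # time to return from SLEEP to RUN, in seconds


def energy_if_idle_mj(idle_s: float) -> float:
    """Energy (mJ) spent waiting in IDLE for `idle_s` seconds."""
    return POWER_MW["IDLE"] * idle_s


def energy_if_sleep_mj(idle_s: float) -> float:
    """Energy (mJ) spent in SLEEP for `idle_s` seconds plus the wake-up cost.

    Assumption: the wake-up transition draws RUN-level power for its whole duration.
    """
    return POWER_MW["SLEEP"] * idle_s + POWER_MW["RUN"] * SLEEP_WAKEUP_S


def break_even_s() -> float:
    """Idle duration above which sleeping uses less energy than idling."""
    return POWER_MW["RUN"] * SLEEP_WAKEUP_S / (POWER_MW["IDLE"] - POWER_MW["SLEEP"])


def choose_state(predicted_idle_s: float) -> str:
    """Pick the lower-energy state for a predicted idle period."""
    return "SLEEP" if predicted_idle_s > break_even_s() else "IDLE"


if __name__ == "__main__":
    print(f"break-even idle time: {break_even_s():.2f} s")
    for idle in (0.5, 10.0):
        print(f"predicted idle {idle} s -> {choose_state(idle)}")
```

In practice the controller would also have to weigh the wake-up latency against performance requirements, which is why the observer's workload prediction matters.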
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown across clock, memory, control, I/O, and datapath. Dynamic instruction statistics across arithmetic, data move, control flow, compare, logical, and other operations.]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline: Microprocessors' Generations
• First generation 1971-78
  - Behind the power curve
• Second Generation 1979-85
  - Becoming "real" computers
• Third Generation 1985-89
  - Challenging the "establishment"
• Fourth Generation 1990-
  - Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation 1971-78
Getting enough bits and transistors
Transistor counts < 50,000
Performance < 0.5 MIPS
Architecture: 8-16 bits
- Narrow datapaths (= slow performance)
- Awkward architectures
- Assembly language + some BASIC
Processors
- Intel 4004, 8008, 8080, 8086
- Zilog Z-80
- Motorola 6800, MOS Technology 6502
Intel 4004
First general-purpose single-chip microprocessor
Shipped in 1971
8-bit architecture, 4-bit implementation
2,300 transistors
Performance < 0.1 MIPS
8008: 8-bit implementation in 1972
- 3,500 transistors
- First microprocessor-based computer (Micral)
  Targeted at laboratory instrumentation
  Mostly sold in Europe
Intel 8080
Intel's first 16-bit architecture
- Delivered in 1974
- 4,800 transistors
- Performance < 0.2 MIPS
Used in the Altair 8800 system
- Kit form (advertised in Popular Electronics) in 1975
  $297, or $395 with case
  256 bytes of memory, expandable to 64K
  Keyboard and floppy
  100-line bus becomes S-100, the first microcomputer bus
- Gates & Allen write BASIC
- Wozniak builds one
  Homebrew Computer Club
Intel 8086
Introduced in 1978
- Performance < 0.5 MIPS
New 16-bit architecture
- "Assembly language" compatible with 8080
- 29,000 transistors
- Includes memory protection, support for FP coprocessor
In 1981, IBM introduces the PC
- Based on the 8088, an 8-bit bus version of the 8086
Second Generation 1979-85
Becoming "real" computers
- First 32-bit architecture (68000)
- First virtual memory support
- Workstations, Macs, and PCs based on microprocessors
Transistors > 50,000
Performance <= 1 MIPS
Processors
- Motorola 68000, 68020
- Intel 80286, 80386
Motorola 68000
Major architectural step in microprocessors
- First 32-bit architecture
  initial 16-bit implementation
- First flat 32-bit address
  Support for paging
- General-purpose register architecture
  Loosely based on PDP-11
First implementation in 1979
- 68,000 transistors
- < 1 MIPS
Used in
- Apple Mac
- Sun, Silicon Graphics, & Apollo workstations
Third Generation 1985-89
Challenging the "establishment"
- Microprocessors surpass minicomputers in performance, rival mainframes
- Implementation technology of choice: all new architectures are microprocessors
- RISC architecture techniques take hold
Transistors < 500K
Performance > 5 MIPS
Processors
- MIPS R2000, R3000
- Sun SPARC
- HP PA-RISC
MIPS R2000
Several firsts
- First RISC microprocessor
- First microprocessor to provide integrated support for instruction & data cache
- First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
- 125,000 transistors
- 5-8 MIPS
Fourth Generation 1990-Fourth Generation 1990- Architectural and performance leadership
ndash First 64-bit architecturendash First multiple-issue machinendash First multilevel caches
Transistors gt1M Clock ratesgt 100MHz Performance gt 50 MIPS Processors
ndash Intel i860 Pentium MIPS R4000 MIPS R1000 DEC Alpha Sun UltraSPARC HP PA-RISC PowerPC
Generation 45 ndash same basic approach but faster clock rates amp wider issuendash Alpha 21264 Pentium III amp 4 Intel Itanium
Key Architectural TrendsKey Architectural Trends Increase performance at 16x per year
ndash True from 1985-present Combination of technology and architectural
enhancementsndash Technology provides faster transistors and more of themndash Faster transistors leads to high clock ratesndash More transistors
Architectural ideas turn transistors into performancendash Responsible for about half the yearly performance growth
Two key architectural directionsndash Sophisticated memory hierarchiesndash Exploiting instruction level parallelism
Memory HierarchiesMemory Hierarchies Caches hide latency of DRAM and increase BW
ndash CPU-DRAM access gap has grown by a factor of 30-50 Trend 1 Increasingly large caches
ndash On-chip from 128 bytes (1984) to 100K+ bytesndash Multilevel caches add another level of caching
First multilevel cache1986 Secondary cache sizes today 128KB to 4-16 MB
Trend 2 Advances in caching techniquesndash Reduce or hide cache miss latencies
early restart after cache miss (1992) nonblocking caches continue during a cache miss (1994)
ndash Cache aware combos computers compilers code writers prefetching instruction to bring data into cache early
Exploiting ILPExploiting ILP ILP is the implicit parallelism among instructions Exploited by
ndash Overlapping execution in a pipeline ndash Issuing multiple instruction per clock
superscalar uses dynamic issue decision (HW driven) VLIW uses static issue decision (SW driven)
1985 simple microprocessor pipeline (1 instrclock) 1990 first static multiple issue microprocessors 1995 sophisticated dynamic schemes
ndash determine parallelism dynamicallyndash execute instructions out-of-orderndash speculative execution depending on branch prediction
ldquoOff-the-shelfrdquo ILP techniques yielded 20 year path
MIPS R4000MIPS R4000 First 64-bit architecture Integrated caches
ndash On-chipndash Support for off-chip secondary
cache Integrated floating point Implemented in 1991
ndash Deep pipelinendash 14M transistorsndash Initially 100MHzndash gt 50 MIPS
Intel i860
• First multiple-issue microprocessor
  – 2 instructions/clock
  – Dual issue mode
  – Novel push pipeline
  – Novel cache bypass
• Implemented in 1991
  – 1.3M transistors
  – 50 MIPS
• Used primarily as an attached processor (e.g., graphics)
MIPS R10000
• First speculative processor
  – Instructions scheduled and executed out of order
  – Up to 4 instructions can complete per clock
  – Window of 32 instructions (up to 32 in flight)
  – Maintains precise state by completing instructions in order
• Implemented in 1996
  – 6.8M transistors
  – 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  – Uses a compiler-centric approach while avoiding its disadvantages
  – Parallelism demarcated by the compiler
  – Many special instructions & features for exploiting ILP in the compiler
• Itanium
  – First implementation (2001)
  – 25M transistors
  – 800 MHz
  – 130 W
Breakdown of tasks between compiler and runtime hardware

Today's Uniprocessor ILP Menu
• Software techniques
  – Static scheduling
  – Static issue (i.e., VLIW)
  – Static branch prediction
  – Alias/pointer analysis
  – Static speculation
  Lower hardware complexity; more longer-range analysis; more machine dependence
• Hardware techniques
  – Dynamic scheduling
  – Dynamic issue (i.e., superscalar)
  – Dynamic branch prediction
  – Dynamic disambiguation
  – Dynamic speculation
  More stable performance; higher complexity; potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers, and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistor count (log scale, 1,000 to 100,000,000) versus year, 1970-2005, spanning eras of bit-level, instruction-level, and thread-level (?) parallelism. Data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000.]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to deliver higher throughput at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
• A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
• It is hard to achieve this level of parallelism from a single program
• Can we run multiple programs (threads) on a single processor without much effort?
• Simultaneous multithreading (SMT), or Hyper-Threading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
• Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4)
• They support 2-4 threads, but expect only about a 1.3X improvement in throughput
Chip Multiprocessor
• Several processor cores on one die
• Shared L2 caches
• Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
• CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner
• The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC)
• The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model (continued)
• CMP is an attractive option when moving to a new process technology such as SoC
• Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
• MPSoCs are usually implemented as heterogeneous systems
• CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
• 4-8 general-purpose processing engines on chip, used to execute
  – Independent programs
  – Explicitly parallel programs (when possible)
  – Speculatively parallel threads
• Special-purpose processing units (e.g., DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures

| Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor |
|---|---|---|---|
| Number of CPUs | 1 | 1 | 8 |
| CPU issue width | 12 | 12 | 2 per CPU |
| Number of threads | 1 | 8 | 1 per CPU |
| Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU |
| Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU |
| Instruction window size | 256 | 256 | 32 per CPU |
| Branch predictor table size (entries) | 32768 | 32768 | 8 x 4096 |
| Return stack size | 64 entries | 64 entries | 8 x 8 entries |
| Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank |
| I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU |
| I and D cache associativities | 4-way | 4-way | 4-way |
| I and D cache line sizes (bytes) | 32 | 32 | 32 |
| I and D cache access times (cycles) | 2 | 2 | 1 |
| Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks |
| Secondary cache size (Mbytes) | 8 | 8 | 8 |
| Secondary cache associativity | 4-way | 4-way | 4-way |
| Secondary cache line size (bytes) | 32 | 32 | 32 |
| Secondary cache access time (cycles) | 5 | 5 | 7 |
| Secondary cache occupancy per access (cycles) | 1 | 1 | 1 |
| Memory organization (no. of banks) | 4 | 4 | 4 |
| Memory access time (cycles) | 50 | 50 | 50 |
| Memory occupancy per access (cycles) | 13 | 13 | 13 |
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
• It has often been said that "Where you stand depends on where you sit"
• In this context, starting from our positions and experiences, this is our view on the question: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
• The engineers we are training today will still be practicing 40 years from now
• Are we preparing them for what they will be doing then?
• Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
• We think not, on both counts
Our view & our experience
• Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
• Our experience is primarily in information technology (in both academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
• It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications
• But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
• The undergraduate curriculum should teach (only) fundamentals
• Everyone agrees with that
• But what are fundamentals?
• Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
• But, as we said earlier, engineering is changing
What kinds of fundamentals we need now: some examples
• Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental
• Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
• Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental
• Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. These too are new fundamentals
How to add these new fundamentals
• The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
• We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn
A sort of challenge we should accept
• During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time
• That is the sort of challenge we should accept for improving engineering education
Two Levels of Caches
• Most advanced microprocessors today employ two levels of caches on chip
• The first level is ~32-128 kB; it takes two to three cycles to access and typically catches about 95% of all accesses
• The second level is 256 kB to over 1 MB; it typically takes six to ten cycles to access and catches over 50% of the misses of the first level
Memory Hierarchy Impact on Performance
• An off-chip memory access may take about 100 cycles
• A cache miss that eventually has to go to main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance
• Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution
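The figures quoted on these two slides can be combined into a rough average access time. The sketch below is only an illustration: the function name and the choice of mid-range values (2 cycles for L1, 8 cycles for L2, exactly 50% L2 hit rate) are assumptions, not numbers fixed by the slides.

```python
def average_access_cycles(l1_cycles, l1_hit_rate, l2_cycles, l2_hit_rate, memory_cycles):
    """Average cycles per access for two cache levels in front of main memory."""
    l1_miss_rate = 1.0 - l1_hit_rate
    l2_miss_rate = 1.0 - l2_hit_rate
    # Every access pays the L1 latency; L1 misses also pay the L2 latency;
    # accesses that miss in both levels additionally pay the memory latency.
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * memory_cycles)


if __name__ == "__main__":
    # Assumed mid-range values taken from the ranges quoted on the slides.
    cycles = average_access_cycles(
        l1_cycles=2, l1_hit_rate=0.95,
        l2_cycles=8, l2_hit_rate=0.50,
        memory_cycles=100,
    )
    print(f"average access time: {cycles:.1f} cycles")
```

Under these assumptions the average stays within a few cycles even though a full miss costs about 100, which is why the structure of the hierarchy matters so much.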
Conclusion concerning memory problems
• Today's chips are largely able to execute code faster than we can feed them with instructions and data
• There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit
• The real design action is in memory subsystems: caches, buses, bandwidth, and latency
Conclusion concerning memory problems (continued)
• If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close
• One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors
Memory Hierarchy Solutions
• Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two
• System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change
b) Burst width, which refers to data access granularity, can effect a 15% performance change
c) Magnetic RAM (MRAM), a new type of memory
Magnetic RAM (MRAM) or NanotechRAM (NRAM)
• Based on nanoscale semiconductor technology
• A nanotechnology RAM device consists of tiny carbon nanotubes
• Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them
MRAM Capacity
• MRAM is nonvolatile; it has the fast read-and-write performance of static RAM (SRAM)
• The 10 Gbit devices consist of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter on a silicon wafer
MRAM as Universal Memory
• MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM)
• Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule (nm, 0-120) and density (Gb, log scale, 0.1-100) versus year (2003, 2005, 2010) for FLASH and DRAM. Design-rule labels: 100 nm, 90 nm, 70 nm, 55 nm. Density labels: 512 Mb, 1 Gb, 2 Gb, 4 Gb, 8 Gb, 16 Gb; cell types: SLC, MLC (2 bits/cell), MLC (3 bits/cell).]
Surpassing the Prediction from Moore's Law – DRAM vs MRAM
• The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year
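To make the difference between the two growth rates concrete, the sketch below compares density growth for a doubling period of 1.5 years (the usual reading of Moore's Law) with a doubling period of one year. The function name and the ten-year horizon are illustrative assumptions.

```python
def density_growth(years, doubling_period_years):
    """Growth factor after `years` when density doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    horizon = 10  # assumed horizon, in years
    moore = density_growth(horizon, 1.5)
    flash = density_growth(horizon, 1.0)
    print(f"doubling every 1.5 years: x{moore:.0f} in {horizon} years")
    print(f"doubling every year:      x{flash:.0f} in {horizon} years")
```

Over a decade the yearly-doubling curve ends up roughly an order of magnitude above the 1.5-year curve, which is the gap the slide refers to.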
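[Figure: density (Gb, log scale, 0.1-100) versus year (2000, 2005, 2010) for FLASH and DRAM against the Moore's Law line. FLASH follows a two-fold density increase per year, with MLC (2 bits/cell) and MLC (3 bits/cell) points; labeled densities include 512 Mb, 1 Gb, 2 Gb, 4 Gb, and 12 Gb.]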
Overall memory prediction roadmap
• Even though the density growth of DRAM will slow down, DRAM will keep leading overall memory technology and will be able to reach 8 Gb density in ten years
• High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap (continued)
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline: Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point?
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production
• During 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai '95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: power (W, log scale, 0.01-100) versus year (1980-1995); trend line: x4 every 3 years.]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage (V, 0-2.5), power per chip (W, 0-200), and VDD current (A, 0-500) versus year, 1998-2014.]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA), and the Taiwan Semiconductor Industry Association (TSIA). (Taken from Sakurai's ISSCC 2001 presentation.)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are

$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$

where
P1 – capacitive switching power (dynamic, dominant)
P2 – short-circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
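The power terms listed above can be evaluated numerically. The sketch below is a minimal illustration: the function name, the parameter names, and every numeric value in the example are assumptions chosen only to show how the terms combine, with P4 defaulting to zero because the slide calls it minor.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leakage, p4=0.0):
    """Return the individual power terms and their sum, all in watts."""
    p1 = p_t * c_load * v_dd ** 2 * f_clk  # capacitive switching power
    p2 = i_sc * v_dd                       # short-circuit power
    p3 = i_leakage * v_dd                  # leakage current power
    return {"P1": p1, "P2": p2, "P3": p3, "P4": p4, "P_avg": p1 + p2 + p3 + p4}


if __name__ == "__main__":
    # Hypothetical chip: 20% switching activity, 1 nF switched capacitance,
    # 3.3 V supply, 100 MHz clock, 5 mA short-circuit and 1 mA leakage current.
    terms = average_power(p_t=0.2, c_load=1e-9, v_dd=3.3, f_clk=100e6,
                          i_sc=5e-3, i_leakage=1e-3)
    for name, watts in terms.items():
        print(f"{name}: {watts * 1000:.1f} mW")
```

With values of this kind the switching term dominates, and because it depends on the square of the supply voltage, halving the voltage cuts it to a quarter.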
Research Efforts in Low-Power Design

$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$

• Reduce switching activity
  – Conditional clock
  – Conditional precharge
  – Switching off inactive blocks
  – Conditional execution
• Run it slower
  – Use parallelism
  – Fewer pipeline stages
  – Use double-edge flip-flops
• Technology scaling
  – The highest win
  – Thresholds should scale
  – Leakage starts to bite
  – Dynamic voltage scaling
• Reduce the active load
  – Minimize the circuits
  – Use more efficient design
  – Charge recycling
  – More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
  – Reducing the supply voltage brings a quadratic improvement
  – Reducing the load capacitance improves both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V, 0.6 to 5.0).]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
• The basic idea is to decrease the activity of some parts within the VLSI IC
• The term power management refers to such techniques in general
• Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
General Approaches to Reduce Power
a) Reduction in f_CLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in V_dd is the most effective way to reduce power, since power is proportional to the square of V_dd. The problem with reducing V_dd is that it leads to an increase in circuit delay
c) The product p_t C_L is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit, or technology level
Low Power and Low Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology. Higher levels offer higher impact and more options:
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as a Technique to Reduce Power
• Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention
• This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a master PLL (f_CLK) drives four local clock generators, PLL 1/DLL 1 to PLL 4/DLL 4, producing the local frequencies f11, f12, f21, f31, f32, f41.]
Energy Minimization Using Multiple Frequency
[Figure, PLL based: phase detector (CLKREF, CLKFB, up/down), current pump, loop filter, VCO, and divider by N; the regulated voltage and the clock distribution & frequency multiplier feed the logic of the digital system.]
[Figure, DLL based: phase detector (CLKREF, CLKFB, up/down), current pump, loop filter, and a voltage-controlled delay line (VCDL, delay T_VCDL) built from delay cells DC1, DC2, ..., DCn with control logic, producing f1, 2f1, ..., nf1 for the digital system.]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure, clock gating: an enable signal passes through a latch and is combined with the clock to produce the gated clock that drives the target flip-flops (DFF with inputs D and C and output Q); the waveform shows activated and deactivated intervals.]
[Figure, clock distribution: a PLL (clock generator) distributes Clk to modules A, B, C, and D, gated by Enable_A, Enable_B, and Enable_C.]
• The use of gated clocks is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
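The effect of suppressing the clock to an unused module can be illustrated with a small behavioral model. This is a sketch only: the class and function names are hypothetical, and the model simply counts the clock pulses that reach a register as a stand-in for switching energy.

```python
class GatedRegister:
    """A register whose clock is suppressed while `enable` is low."""

    def __init__(self):
        self.value = 0
        self.clock_pulses = 0  # pulses that actually reach the flip-flops

    def tick(self, data, enable):
        # gated clock = clock AND enable; the clock is active on every tick
        if enable:
            self.clock_pulses += 1
            self.value = data
        return self.value


def simulate(samples):
    """Drive one register with a sequence of (data, enable) pairs."""
    register = GatedRegister()
    for data, enable in samples:
        register.tick(data, enable)
    return register


if __name__ == "__main__":
    # Hypothetical workload: the module is needed in 2 of 8 cycles.
    samples = [(1, True)] + [(0, False)] * 6 + [(9, True)]
    reg = simulate(samples)
    print(f"cycles: {len(samples)}, clock pulses delivered: {reg.clock_pulses}")
```

An ungated register would receive a pulse on every cycle; here only the enabled cycles toggle the clock input, which is where the energy saving comes from.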
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip is a less aggressive approach that is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
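A rough estimate of the saving from this scheme can be sketched by charging each module an energy proportional to its switched capacitance times the square of its supply voltage. The function name, the module list, and the two voltage levels below are hypothetical; level-shifter overhead and timing checks are ignored.

```python
def chip_energy(modules, v_high, v_low):
    """Relative switching energy when critical modules run at v_high, others at v_low.

    `modules` is a list of (switched_capacitance, on_critical_path) pairs.
    """
    total = 0.0
    for capacitance, on_critical_path in modules:
        voltage = v_high if on_critical_path else v_low
        total += capacitance * voltage ** 2
    return total


if __name__ == "__main__":
    # Hypothetical design: one critical module and three noncritical ones.
    modules = [(1.0, True), (1.0, False), (2.0, False), (0.5, False)]
    single = chip_energy(modules, v_high=3.3, v_low=3.3)
    dual = chip_energy(modules, v_high=3.3, v_low=2.0)
    print(f"energy with dual supply: {100 * dual / single:.0f}% of single supply")
```

The critical module keeps its full voltage and therefore its speed, while the noncritical modules contribute most of the saving.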
System-Level Dynamic Power Management as Another Technique to Reduce Power
• Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
[Figure, power manager: an observer collects workload information (observations) from the system, and a controller issues commands to it.]
[Figure, power state machine: RUN (P = 400 mW), IDLE (P = 50 mW, waits for an interrupt), and SLEEP (P = 0.16 mW, waits for a wake-up event); transition times of ~10 µs, ~90 µs, and 160 ms.]
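The three power levels of the power state machine (400 mW, 50 mW, 0.16 mW) are enough for a first estimate of what a power manager can save. The sketch below is illustrative only: the schedule format and the example workload are assumptions, and transition times and transition energy are ignored.

```python
# Power levels as read from the power state machine figure, in milliwatts.
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def energy_mj(schedule):
    """Energy in millijoules for a list of (state, seconds) intervals."""
    return sum(POWER_MW[state] * seconds for state, seconds in schedule)


if __name__ == "__main__":
    # Hypothetical workload: 2 s of useful work in a 10 s window.
    always_run = [("RUN", 10.0)]
    run_then_idle = [("RUN", 2.0), ("IDLE", 8.0)]
    run_then_sleep = [("RUN", 2.0), ("SLEEP", 8.0)]
    for name, schedule in [("always run", always_run),
                           ("run + idle", run_then_idle),
                           ("run + sleep", run_then_sleep)]:
        print(f"{name:12s}: {energy_mj(schedule):8.2f} mJ")
```

A real policy also has to weigh the long wake-up time from the sleep state, so entering it only pays off for sufficiently long inactive periods.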
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown by clock, memory, control, I/O, and datapath (slices labeled 12, 16, 25, and 43). Dynamic instruction statistics by arithmetic, logical, compare, data move, control flow, and other operations.]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline: Microprocessors' Generations
• First generation, 1971-78
  – Behind the power curve
• Second generation, 1979-85
  – Becoming "real" computers
• Third generation, 1985-89
  – Challenging the "establishment"
• Fourth generation, 1990-
  – Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation, 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  – Narrow datapaths (= slow performance)
  – Awkward architectures
  – Assembly language + some BASIC
• Processors
  – Intel 4004, 8008, 8080, 8086
  – Zilog Z-80
  – Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  – 3,500 transistors
  – First microprocessor-based computer (Micral)
    • Targeted at laboratory instrumentation
    • Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
  – Delivered in 1974
  – 4,800 transistors
  – Performance < 0.2 MIPS
• Used in the Altair 8800 system
  – Kit form (advertised in Popular Electronics) in 1975
    • $297, or $395 with case
    • 256 bytes of memory, expandable to 64K
    • Keyboard and floppy
    • 100-line bus becomes S-100, the first microcomputer bus
  – Gates & Allen write BASIC
  – Wozniak builds one
    • Homebrew Computer Club
Intel 8086
• Introduced in 1978
  – Performance < 0.5 MIPS
• New 16-bit architecture
  – "Assembly language" compatible with 8080
  – 29,000 transistors
  – Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
  – Based on the 8088, an 8-bit bus version of the 8086
Second Generation, 1979-85
• Becoming "real" computers
  – First 32-bit architecture (68000)
  – First virtual memory support
  – Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  – Motorola 68000, 68020
  – Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  – First 32-bit architecture
    • Initial 16-bit implementation
  – First flat 32-bit address
    • Support for paging
  – General-purpose register architecture
    • Loosely based on PDP-11
• First implementation in 1979
  – 68,000 transistors
  – < 1 MIPS
• Used in
  – Apple Mac
  – Sun, Silicon Graphics & Apollo workstations
Third Generation, 1985-89
• Challenging the "establishment"
  – Microprocessors surpass minicomputers in performance, rival mainframes
  – Implementation technology of choice: all new architectures are microprocessors
  – RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  – MIPS R2000, R3000
  – Sun SPARC
  – HP PA-RISC
MIPS R2000
• Several firsts
  – First RISC microprocessor
  – First microprocessor to provide integrated support for instruction & data caches
  – First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  – 125,000 transistors
  – 5-8 MIPS
Fourth Generation, 1990-
• Architectural and performance leadership
  – First 64-bit architecture
  – First multiple-issue machine
  – First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  – Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  – Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  – True from 1985 to the present
• Combination of technology and architectural enhancements
  – Technology provides faster transistors and more of them
  – Faster transistors lead to higher clock rates
  – More transistors
• Architectural ideas turn transistors into performance
  – Responsible for about half the yearly performance growth
• Two key architectural directions
  – Sophisticated memory hierarchies
  – Exploiting instruction-level parallelism
Memory HierarchiesMemory Hierarchies Caches hide latency of DRAM and increase BW
ndash CPU-DRAM access gap has grown by a factor of 30-50 Trend 1 Increasingly large caches
ndash On-chip from 128 bytes (1984) to 100K+ bytesndash Multilevel caches add another level of caching
First multilevel cache1986 Secondary cache sizes today 128KB to 4-16 MB
Trend 2 Advances in caching techniquesndash Reduce or hide cache miss latencies
early restart after cache miss (1992) nonblocking caches continue during a cache miss (1994)
ndash Cache aware combos computers compilers code writers prefetching instruction to bring data into cache early
Exploiting ILPExploiting ILP ILP is the implicit parallelism among instructions Exploited by
ndash Overlapping execution in a pipeline ndash Issuing multiple instruction per clock
superscalar uses dynamic issue decision (HW driven) VLIW uses static issue decision (SW driven)
1985 simple microprocessor pipeline (1 instrclock) 1990 first static multiple issue microprocessors 1995 sophisticated dynamic schemes
ndash determine parallelism dynamicallyndash execute instructions out-of-orderndash speculative execution depending on branch prediction
ldquoOff-the-shelfrdquo ILP techniques yielded 20 year path
MIPS R4000MIPS R4000 First 64-bit architecture Integrated caches
ndash On-chipndash Support for off-chip secondary
cache Integrated floating point Implemented in 1991
ndash Deep pipelinendash 14M transistorsndash Initially 100MHzndash gt 50 MIPS
Intel i860Intel i860
First multiple issue microprocessor ndash 2 instructionsclockndash Dual issue mode ndash Novel push pipelinendash Novel cache bypass
Implemented in 1991ndash 13M transistorsndash 50 mips
Used primarily as attached processor (eg graphics)
MIPS R10000MIPS R10000
First speculative processorndash Instruction scheduled and
executed out-of-orderndash Up to 4 instructions can
complete per clockndash Window of 32 instructions
(up to 32 in-flight)ndash Maintain precise state by
completing instructions in order
Implemented in 1996ndash 68M transistorsndash 200 MHz
Intel IA-64 and ItaniumIntel IA-64 and Itanium
EPIC architecturendash Use compiler centric approach
while avoiding disadvantagesndash Parallelism demarcated by the
compilerndash Many special instruction amp
features for exploiting ILP in the compiler
Itaniumndash First implementation (2001) ndash 25 M transistorsndash 800 MHzndash 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software techniques
– Static scheduling
– Static issue (i.e., VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
Hardware techniques
– Dynamic scheduling
– Dynamic issue (i.e., superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
Wide variety of approaches, both hardware and compiler intensive
Software techniques: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware techniques: more stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: "ILP Mountain" with labels simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation; "Cache Mountain" with labels simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be the key limit
[Figure: performance (log scale) versus year, 1965-1995, for supercomputers, mainframes, minicomputers, and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970-2005, with data points i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000; eras marked as bit-level parallelism, instruction-level, and thread-level (?)]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyper-Threading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
They support 2-4 threads, but expect only about a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
4–8 general-purpose processing engines on chip, used to execute
– independent programs
– explicitly parallel programs (when possible)
– speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures

Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
---------------|-------------|-----------------------------|--------------------
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view on the question: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now: some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT.
Discrete mathematics, not continuous mathematics, is the underpinning of IT.
It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast.
Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields.
More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context.
The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them.
These, too, are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current, cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Memory Hierarchy Impact on Performance
An off-chip memory access may take about 100 cycles.
A cache miss that eventually has to go to main memory can take about the same amount of time as executing 100 arithmetic and logic unit (ALU) instructions, so the structure of the memory hierarchy has a major impact on performance.
Caches are made bigger, and heuristics are used to make sure the cache contains the portions of memory that are most likely to be used in the near future of program execution.
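The cost of a 100-cycle miss can be made concrete with the usual average memory access time estimate (hit time plus miss rate times miss penalty). This is a minimal sketch: only the 100-cycle penalty comes from the slide, while the 1-cycle hit time and the miss rates are assumed values chosen for illustration.

```python
def average_access_time(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time in cycles: hit time + miss rate * miss penalty."""
    return hit_cycles + miss_rate * miss_penalty_cycles


MISS_PENALTY = 100  # cycles for an off-chip access, as stated on the slide
HIT_CYCLES = 1      # assumed on-chip cache hit time (not from the slides)

if __name__ == "__main__":
    for miss_rate in (0.01, 0.05, 0.10):  # assumed miss rates
        amat = average_access_time(HIT_CYCLES, miss_rate, MISS_PENALTY)
        print(f"miss rate {miss_rate:.0%}: about {amat:.1f} cycles per access")
```

Under these assumptions, even a miss rate of a few percent makes the average access several times slower than a hit, which is why larger caches and better replacement heuristics pay off.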
Conclusions concerning memory problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems: caches, buses, bandwidth, and latency.
Conclusions concerning memory problems (continued)
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture-level and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters affect performance most:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change.
b) Burst width, which refers to data access granularity, can effect a 15% performance change.
c) Magnetic RAM (MRAM), a new type of memory
Magnetic RAM – MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes.
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter, on a silicon wafer.
MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM).
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM).
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH – MRAM
[Figure: design rule (nm; 100, 90, 70, 55) and density (Gb, log scale) for FLASH and DRAM, 2003-2010; FLASH points are marked SLC, MLC (2 bits/cell), and MLC (3 bits/cell), with densities labeled up to 16 Gb]
Surpassing the Prediction from Moore's Law – DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
[Figure: density (Gb, log scale, 0.1 to 100) versus year, 2000-2010, comparing FLASH (twofold density per year; MLC 2 bits/cell and MLC 3 bits/cell) and DRAM against the Moore's Law trend line]
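The gap between the two growth rates compounds quickly. The short sketch below compares a doubling every 1.5 years (the Moore's Law rate quoted above) with a doubling every year (the NAND Flash growth model); the 1 Gb starting density is an assumed reference point, not a value read from the figure.

```python
def density_after(years, initial_gb, doubling_period_years):
    """Density after `years`, doubling once every `doubling_period_years`."""
    return initial_gb * 2 ** (years / doubling_period_years)


START_GB = 1.0  # assumed reference density, for illustration only

if __name__ == "__main__":
    for years in (0, 3, 6):
        moore = density_after(years, START_GB, 1.5)  # doubling every 1.5 years
        flash = density_after(years, START_GB, 1.0)  # doubling every year
        print(f"after {years} years: Moore's Law {moore:g} Gb, yearly doubling {flash:g} Gb")
```

After six years the yearly-doubling curve is a factor of four ahead (64x versus 16x the starting density).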
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep leading overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore's Law.
Overall memory prediction roadmap (continued)
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figures: surviving labels are Mainframe and Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point?
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh.
• During 2000, the energy consumption of all PCs installed in the USA was 10% of total energy production.
• For 2015, one expects the energy consumption of all PCs to be 15% greater than in 1995, or 69 × 10^6 MWh.
Power consumption
Typical low-power applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai, '95)
"By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: power (W, log scale up to 100) versus year, 1980-1995; trend line marked x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage (V), power per chip (W), and VDD current (A) versus year, 1998-2014]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA), and the Taiwan Semiconductor Industry Association (TSIA). (Taken from Sakurai's ISSCC 2001 presentation.)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits, together with a minor static term, are

$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$

where
P1 – capacitive switching power (dynamic, dominant)
P2 – short-circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
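To see how the terms compare in magnitude, the expression can be evaluated term by term. The sketch below follows the four-term breakdown on this slide; all numeric values in the example (activity factor, capacitance, voltage, frequency, currents) are assumed for illustration and do not come from the presentation.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leakage, p_static=0.0):
    """Evaluate the four CMOS power terms (watts) and their sum.

    p_t       switching activity factor (dimensionless)
    c_load    switched load capacitance C_L (farads)
    v_dd      supply voltage (volts)
    f_clk     clock frequency (hertz)
    i_sc      average short-circuit current (amperes)
    i_leakage leakage current (amperes)
    p_static  remaining static dissipation P4 (watts)
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk  # capacitive switching power
    p2 = i_sc * v_dd                       # short-circuit power
    p3 = i_leakage * v_dd                  # leakage power
    return {"P1": p1, "P2": p2, "P3": p3, "P4": p_static,
            "P_avg": p1 + p2 + p3 + p_static}


if __name__ == "__main__":
    # Assumed example values, not taken from the slides.
    result = average_power(p_t=0.2, c_load=1e-9, v_dd=2.5, f_clk=100e6,
                           i_sc=5e-3, i_leakage=1e-4)
    for name, watts in result.items():
        print(f"{name}: {watts * 1e3:.2f} mW")
```

With these assumed numbers the switching term P1 dominates, and because it depends on the square of the supply voltage, halving `v_dd` cuts it to a quarter.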
Research Efforts in Low-Power Design

$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$

Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement
– Reducing the load capacitance improves both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V), over the range 0.6 V to 5.0 V]
Needs for Low Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is to decrease the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since power is proportional to the square of Vdd. The problem with reducing Vdd is that it increases circuit delay.
c) The product pt·CL is called the average switched capacitance per cycle. It can be reduced at the system, architectural, RTL, circuit, or technology level.
Low Power and Low Energy System Design
Design level and available techniques (higher levels offer higher impact and more options):
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology.
Multiple Frequency on the Chip as a Technique to Reduce Power
Multiple frequencies on the chip, as a less aggressive approach, are attracting more attention.
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four clock domains, each with its own PLL and DLL (PLL 1/DLL 1 through PLL 4/DLL 4), producing the local frequencies f11, f12, f21, f31, f32, and f41]
Energy Minimization Using Multiple Frequency
[Figure, PLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO, divider by N, regulated voltage, clock distribution & frequency multiplier, digital system]
[Figure, DLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, control of a voltage-controlled delay line (VCDL, delay TVCDL) built from delay cells DC1, DC2, ..., DCn, and logic producing f1, 2f1, ..., nf1 for the digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure, clock gating: the enable signal is captured by a latch and combined with the clock to produce the gated clock that drives the target flip-flops (DFF with inputs D and C, output Q); waveforms show clock, enable, and gated clock in activated and deactivated intervals]
[Figure, clock distribution: a PLL (clock generator) distributes Clk to blocks A, B, C, and D, gated by Enable_A, Enable_B, and Enable_C]
The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
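The effect of gating can be illustrated with a cycle-level model. This is only a behavioural sketch of the latch-based scheme named in the figure labels: each cycle is treated as a low clock phase, during which the latch is transparent and captures the enable signal, followed by a high phase in which the gated clock is the AND of the clock and the latched enable. The enable pattern is an assumed example.

```python
def gated_clock_pulses(enable_per_cycle):
    """Count clock pulses that reach the target flip-flops.

    Each cycle is modelled as a low phase followed by a high phase.
    The latch is transparent while the clock is low, so it captures
    `enable`; it holds that value while the clock is high, where
    gated_clock = clock AND latched_enable.
    """
    pulses = 0
    latched_enable = False
    for enable in enable_per_cycle:
        latched_enable = bool(enable)  # low phase: latch follows enable
        if latched_enable:             # high phase: clock = 1, AND with latch
            pulses += 1
    return pulses


if __name__ == "__main__":
    # Assumed activity pattern: the module is needed in 3 of 8 cycles.
    pattern = [1, 1, 0, 0, 0, 1, 0, 0]
    delivered = gated_clock_pulses(pattern)
    saved = 1 - delivered / len(pattern)
    print(f"{delivered} of {len(pattern)} clock pulses delivered, "
          f"{saved:.0%} of clock transitions suppressed")
```

In this model the flip-flops and their clock wiring toggle only in the cycles where the module is enabled, which is where the energy saving comes from.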
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
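The saving from a second supply can be estimated by applying the C·V² dependence of switching energy module by module. The sketch below is a rough illustration under stated assumptions: the module list, capacitances, and voltage levels are hypothetical, and level-converter overhead between voltage domains is ignored.

```python
def switching_energy(modules, v_high, v_low):
    """Sum of C * V^2 per switching event over all modules.

    `modules` is a list of (capacitance_farads, on_critical_path) pairs.
    Critical-path modules run at v_high, the others at v_low.
    """
    total = 0.0
    for capacitance, on_critical_path in modules:
        voltage = v_high if on_critical_path else v_low
        total += capacitance * voltage ** 2
    return total


if __name__ == "__main__":
    # Hypothetical design: one critical module, three noncritical ones.
    modules = [(2e-12, True), (1e-12, False), (1e-12, False), (3e-12, False)]
    single = switching_energy(modules, v_high=2.5, v_low=2.5)  # one supply
    dual = switching_energy(modules, v_high=2.5, v_low=1.8)    # two supplies
    print(f"single supply: {single * 1e12:.2f} pJ, dual supply: {dual * 1e12:.2f} pJ")
    print(f"energy reduced by {1 - dual / single:.0%}")
```

The critical-path module keeps its full voltage, so timing is unchanged, while the noncritical modules contribute the whole saving.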
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure, power state machine: states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt), and SLEEP (P = 0.16 mW, wait for wake-up event); transitions labeled ~10 µs, ~90 µs, and 160 ms]
[Figure, power manager: an observer and a controller; the power manager receives observations and workload information from the system and sends commands to it]
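The benefit of moving a component into low-power states can be estimated from the state powers shown in the power state machine (400 mW, 50 mW, 0.16 mW). The sketch below computes a time-weighted average; the residency times are assumed example workloads, and transition time and energy are deliberately ignored, so it gives an optimistic estimate.

```python
# State powers in milliwatts, as labeled in the power state machine figure.
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(seconds_in_state):
    """Time-weighted average power (mW) for the given residency per state.

    `seconds_in_state` maps a state name to the time spent in it.
    Transition costs are ignored in this simplified estimate.
    """
    total_time = sum(seconds_in_state.values())
    if total_time <= 0:
        raise ValueError("total time must be positive")
    energy_mj = sum(POWER_MW[state] * t for state, t in seconds_in_state.items())
    return energy_mj / total_time


if __name__ == "__main__":
    # Assumed workloads: busy 10% of the time, otherwise idle or asleep.
    always_on = average_power_mw({"RUN": 10.0})
    idle_only = average_power_mw({"RUN": 1.0, "IDLE": 9.0})
    with_sleep = average_power_mw({"RUN": 1.0, "IDLE": 1.0, "SLEEP": 8.0})
    print(f"always running: {always_on:.1f} mW")
    print(f"run + idle:     {idle_only:.1f} mW")
    print(f"run + idle + sleep: {with_sleep:.1f} mW")
```

A real power manager also has to weigh the long wake-up time from SLEEP against the expected length of the idle period, which this estimate leaves out.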
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure, power breakdown: pie chart with slices for clock, memory, control, I/O, and datapath]
[Figure, dynamic instruction statistics: pie chart with slices for arithmetic op, logical op, compare op, data move, control flow, and others]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Microprocessors' Generations
• First Generation, 1971-78
– Behind the power curve
• Second Generation, 1979-85
– Becoming "real" computers
• Third Generation, 1985-89
– Challenging the "establishment"
• Fourth Generation, 1990-
– Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation, 1971-78
Getting enough bits and transistors
Transistor counts < 50,000
Performance < 0.5 MIPS
Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
First general-purpose, single-chip microprocessor
Shipped in 1971
8-bit architecture, 4-bit implementation
2300 transistors
Performance < 0.1 MIPS
8008: 8-bit implementation in 1972
– 3500 transistors
– First microprocessor-based computer (Micral)
  Targeted at laboratory instrumentation
  Mostly sold in Europe
Intel 8080
Intel's first 16-bit architecture
– Delivered in 1974
– 4800 transistors
– Performance < 0.2 MIPS
Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
  $297, or $395 with case
  256 bytes of memory, expandable to 64K
  Keyboard and floppy
  100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
  Homebrew Computer Club
Intel 8086
Introduced in 1978
– Performance < 0.5 MIPS
New 16-bit architecture
– "Assembly language" compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
In 1981 IBM introduces the PC
– Based on the 8088, an 8-bit bus version of the 8086
Second Generation, 1979-85
Becoming "real" computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs, and PCs based on microprocessors
Transistors > 50,000
Performance <= 1 MIPS
Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
Major architectural step in microprocessors
– First 32-bit architecture, initial 16-bit implementation
– First flat 32-bit address
  Support for paging
– General-purpose register architecture
  Loosely based on PDP-11
First implementation in 1979
– 68,000 transistors
– < 1 MIPS
Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation, 1985-89
Challenging the "establishment"
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
Transistors < 500K
Performance > 5 MIPS
Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation, 1990-
Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
Transistors > 1M
Clock rates > 100 MHz
Performance > 50 MIPS
Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
Generation 4.5: same basic approach, but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
Increase performance at 1.6x per year
– True from 1985 to present
Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to higher clock rates
– More transistors
  Architectural ideas turn transistors into performance
  Responsible for about half the yearly performance growth
Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism
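A yearly factor of 1.6 compounds into very large gains over a decade or more. The short sketch below simply raises the quoted growth factor to the number of years; it assumes the rate stays constant over the whole period, which is a simplification.

```python
def relative_performance(years, yearly_factor=1.6):
    """Performance relative to the starting year, assuming constant yearly growth."""
    return yearly_factor ** years


if __name__ == "__main__":
    for years in (1, 5, 10, 15):
        gain = relative_performance(years)
        print(f"after {years:2d} years: about {gain:,.0f}x the starting performance")
```

At this rate performance roughly doubles every year and a half, and a decade of growth yields a gain of about two orders of magnitude.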
Memory Hierarchies
Caches hide latency of DRAM and increase bandwidth
– CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching
  First multilevel cache: 1986
  Secondary cache sizes today: 128 KB to 4-16 MB
Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
  Early restart after cache miss (1992)
  Nonblocking caches continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
  Prefetching: instruction to bring data into cache early
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
  Superscalar uses dynamic issue decisions (HW driven)
  VLIW uses static issue decisions (SW driven)
1985: simple microprocessor pipeline (1 instruction/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out of order
– speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded a 20-year path
MIPS R4000MIPS R4000 First 64-bit architecture Integrated caches
ndash On-chipndash Support for off-chip secondary
cache Integrated floating point Implemented in 1991
ndash Deep pipelinendash 14M transistorsndash Initially 100MHzndash gt 50 MIPS
Intel i860Intel i860
First multiple issue microprocessor ndash 2 instructionsclockndash Dual issue mode ndash Novel push pipelinendash Novel cache bypass
Implemented in 1991ndash 13M transistorsndash 50 mips
Used primarily as attached processor (eg graphics)
MIPS R10000MIPS R10000
First speculative processorndash Instruction scheduled and
executed out-of-orderndash Up to 4 instructions can
complete per clockndash Window of 32 instructions
(up to 32 in-flight)ndash Maintain precise state by
completing instructions in order
Implemented in 1996ndash 68M transistorsndash 200 MHz
Intel IA-64 and ItaniumIntel IA-64 and Itanium
EPIC architecturendash Use compiler centric approach
while avoiding disadvantagesndash Parallelism demarcated by the
compilerndash Many special instruction amp
features for exploiting ILP in the compiler
Itaniumndash First implementation (2001) ndash 25 M transistorsndash 800 MHzndash 130 Watts
Breakdown of tasks between Breakdown of tasks between compiler and runtime hardwarecompiler and runtime hardware
Todayrsquos Uniprocessor ILP Todayrsquos Uniprocessor ILP MenuMenu
Software Techniquesndash Static schedulingndash Static issue (ie VLIW)ndash Static branch predictionndash Aliaspointer analysisndash Static speculation
Hardware Techniquesndash Dynamic schedulingndash Dynamic issue (ie superscalar)ndash Dynamic branch predictionndash Dynamic disambiguationndash Dynamic speculation
Wide variety of approaches both hardware and compiler intensive
Lower hardware complexity
More longer range analysis
More machine dependence
Lower hardware complexity
More longer range analysis
More machine dependence
More stable performance
Higher complexity
Potential clock rate impact
More stable performance
Higher complexity
Potential clock rate impact
No clear cut winners at the present
Big Picture--ILP and Memory Big Picture--ILP and Memory SystemsSystems
Simplepipelining
Scheduledpipelines
Multipleissue
Dynamicschedulin
g
Speculation
My view
bullNo performance wall but steeper slopes ahead
bullEasier territory is behind us
bullIndustry-research gap vanished
bullEnergy efficiency may be key limit
ILPMountai
n
Multilevelcaches amp buffers
Critical word amp early restart
Compilerprefetchin
g
Multipathprefetchin
g
Simplecaches
CacheMountai
n
Per
form
ance
01
1
10
100
1965 1970 1975 1980 1985 1990 1995
Supercomputers
Minicomputers
Mainframes
Microprocessors
Microprocessors today where Microprocessors today where they are and what can dothey are and what can do
Tran
sist
ors
1000
10000
100000
1000000
10000000
100000000
1970 1975 1980 1985 1990 1995 2000 2005
Bit-level parallelism Instruction-level Thread-level ()
i4004
i8008i8080
i8086
i80286
i80386
R2000
Pentium
R10000
R3000
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to achieve higher throughput at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyper-Threading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
They support 2-4 threads, but expect only about a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa-2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute:
- independent programs
- explicitly parallel programs (when possible)
- speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view on the question: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and in industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now: some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global cultural and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching,
• distance learning,
• electronic books, and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Conclusions concerning memory problems
Today's chips are largely able to execute code faster than we can feed them with instructions and data.
There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit.
The real design action is in memory subsystems: caches, buses, bandwidth and latency.
Conclusions concerning memory problems (continued)
If the memory research community would follow the microprocessor community's lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change.
b) Burst width, which refers to data access granularity, can effect a 15% performance change.
c) Magnetic RAM (MRAM): a new type of memory.
Magnetic RAM (MRAM) or Nanotech RAM (NRAM)
Based on nanoscale semiconductor technology.
A nanotechnology RAM device consists of tiny carbon nanotubes.
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
The 10 Gbit devices consist of 10 billion carbon nanotubes that are 1 nm (just a few thousand atoms) in diameter, on a silicon wafer.
MRAM is nonvolatile and has the fast read-and-write performance of static RAM (SRAM).
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM).
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule (nm, 0 to 120) and density (Gb, log scale 0.1 to 100) versus year (2003, 2005, 2010) for FLASH and DRAM. Labels include design rules of 100 nm, 90 nm, 70 nm and 55 nm; densities from 0.5 Gb (512 Mb) to 16 Gb; and SLC, MLC (2 bits/cell) and MLC (3 bits/cell) flash.]
Surpassing the Prediction from Moore's Law: DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
[Figure: density (Gb, log scale 0.1 to 100) versus year, 2000-2010, comparing the Moore's Law trend with DRAM and FLASH (two-fold density per year), including MLC (2 bits/cell) and MLC (3 bits/cell) flash. Labeled points range from 0.5 Gb (512 Mb) upward.]
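The difference between the two doubling periods compounds quickly. The following is a minimal Python sketch of that arithmetic; the 1 Gb starting density and the 6-year horizon are assumptions chosen only for illustration, not values taken from the chart.

```python
def density_after(years: float, start_gbit: float, doubling_period_years: float) -> float:
    """Density after `years`, doubling once every `doubling_period_years`."""
    return start_gbit * 2 ** (years / doubling_period_years)


START_GBIT = 1.0  # hypothetical starting density (assumption)
YEARS = 6         # hypothetical horizon (assumption)

moore = density_after(YEARS, START_GBIT, 1.5)  # doubling every 1.5 years
flash = density_after(YEARS, START_GBIT, 1.0)  # doubling every year

print(f"After {YEARS} years: 1.5-year doubling gives {moore:.0f} Gb, "
      f"yearly doubling gives {flash:.0f} Gb")
```

Under these assumptions the yearly-doubling curve ends up four times higher after six years, which is why the two trends separate so visibly on a logarithmic plot.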
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep leading overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore's Law.
Overall memory prediction roadmap (continued)
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure: computer food chain; labels include Mainframe and Supercomputer]
Outline: Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point?
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh.
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production.
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, i.e., 69 × 10^6 MWh.
Power consumption
Typical low-power applications:
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai '95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: power (W, log scale 0.01 to 100) versus year, 1980-1995; trend of x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage (V, 0 to 2.5), power per chip (W) and VDD current (A) versus year, 1998-2014]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA) and the Taiwan Semiconductor Industry Association (TSIA). (Taken from Sakurai's ISSCC 2001 presentation.)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are

$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$

where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
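To make the relative size of the terms concrete, here is a minimal Python sketch that evaluates P1 = p_t·C_L·V_dd²·f_clk, P2 = I_sc·V_dd and P3 = I_leakage·V_dd separately. All numeric values are hypothetical and chosen only so the arithmetic is easy to follow; they do not describe any particular chip.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leakage, p_static=0.0):
    """Return the individual power terms (in watts) and their sum.

    p_t        switching activity factor (dimensionless)
    c_load     load capacitance C_L in farads
    v_dd       supply voltage in volts
    f_clk      clock frequency in hertz
    i_sc       average short-circuit current in amperes
    i_leakage  leakage current in amperes
    p_static   remaining static dissipation P4 in watts
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk  # capacitive switching power
    p2 = i_sc * v_dd                       # short-circuit power
    p3 = i_leakage * v_dd                  # leakage power
    p4 = p_static                          # minor static term
    return {"P1": p1, "P2": p2, "P3": p3, "P4": p4, "total": p1 + p2 + p3 + p4}


# Hypothetical operating point (assumptions, not measured data).
example = average_power(p_t=0.2, c_load=1e-9, v_dd=2.0, f_clk=100e6,
                        i_sc=5e-3, i_leakage=1e-3)
for name, watts in example.items():
    print(f"{name}: {watts * 1e3:.1f} mW")
```

With these assumed values the switching term P1 accounts for most of the total, which matches its description as the dominant component.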
Research Efforts in Low-Power Design

$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$

Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement.
- Reducing the load capacitance improves both power dissipation and circuit speed.
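As a worked illustration of the quadratic dependence, consider lowering the supply from 5 V to 3 V with the activity factor, load capacitance and clock frequency held constant. The two voltages are assumptions chosen for the arithmetic.

```latex
\[
\frac{P_{sw}(3\,\mathrm{V})}{P_{sw}(5\,\mathrm{V})}
  = \frac{p_t\, C_L\, (3)^2\, f_{CLK}}{p_t\, C_L\, (5)^2\, f_{CLK}}
  = \frac{9}{25} = 0.36
\]
```

Under these assumptions a 40% reduction in supply voltage removes 64% of the switching power.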
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V), with marked points at 0.6, 3.0 and 5.0 V]
Needs for Low Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of much research is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is to decrease the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
The terms of the switching power offer three handles:
a) Reduction in f_CLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in V_dd is the most effective way to reduce power, since power is proportional to the square of V_dd. The problem with reducing V_dd is that it increases circuit delay.
c) The product p_t C_L is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit or technology level.
General Approaches to Reduce Power
Low Power and Low Energy System Design
Design level | Techniques
System level | Design partitioning, power down
Algorithm level | Complexity, concurrency, locality, regularity, data representation
Architecture level | Voltage scaling, parallelism, instruction set, signal correlations
Circuit level | Transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level | Threshold reduction, multi-threshold
Higher levels offer higher impact and more options.
The design of low-power circuits can be tackled at different levels, from system to technology.
Multiple Frequency on the Chip as a Technique to Reduce Power
A less aggressive approach is attracting more attention.
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4) that generate the local frequencies f11, f12, f21, f31, f32 and f41]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation (phase detector, current pump, loop filter, VCO and divider by N, with CLKREF and CLKFB inputs and up/down signals) and DLL-based clock generation (phase detector, current pump, loop filter and a VCDL with delay cells DC1 to DCn producing f1, 2f1, ..., nf1). Each supplies the regulated voltage and the clock distribution and frequency multiplier logic of a digital system.]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating (a latch and an enable signal produce the gated clock that drives the target D flip-flops, which are thereby activated or deactivated) and clock distribution (a PLL clock generator drives Clk to modules A, B, C and D, gated by Enable_A, Enable_B and Enable_C)]
The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
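The effect of clock gating can be sketched with a small behavioral model. The Python below is an illustrative simulation, not a circuit description from the slides: it assumes a latch that is transparent while the clock is low and a gated clock formed as the AND of the clock and the latched enable, and it counts how many rising edges reach the module. The waveforms are hypothetical.

```python
def gate_clock(clock, enable):
    """Latch-based clock gating on sampled 0/1 waveforms.

    The enable is captured while the clock is low (latch transparent), so the
    gated clock cannot glitch when enable changes during the high phase.
    """
    latched = 0
    gated = []
    for clk, en in zip(clock, enable):
        if clk == 0:              # latch follows enable only in the low phase
            latched = en
        gated.append(clk & latched)
    return gated


def rising_edges(signal):
    """Count 0 -> 1 transitions, assuming the signal starts from 0."""
    previous = [0] + signal[:-1]
    return sum(1 for p, c in zip(previous, signal) if p == 0 and c == 1)


# Hypothetical waveforms: 8 clock cycles, module needed in the first two and last two.
clock = [0, 1] * 8
enable = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
gated = gate_clock(clock, enable)

print("clock edges seen without gating:", rising_edges(clock))
print("clock edges seen with gating:   ", rising_edges(gated))
```

Since the flip-flops of an idle module only switch on clock edges, halving the edges delivered to it roughly halves its clock-related switching activity in this toy scenario.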
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip, a less aggressive approach, is attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
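The critical/noncritical split described above can be sketched as follows. This is a simplified Python illustration under stated assumptions: the module names, capacitances and the two voltage levels are hypothetical, switching energy is modeled as C·V² per cycle, and the overhead of level conversion between voltage domains is ignored.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    capacitance: float  # switched capacitance per cycle (arbitrary units)
    critical: bool      # True if the module lies on a critical path


def assign_voltages(modules, v_high, v_low):
    """Critical-path modules keep the high supply; the rest get the low one."""
    return {m.name: (v_high if m.critical else v_low) for m in modules}


def switching_energy(modules, voltages):
    """Sum of C * V^2 over all modules (energy per cycle, arbitrary units)."""
    return sum(m.capacitance * voltages[m.name] ** 2 for m in modules)


# Hypothetical design (assumptions for illustration only).
modules = [
    Module("alu", 4.0, critical=True),
    Module("decoder", 2.0, critical=False),
    Module("io", 2.0, critical=False),
]

single = switching_energy(modules, {m.name: 3.3 for m in modules})
multi = switching_energy(modules, assign_voltages(modules, v_high=3.3, v_low=2.0))

print(f"single supply: {single:.2f}, dual supply: {multi:.2f}, "
      f"ratio: {multi / single:.2f}")
```

In this toy case the dual-supply assignment uses roughly two thirds of the single-supply switching energy while leaving the critical module at full voltage.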
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure: power manager (an observer and a controller that exchange observations, workload information and commands with the system) and power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt) and SLEEP (P = 0.16 mW, wait for wake-up event); transition labels as extracted: ~10s, ~90s and 160ms]
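A rough feel for what the power state machine buys can be had from a time-weighted average. The Python sketch below uses the three state powers shown in the figure (400 mW, 50 mW, 0.16 mW); the fractions of time spent in each state are hypothetical, and transition delays and transition energy are deliberately ignored.

```python
# State powers in milliwatts, as given for the power state machine.
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(time_in_state):
    """Time-weighted average power for a mapping of state -> time spent.

    Any consistent time unit works; transition costs are not modeled.
    """
    total_time = sum(time_in_state.values())
    energy = sum(POWER_MW[state] * t for state, t in time_in_state.items())
    return energy / total_time


# Hypothetical workload: useful work takes 10% of the time (assumption).
always_run = average_power_mw({"RUN": 100.0})
run_idle = average_power_mw({"RUN": 10.0, "IDLE": 90.0})
run_idle_sleep = average_power_mw({"RUN": 10.0, "IDLE": 10.0, "SLEEP": 80.0})

print(f"always RUN:       {always_run:.1f} mW")
print(f"RUN + IDLE:       {run_idle:.1f} mW")
print(f"RUN + IDLE+SLEEP: {run_idle_sleep:.1f} mW")
```

A real power manager also has to weigh the wake-up latency of each state against the expected idle period, which this sketch leaves out.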
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown: clock, memory, control, I/O and datapath (values shown: 12, 16, 25, 43). Dynamic instruction statistics: compare, logical, data move, control flow, arithmetic and other operations (values shown: 15, 13, 5, 1, 23, 43).]
Architecture Trade-offs: Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline: Microprocessors' Generations
• First generation, 1971-78: behind the power curve
• Second generation, 1979-85: becoming "real" computers
• Third generation, 1985-89: challenging the "establishment"
• Fourth generation, 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation: 1971-78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
- Narrow datapaths (= slow performance)
- Awkward architectures
- Assembly language + some BASIC
• Processors
- Intel 4004, 8008, 8080, 8086
- Zilog Z-80
- Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
- 3,500 transistors
- First microprocessor-based computer (Micral): targeted at laboratory instrumentation, mostly sold in Europe
Intel 8080
• 8-bit architecture with 16-bit addresses
- Delivered in 1974
- 4,800 transistors
- Performance < 0.2 MIPS
• Used in the Altair 8800 system
- Kit form (advertised in Popular Electronics) in 1975: $297, or $395 with case; 256 bytes of memory, expandable to 64K; keyboard and floppy; 100-line bus becomes S-100, the first microcomputer bus
- Gates & Allen write BASIC
- Wozniak builds one (Homebrew Computer Club)
Intel 8086
• Introduced in 1978
- Performance < 0.5 MIPS
• New 16-bit architecture
- "Assembly language" compatible with 8080
- 29,000 transistors
- Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
- Based on the 8088, an 8-bit bus version of the 8086
Second Generation: 1979-85
Becoming "real" computers
- First 32-bit architecture (68000)
- First virtual memory support
- Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
- Motorola 68000, 68020
- Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
- First 32-bit architecture, initial 16-bit implementation
- First flat 32-bit address; support for paging
- General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
- 68,000 transistors
- < 1 MIPS
• Used in
- Apple Mac
- Sun, Silicon Graphics & Apollo workstations
Third Generation: 1985-89
Challenging the "establishment"
- Microprocessors surpass minicomputers in performance and rival mainframes
- Implementation technology of choice: all new architectures are microprocessors
- RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
- MIPS R2000, R3000
- Sun SPARC
- HP PA-RISC
MIPS R2000
• Several firsts
- First RISC microprocessor
- First microprocessor to provide integrated support for instruction & data cache
- First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
- 125,000 transistors
- 5-8 MIPS
Fourth Generation: 1990-
Architectural and performance leadership
- First 64-bit architecture
- First multiple-issue machine
- First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
- Intel i860, Pentium; MIPS R4000, MIPS R10000; DEC Alpha; Sun UltraSPARC; HP PA-RISC; PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
- Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Performance increases at 1.6x per year
- True from 1985 to the present
• Combination of technology and architectural enhancements
- Technology provides faster transistors and more of them
- Faster transistors lead to high clock rates
- More transistors
• Architectural ideas turn transistors into performance
- Responsible for about half the yearly performance growth
• Two key architectural directions
- Sophisticated memory hierarchies
- Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide the latency of DRAM and increase bandwidth
- CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
- On-chip: from 128 bytes (1984) to 100K+ bytes
- Multilevel caches add another level of caching: first multilevel cache in 1986; secondary cache sizes today 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
- Reduce or hide cache miss latencies: early restart after cache miss (1992); nonblocking caches continue during a cache miss (1994)
- Cache-aware combos: computers, compilers, code writers; prefetching instruction to bring data into cache early
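The benefit of adding a second cache level can be estimated with the standard average memory access time (AMAT) formula. The Python sketch below is an illustration only: the latencies (2, 5 and 50 cycles) and the miss rates (5% and 20%) are assumptions, not measurements of any processor discussed here.

```python
def amat(l1_hit_time, l1_miss_rate, l2_hit_time, l2_miss_rate, memory_time):
    """Average memory access time in cycles for a two-level cache hierarchy.

    AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * memory time)
    """
    return l1_hit_time + l1_miss_rate * (l2_hit_time + l2_miss_rate * memory_time)


# Hypothetical parameters (assumptions for illustration).
L1_HIT, L2_HIT, MEMORY = 2, 5, 50   # access times in cycles
L1_MISS, L2_MISS = 0.05, 0.20       # local miss rates

# Without a secondary cache every L1 miss goes straight to memory; this is
# modeled as an L2 with zero hit time that always misses.
one_level = amat(L1_HIT, L1_MISS, 0, 1.0, MEMORY)
two_level = amat(L1_HIT, L1_MISS, L2_HIT, L2_MISS, MEMORY)

print(f"one-level AMAT: {one_level:.2f} cycles")
print(f"two-level AMAT: {two_level:.2f} cycles")
```

Even with these modest assumed miss rates the second level cuts the average access time substantially, and the gain grows as the CPU-DRAM gap widens.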
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
- Overlapping execution in a pipeline
- Issuing multiple instructions per clock: superscalar uses dynamic issue decision (HW driven); VLIW uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
- determine parallelism dynamically
- execute instructions out of order
- speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded a 20-year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
- On-chip
- Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
- Deep pipeline
- 1.4M transistors
- Initially 100 MHz
- > 50 MIPS
Intel i860
• First multiple-issue microprocessor
- 2 instructions/clock
- Dual issue mode
- Novel push pipeline
- Novel cache bypass
• Implemented in 1991
- 1.3M transistors
- 50 MIPS
• Used primarily as an attached processor (e.g., graphics)
MIPS R10000
• First speculative processor
- Instructions scheduled and executed out of order
- Up to 4 instructions can complete per clock
- Window of 32 instructions (up to 32 in flight)
- Maintains precise state by completing instructions in order
• Implemented in 1996
- 6.8M transistors
- 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
- Uses a compiler-centric approach while avoiding its disadvantages
- Parallelism demarcated by the compiler
- Many special instructions & features for exploiting ILP in the compiler
• Itanium
- First implementation (2001)
- 25M transistors
- 800 MHz
- 130 watts
Breakdown of tasks between Breakdown of tasks between compiler and runtime hardwarecompiler and runtime hardware
Todayrsquos Uniprocessor ILP Todayrsquos Uniprocessor ILP MenuMenu
Software Techniquesndash Static schedulingndash Static issue (ie VLIW)ndash Static branch predictionndash Aliaspointer analysisndash Static speculation
Hardware Techniquesndash Dynamic schedulingndash Dynamic issue (ie superscalar)ndash Dynamic branch predictionndash Dynamic disambiguationndash Dynamic speculation
Wide variety of approaches both hardware and compiler intensive
Lower hardware complexity
More longer range analysis
More machine dependence
Lower hardware complexity
More longer range analysis
More machine dependence
More stable performance
Higher complexity
Potential clock rate impact
More stable performance
Higher complexity
Potential clock rate impact
No clear cut winners at the present
Big Picture--ILP and Memory Big Picture--ILP and Memory SystemsSystems
Simplepipelining
Scheduledpipelines
Multipleissue
Dynamicschedulin
g
Speculation
My view
bullNo performance wall but steeper slopes ahead
bullEasier territory is behind us
bullIndustry-research gap vanished
bullEnergy efficiency may be key limit
ILPMountai
n
Multilevelcaches amp buffers
Critical word amp early restart
Compilerprefetchin
g
Multipathprefetchin
g
Simplecaches
CacheMountai
n
Per
form
ance
01
1
10
100
1965 1970 1975 1980 1985 1990 1995
Supercomputers
Minicomputers
Mainframes
Microprocessors
Microprocessors today where Microprocessors today where they are and what can dothey are and what can do
Tran
sist
ors
1000
10000
100000
1000000
10000000
100000000
1970 1975 1980 1985 1990 1995 2000 2005
Bit-level parallelism Instruction-level Thread-level ()
i4004
i8008i8080
i8086
i80286
i80386
R2000
Pentium
R10000
R3000
Microprocessors where they goMicroprocessors where they go
Intel more TransistorIntel more Transistor
Intel Faster DevicesIntel Faster Devices
Number of Transistors in Number of Transistors in Intelrsquos processorsIntelrsquos processors
Higher level parallelismHigher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The more prononuced are
a) simultaneons multithreaded (SMT) processor and
b) chip multiprocessors (CMT)
MultithreadingMultithreading
Microprocessor can execute multiple operations at a time4 or 6 operations per cycle
Hard to achieve this level of parallelism from single program
Can we run multiple programs (threads) on (single) processor without much effort
Simultaneous multithreading (SMT) or Hyperthreading is a solution
Parallel Thread Sequencing ModelParallel Thread Sequencing Model
Principles of SMTPrinciples of SMT
Multithreading in todayrsquos Multithreading in todayrsquos processorsprocessors
Today many high-end microprocessors are multithreaded (eg Intel Pentium 4)
Support for 2-4 threads but expect to get only 13X improvement in throughput
Chip MultiprocessorChip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model
CMP is a simple very powerful techiniques to obtain more performance in a power-effecient manner
The idea is to put several microprocessors on a single die This type of architecture is reffered also as Multiprocessor System-on-Chip (MPSoC)
The performance of small-scale CMP scales close to linear with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model - continue - continue
CMP is an atractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications we meet in network processors multimedia hubs signal processors etc
MPSoCs are usually implemented as heterogenous systems
CMT and SMT can coexist-a CMP die can integrate several SMT microprocessors
Generic circa-2010 Microprocessor
4–8 general-purpose processing engines on chip, used to execute
– independent programs
– explicitly parallel programs (when possible)
– speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8 x 4096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
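To get a feel for the latencies listed above, the following minimal sketch computes an average memory access time for a two-level cache hierarchy. Only the cycle counts (primary cache 2 or 1 cycles, secondary cache 5 or 7 cycles, memory 50 cycles) come from the table; the miss rates and the function name are illustrative assumptions, and using the same miss rate for a 128-kbyte and a 16-kbyte cache is a deliberate simplification.

```python
def average_access_time(l1_cycles, l2_cycles, mem_cycles, l1_miss_rate, l2_miss_rate):
    """Average memory access time (cycles) for a two-level cache hierarchy."""
    # Every access pays the L1 latency; misses additionally pay L2, then memory.
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * mem_cycles)

# Miss rates are assumptions chosen only for illustration, not from the table.
ASSUMED_L1_MISS = 0.05
ASSUMED_L2_MISS = 0.20

# Cycle counts taken from the table (superscalar/SMT column vs. CMP column).
superscalar_amat = average_access_time(2, 5, 50, ASSUMED_L1_MISS, ASSUMED_L2_MISS)
cmp_amat = average_access_time(1, 7, 50, ASSUMED_L1_MISS, ASSUMED_L2_MISS)

print(f"superscalar/SMT: {superscalar_amat:.2f} cycles, CMP: {cmp_amat:.2f} cycles")
```

In practice the smaller per-CPU caches of the CMP would likely miss more often, so the comparison only shows how the latencies combine, not which design wins.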
The microprocessor tomorrow
When we say “microprocessor” tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that “Where you stand depends on where you sit.”
In this context, starting from our positions and experiences, this is our view on the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education – not just the undergraduate curriculum – organized to support today’s graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change – so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT.
Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast.
Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields.
More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context.
The engineer must design under constraints that include global, cultural and business contexts, and so must understand them.
These, too, are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
– Web-based teaching
– distance learning
– electronic books, and
– interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
As conclusion concerning memory – problems continue
If the memory research community were to follow the microprocessor community’s lead by leaning more heavily on architecture- and system-level solutions, in addition to technology-level solutions, to achieve higher performance, the gap might begin to close.
One expects that over the coming decade memory subsystem design will be the only important design issue for microprocessors.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters that most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change.
b) Burst width, which refers to data access granularity, can effect a 15% performance change.
c) Magnetic RAM (MRAM) – a new type of memory
Magnetic RAM – MRAM or NanotechRAM (NRAM)
Based on nanoscale semiconductor technology
A nanotechnology RAM device consists of tiny carbon nanotubes
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm – just a few thousand atoms – in diameter on a silicon wafer.
MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM).
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM).
Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] (axis 0 to 120; points at 100 nm, 90 nm, 70 nm and 55 nm) and density [Gb] (log axis 0.1 to 100) versus year (2003, 2005, 2010) for FLASH (SLC, MLC 2 bits/cell, MLC 3 bits/cell) and DRAM.]
Surpassing the Prediction from Moore’s Law – DRAM vs MRAM
The famous Moore’s Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
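The difference between the two doubling periods compounds quickly. The sketch below compares them, assuming a starting density of 1 Gb and a six-year horizon; both values and the function name are chosen only for illustration.

```python
def density_after(years, start_gb, doubling_period_years):
    """Density after `years`, given exponential growth with a fixed doubling period."""
    return start_gb * 2 ** (years / doubling_period_years)

START_GB = 1.0   # assumed starting density, for illustration only
YEARS = 6        # assumed horizon, for illustration only

moore_density = density_after(YEARS, START_GB, 1.5)   # doubling every 1.5 years
flash_density = density_after(YEARS, START_GB, 1.0)   # doubling every year

print(f"After {YEARS} years: {moore_density:.0f} Gb (1.5-year doubling) "
      f"vs {flash_density:.0f} Gb (yearly doubling)")
```

Under these assumptions the yearly-doubling curve ends up four times ahead of the 1.5-year curve after six years.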
[Figure: memory density [Gb] (log axis 0.1 to 100) versus year (2000, 2005, 2010) for FLASH (2-fold density growth per year; MLC 2 bits/cell and MLC 3 bits/cell), DRAM, and the Moore’s Law trend line.]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will keep leading overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore’s Law.
Overall memory prediction roadmap – continued
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline – Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point?
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh.
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production.
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh.
Power consumption
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
“CMOS circuits dissipate little power by nature. So believed circuit designers.” (Kuroda-Sakurai ’95)
“By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced.”
Power dissipation in time
[Figure: power (W), log axis 0.01 to 100, versus year (’80, ’85, ’90, ’95); trend line: x4 every 3 years.]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage [V] (axis 0 to 2.5), power per chip [W] and VDD current [A] (secondary axes 0 to 200 and 0 to 500) versus year, 1998 to 2014.]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA) and the Taiwan Semiconductor Industry Association (TSIA).
(Taken from Sakurai’s ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption – New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits, together with a minor static term, are:
\[ P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4 \]
where
P_1 – capacitive switching power (dynamic, dominant)
P_2 – short-circuit power (dynamic)
P_3 – leakage current power (static)
P_4 – static power dissipation (minor)
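The terms of this power equation can be evaluated directly. The following is a minimal sketch; the function name and all numeric operating-point values are hypothetical, chosen only to show how the components add up.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p_static=0.0):
    """Evaluate P_avg = p_t*C_L*Vdd^2*f_clk + I_sc*Vdd + I_leakage*Vdd + P_4 (watts)."""
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leak * v_dd                      # leakage power
    return {"P1": p1, "P2": p2, "P3": p3, "P4": p_static,
            "total": p1 + p2 + p3 + p_static}

# Hypothetical operating point (assumed values, not from the slides):
# activity 0.2, 1 nF switched load, 2.5 V supply, 100 MHz clock,
# 5 mA short-circuit current, 1 mA leakage current.
breakdown = average_power(p_t=0.2, c_load=1e-9, v_dd=2.5, f_clk=100e6,
                          i_sc=5e-3, i_leak=1e-3)

for name, watts in breakdown.items():
    print(f"{name}: {watts * 1e3:.1f} mW")
```

With these assumed numbers the switching term accounts for most of the total, which matches its description as the dominant component.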
Research Efforts in Low-Power Design
\[ P_{sw} = p_t C_L V_{dd}^2 f_{CLK} \]
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
– supply voltage
– load capacitance
– switching activity
Reducing the supply voltage brings a quadratic improvement.
Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed.
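The three knobs can be compared with a small sketch based on the switching-power relation P = p_t C_L Vdd^2 f_CLK. The function name and the choice of halving each parameter are assumptions made only for illustration; the clock frequency is held fixed.

```python
def relative_switching_power(v_scale=1.0, c_scale=1.0, activity_scale=1.0):
    """Switching power relative to the baseline when each parameter is scaled."""
    # Power is linear in activity and load capacitance, quadratic in supply voltage.
    return activity_scale * c_scale * v_scale ** 2

half_voltage = relative_switching_power(v_scale=0.5)
half_capacitance = relative_switching_power(c_scale=0.5)
half_activity = relative_switching_power(activity_scale=0.5)

print(f"Vdd halved: {half_voltage:.2f}x, C_L halved: {half_capacitance:.2f}x, "
      f"activity halved: {half_activity:.2f}x of baseline power")
```

Halving the supply voltage cuts switching power to a quarter, while halving capacitance or activity only halves it, which is the quadratic improvement mentioned above.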
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V]; voltage axis marked at 0.6, 3.0 and 5.0.]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is:
Decreasing the activity of some parts within the VLSI IC
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
a) Reduction in f_CLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product p_t C_L is called the average switched capacitance per cycle. This capacitance can be reduced at the system, architectural, RTL, circuit or technology level.
General Approaches to Reduce Power
Low Power and Low Energy System Design
(higher levels: higher impact, more options)
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology.
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention.
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four local clock domains, each with its own PLL/DLL pair (PLL 1/DLL 1 through PLL 4/DLL 4) and output frequencies labeled f11 f12, f21 f21, f31 f32 and f41 f41.]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based and DLL-based clock generation for a digital system. PLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO, divider by N, regulated voltage, clock distribution & frequency multiplier. DLL based: phase detector (CLKREF, CLKFB, up/down), current pump, loop filter, control, VCDL built from delay cells DC1, DC2, ... DCn (delay T_VCDL, in/out), and logic producing f1, 2f1, ... nf1.]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating. A latch driven by clock and enable produces a gated clock for the target flip-flops (DFF with D, C, Q); waveforms show clock, enable and gated-clock with activated and deactivated intervals.]
[Figure: clock distribution. A PLL (clock generator) distributes Clk to modules A, B, C and D through gates controlled by Enable_A, Enable_B and Enable_C.]
The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
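The latch-based gating scheme named in the figure can be mimicked in a few lines. This is a behavioural sketch under assumptions: the latch is taken to be transparent while the clock is low, the gate is a simple AND, and the sample waveforms are invented for illustration.

```python
def gated_clock_trace(clock, enable):
    """Simulate latch-based clock gating: gated = clock AND latched(enable)."""
    latched = 0
    gated = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            # Latch is transparent while the clock is low, so enable changes
            # during the high phase cannot create glitches on the gated clock.
            latched = en
        gated.append(clk & latched)
    return gated

# Illustrative waveforms (assumed): four clock pulses, enable toggling freely.
clock  = [0, 1, 0, 1, 0, 1, 0, 1]
enable = [1, 1, 0, 1, 0, 0, 1, 0]

gated = gated_clock_trace(clock, enable)
print("gated clock:", gated)
print("clock pulses delivered:", sum(gated), "of", sum(clock))
```

Only the pulses for which enable was high before the rising edge reach the target flip-flops, so the downstream module sees fewer clock transitions and switches less.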
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip is a less aggressive approach that is attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
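A rough estimate of the benefit can be sketched by summing C·V² over modules, with critical-path modules kept at the high supply. The module list, capacitance values (arbitrary units) and the two voltage levels are hypothetical, and the overhead of level converters between voltage domains is ignored.

```python
def total_switching_energy(modules, v_high, v_low):
    """Sum of C * V^2 over modules; critical modules use v_high, others v_low."""
    energy = 0.0
    for capacitance, on_critical_path in modules:
        voltage = v_high if on_critical_path else v_low
        energy += capacitance * voltage ** 2
    return energy

# Hypothetical design: (switched capacitance in arbitrary units, on critical path?)
MODULES = [(2.0, True), (3.0, False), (5.0, False)]
V_HIGH, V_LOW = 3.3, 2.0   # assumed supply levels in volts

single_supply = total_switching_energy(MODULES, V_HIGH, V_HIGH)
dual_supply = total_switching_energy(MODULES, V_HIGH, V_LOW)
saving = 1.0 - dual_supply / single_supply

print(f"energy saving with two supplies: {saving:.0%}")
```

The saving depends entirely on how much switched capacitance sits off the critical path, which is why the assignment of modules to voltage levels matters.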
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
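[Figure: power manager. An observer collects workload information (observations) from the system, and a controller issues commands to it.]
[Figure: power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW) and SLEEP (P = 0.16 mW); state labels “wait for interrupt” and “wait for wake-up event”; transition times labeled ~10 µs, ~90 µs and 160 ms.]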
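The state powers in the figure make it easy to estimate what a power manager can save. The sketch below uses the three power levels from the figure; the schedule of time spent in each state is an assumption, and transition time and energy costs are deliberately ignored.

```python
# State powers in milliwatts, as labeled in the power state machine figure.
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}

def schedule_energy_mj(schedule):
    """Energy in millijoules for a list of (state, seconds) intervals."""
    # mW * s = mJ; transition overheads are not modeled in this sketch.
    return sum(STATE_POWER_MW[state] * seconds for state, seconds in schedule)

# Assumed workload: 1 s of work, 2 s of short idle gaps, 10 s of long inactivity.
managed = [("RUN", 1.0), ("IDLE", 2.0), ("SLEEP", 10.0)]
always_on = [("RUN", 13.0)]

managed_mj = schedule_energy_mj(managed)
always_on_mj = schedule_energy_mj(always_on)

print(f"managed: {managed_mj:.1f} mJ, always running: {always_on_mj:.1f} mJ")
```

A real policy also has to weigh the wake-up delay of each state against the expected idle time, which is the decision the observer and controller make at run time.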
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown by clock, memory, control, I/O and datapath (slice values shown: 12, 16, 25, 43). Dynamic instruction statistics by arithmetic op, logical op, compare op, data move, control flow and others.]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline – Microprocessors’ Generations
• First Generation, 1971-78 – Behind the power curve
• Second Generation, 1979-85 – Becoming “real” computers
• Third Generation, 1985-89 – Challenging the “establishment”
• Fourth Generation, 1990- – Architectural and performance leadership
The microprocessor today
When we say “microprocessor” today, we generally mean the shaded area of the figure.
The First Generation, 1971-78
Getting enough bits and transistors
Transistor counts < 50,000
Performance < 0.5 MIPS
Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
First general-purpose single-chip microprocessor
Shipped in 1971
8-bit architecture, 4-bit implementation
2300 transistors
Performance < 0.1 MIPS
8008: 8-bit implementation in 1972
– 3500 transistors
– First microprocessor-based computer (Micral)
  Targeted at laboratory instrumentation
  Mostly sold in Europe
Intel 8080
Intel’s first 16-bit architecture
– Delivered in 1974
– 4800 transistors
– Performance < 0.2 MIPS
Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
  $297, or $395 with case
  256 bytes of memory, expandable to 64K
  Keyboard and floppy
  100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
  Homebrew Computer Club
Intel 8086
Introduced in 1978
– Performance < 0.5 MIPS
New 16-bit architecture
– “Assembly language” compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
In 1981, IBM introduces the PC
– Based on 8088, the 8-bit bus version of 8086
Second Generation, 1979-85
Becoming “real” computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs and PCs based on microprocessors
Transistors > 50,000
Performance <= 1 MIPS
Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
Major architectural step in microprocessors
– First 32-bit architecture
  initial 16-bit implementation
– First flat 32-bit address
  Support for paging
– General-purpose register architecture
  Loosely based on PDP-11
First implementation in 1979
– 68,000 transistors
– < 1 MIPS
Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation, 1985-89
Challenging the “establishment”
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
Transistors < 500K
Performance > 5 MIPS
Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation, 1990-
Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
Transistors > 1M
Clock rates > 100 MHz
Performance > 50 MIPS
Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
Generation 4.5 – same basic approach, but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
Increase performance at 1.6x per year
– True from 1985 to present
Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to high clock rates
– More transistors
Architectural ideas turn transistors into performance
– Responsible for about half the yearly performance growth
Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism
Memory Hierarchies
Caches hide latency of DRAM and increase bandwidth
– CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches: add another level of caching
  First multilevel cache: 1986
  Secondary cache sizes today: 128 KB to 4-16 MB
Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
  early restart after cache miss (1992)
  nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
  prefetching: instruction to bring data into cache early
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
  superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out-of-order
– speculative execution depending on branch prediction
“Off-the-shelf” ILP techniques yielded 20 year path
MIPS R4000
First 64-bit architecture
Integrated caches
– On-chip
– Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
Implemented in 1991
– 1.3M transistors
– 50 MIPS
Used primarily as attached processor (e.g. graphics)
MIPS R10000
First speculative processor
– Instructions scheduled and executed out-of-order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in-flight)
– Maintain precise state by completing instructions in order
Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today’s Uniprocessor ILP Menu
Software Techniques
– Static scheduling
– Static issue (i.e. VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e. superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
Wide variety of approaches, both hardware and compiler intensive
Software techniques: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware techniques: more stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present
Big Picture – ILP and Memory Systems
[Figure: two “mountains” of techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log axis 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log axis 1,000 to 100,000,000) versus year, 1970 to 2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000; eras marked as bit-level parallelism, instruction-level parallelism and thread-level parallelism (?).]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel’s processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The more prononuced are
a) simultaneons multithreaded (SMT) processor and
b) chip multiprocessors (CMT)
MultithreadingMultithreading
Microprocessor can execute multiple operations at a time4 or 6 operations per cycle
Hard to achieve this level of parallelism from single program
Can we run multiple programs (threads) on (single) processor without much effort
Simultaneous multithreading (SMT) or Hyperthreading is a solution
Parallel Thread Sequencing ModelParallel Thread Sequencing Model
Principles of SMTPrinciples of SMT
Multithreading in todayrsquos Multithreading in todayrsquos processorsprocessors
Today many high-end microprocessors are multithreaded (eg Intel Pentium 4)
Support for 2-4 threads but expect to get only 13X improvement in throughput
Chip MultiprocessorChip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model
CMP is a simple very powerful techiniques to obtain more performance in a power-effecient manner
The idea is to put several microprocessors on a single die This type of architecture is reffered also as Multiprocessor System-on-Chip (MPSoC)
The performance of small-scale CMP scales close to linear with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model - continue - continue
CMP is an atractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications we meet in network processors multimedia hubs signal processors etc
MPSoCs are usually implemented as heterogenous systems
CMT and SMT can coexist-a CMP die can integrate several SMT microprocessors
Generic circa 2010 Generic circa 2010 MicroprocessorMicroprocessor
4 ndash 8 general-purpose processing engines on chip used to execute independent programs
Explicitly parallel programs (when possible) Speculatively parallel threads
Special-purpose processing units (eg DSP functionality)
Elaborate memory hierarchyElaborate inter-chip communication facilities
Characteristics of superscalar simultaneous Characteristics of superscalar simultaneous multithreading and chip multiprocessor architecturesmultithreading and chip multiprocessor architectures
Characteristic SuperscalarSimultaneousmultithreading
Chipmultiprocessor
Number of CPUs 1 1 8CPU issue width 12 12 2 per CPUNumber of threads 1 8 1 per CPUArchitecture registers (for integer and floating point) 32 32 per thread 32 per CPUPhysical registers (for integer and floating point 32 + 256 256 + 256 32 + 32 per CPUInstruction window size 256 256 32 per CPUBranch predictor table size (entries) 32768 32768 8x4096Return stack size 64 entries 64 entries 8x8 entriesInstruction (I) and data (D) cache organization 1x8 banks 1x8 banks 1 bankI and D cache sizes 128 kbytes 128 kbytes 16 kbytes per CPUI and D cache associativities 4-way 4-way 4-wayI and 0 cache line sizes (bytes) 32 32 32I and P cache access times (cycles) 2 2 1Secondary cache organization (Mbytes) 1x8 banks 1x8 banks 1x8 banksSecondary cache size (bytes) 8 8 8Secondary cache associativity 4-way 4-way 4-waySecondary cache line size (bytes) 32 32 32Secondary cache access time (cycles) 5 5 7Secondary cache occupancy per access (cycles) 1 1 1Memory organization (no of banks) 4 4 4Memory access time (cycles) 50 50 50Memory occupancy per access (cycles) 13 13 13
The microprocessor tomorrowThe microprocessor tomorrow When we say ldquomicroprocessorrdquo tomorrow we generally mean the shaded area of the figure
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Challenges in EducationChallenges in Education
bullChanges in curriculaChanges in curriculabullFundamentalsFundamentalsbullA sort of the challenge we should acceptA sort of the challenge we should accept
Challenges in Education
It has often been said that “Where you stand depends on where you sit.”
In this context, starting from our positions and experiences, this is our view on the question: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education – not just the undergraduate curriculum – organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change – so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now – some examples
• Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
• Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
• Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
• Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. They too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching,
• distance learning,
• electronic books, and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Memory Hierarchy Solutions
Organization choices (CPU architecture, L1/L2 cache organizations, DRAM architecture, DRAM speed) can affect total execution time by a factor of two.
System-level parameters most affect performance:
a) The number of independent channels and banks connecting the CPU to the DRAMs can effect a 25% performance change.
b) Burst width, which refers to data access granularity, can effect a 15% performance change.
c) Magnetic RAM (MRAM) – a new type of memory
Magnetic RAM – MRAM or NanotechRAM (NRAM)
• Based on nanoscale semiconductor technology
• A nanotechnology RAM device consists of tiny carbon nanotubes
• Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
• A 10 Gbit device consists of 10 billion carbon nanotubes that are 1 nm – just a few thousand atoms – in diameter on a silicon wafer.
• MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM).
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM, and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] (linear scale, 0 to 120; points at 100 nm, 90 nm, 70 nm, 55 nm) and density [Gb] (log scale, 0.1 to 100) versus year (2003, 2005, 2010) for FLASH (SLC, MLC 2 bits/cell, MLC 3 bits/cell) and DRAM; density labels from 0.512 Gb up to 16 Gb]
Surpassing the Prediction from Moore's Law – DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
[Figure: density [Gb] (log scale, 0.1 to 100) versus year (2000, 2005, 2010) for FLASH (SLC, MLC 2 bits/cell, MLC 3 bits/cell; 2-fold density increase per year) and DRAM, compared with the Moore's Law trend line]
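To make the gap between the two growth rates concrete, here is a minimal Python sketch that compares the commonly cited Moore's Law doubling period of about 1.5 years with a doubling every year. The function name `density_growth` and the 10-year horizon are illustrative assumptions, not values taken from the slides.

```python
def density_growth(years, doubling_period_years):
    """Return the multiplicative density increase after `years`,
    assuming density doubles once every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


# Illustrative horizon (an assumption, not a figure from the slides).
horizon = 10
moore = density_growth(horizon, 1.5)  # doubling every 1.5 years
flash = density_growth(horizon, 1.0)  # doubling every year

print(f"Doubling every 1.5 years: about x{moore:.0f} after {horizon} years")
print(f"Doubling every year:      about x{flash:.0f} after {horizon} years")
```

Under these assumptions the yearly doubling ends up roughly ten times ahead after a decade, which is the kind of divergence the chart is meant to convey.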
Overall memory prediction roadmap
• Even though the density growth of DRAM will slow down, DRAM will keep leading overall memory technology and will be able to reach 8 Gb density in ten years.
• High-density memory growth will surpass the prediction from Moore's Law.
Overall memory prediction roadmap – cont.
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline – Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point?
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh.
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production.
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh.
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
“CMOS circuits dissipate little power by nature. So believed circuit designers.” (Kuroda-Sakurai '95)
“By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced.”
Power dissipation in time
[Figure: power (W), log scale from 0.01 to 100, versus year (1980, 1985, 1990, 1995); trend of x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage [V] (0 to 2.5), power per chip [W] (0 to 200), and VDD current [A] (0 to 500) versus year, 1998 to 2014; curves labeled Voltage, Power, and Current]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA), and the Taiwan Semiconductor Industry Association (TSIA).
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: Your car starter]
Power Consumption – New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits, together with a minor static term, are

\[ P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4 \]

where
P1 – capacitive switching power (dynamic, dominant)
P2 – short-circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
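As a rough numerical illustration of the expression above, the following Python sketch evaluates each term separately. The function name `average_power` and every parameter value in the example are hypothetical, chosen only to show the relative size of the terms; `p_t` is taken to be the switching activity factor.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p_static=0.0):
    """Return (P1, P2, P3, P4, total) in watts for a CMOS circuit.

    p_t      -- switching activity factor (dimensionless)
    c_load   -- load capacitance C_L in farads
    v_dd     -- supply voltage in volts
    f_clk    -- clock frequency in hertz
    i_sc     -- short-circuit current in amperes
    i_leak   -- leakage current in amperes
    p_static -- other static dissipation P4 in watts
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk  # capacitive switching power
    p2 = i_sc * v_dd                       # short-circuit power
    p3 = i_leak * v_dd                     # leakage power
    p4 = p_static                          # minor static dissipation
    return p1, p2, p3, p4, p1 + p2 + p3 + p4


# Hypothetical operating point, for illustration only.
p1, p2, p3, p4, total = average_power(
    p_t=0.2, c_load=1e-9, v_dd=2.5, f_clk=100e6, i_sc=1e-3, i_leak=1e-4
)
print(f"P1={p1:.4f} W  P2={p2:.4f} W  P3={p3:.5f} W  total={total:.4f} W")
```

With these made-up numbers the switching term dominates, which matches the slide's remark that P1 is the dominant component.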
Research Efforts in Low-Power Design

\[ P_{sw} = p_t C_L V_{dd}^2 f_{CLK} \]

Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
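The three knobs above can be compared with a short Python sketch based on the switching-power relation P = p_t · C_L · V_dd² · f_CLK at a fixed clock frequency. The helper name `relative_switching_power` and the example scaling factors (a 5 V to 3.3 V supply change, a 20% capacitance reduction) are assumptions made for illustration.

```python
def relative_switching_power(v_scale=1.0, c_scale=1.0, activity_scale=1.0):
    """Return switching power relative to the baseline design.

    Each argument is the ratio new/old for supply voltage, load
    capacitance, and switching activity; clock frequency is held fixed.
    """
    return activity_scale * c_scale * v_scale ** 2


# Assumed example: lowering the supply from 5 V to 3.3 V.
by_voltage = relative_switching_power(v_scale=3.3 / 5.0)
# Assumed example: trimming the load capacitance by 20%.
by_capacitance = relative_switching_power(c_scale=0.8)

print(f"Supply 5 V -> 3.3 V: {by_voltage:.2f} of baseline power")
print(f"Capacitance -20%:    {by_capacitance:.2f} of baseline power")
```

The voltage term enters squared, so a 34% supply reduction saves more than half of the switching power, whereas capacitance and activity reductions only help linearly.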
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V]; supply-voltage axis marked at 0.6, 3.0, and 5.0 V]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• Much current research is oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is to decrease the activity of some parts of a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product ptCL is called the average switched capacitance per cycle. This capacitance can be reduced at the system, architectural, RTL, circuit, or technology level.
Low Power and Low Energy System Design
The design of low power circuits can be tackled at different levels, from system to technology (higher levels: higher impact, more options).
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
• Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention.
• This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL distributes fCLK to four clock domains, each with its own PLL and DLL (PLL 1/DLL 1 to PLL 4/DLL 4), generating local frequencies labeled f11, f12, f21, f31, f32, and f41]
Energy Minimization Using Multiple Frequency
[Figure, PLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO, and divider by N; a regulated voltage and a clock distribution and frequency multiplier feed the logic of the digital system]
[Figure, DLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, and loop filter control a voltage-controlled delay line (VCDL, delay TVCDL) built from delay cells DC1, DC2, ..., DCn; the digital system receives frequencies f1, 2f1, ..., nf1]
Clock Gating & Clock Distribution as Techniques to Reduce Power
The use of a gated clock is the most common approach to reduce energy. Unused modules are turned off by suppressing the clock to the module.
[Figure, clock gating: a latch samples the enable signal and is combined with the clock to produce the gated clock that drives the target flip-flops (DFF with inputs D, C and output Q); the waveform shows activated and deactivated intervals]
[Figure, clock distribution: a PLL (clock generator) distributes Clk to modules A, B, C, and D, gated by Enable_A, Enable_B, and Enable_C]
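The effect of suppressing the clock to an unused module can be sketched with a small behavioral model in Python. It follows the latch-based scheme named in the figure labels (latch, clock, enable, gated-clock); the function names, the sample waveforms, and the use of delivered rising edges as a proxy for switching activity are assumptions made for illustration.

```python
def gated_clock(clock, enable):
    """Latch-based clock gating.

    The enable is captured by a latch that is transparent while the
    clock is low, then ANDed with the clock, so a change of enable
    during the high phase cannot create a glitch on the gated clock.
    """
    latched = 0
    gated = []
    for clk, en in zip(clock, enable):
        if clk == 0:              # latch transparent in the low phase
            latched = en
        gated.append(clk & latched)
    return gated


def rising_edges(signal):
    """Count 0 -> 1 transitions, a proxy for clocked switching activity."""
    return sum(1 for prev, cur in zip(signal, signal[1:])
               if prev == 0 and cur == 1)


# Hypothetical waveforms: 8 clock cycles, module idle for the middle 4.
clock = [0, 1] * 8
enable = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
gated = gated_clock(clock, enable)

print("edges without gating:", rising_edges(clock))
print("edges with gating:   ", rising_edges(gated))
```

In this toy trace the target flip-flops see only half of the clock edges, so the clock-related switching in the idle module is avoided while its state is preserved.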
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip, as a less aggressive approach, is attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure, power manager: an OBSERVER collects workload information and observations from the SYSTEM, and a CONTROLLER issues commands to it]
[Figure, power state machine: three states, RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt), and SLEEP (P = 0.16 mW, wait for wake-up event), with transition times labeled ~10 µs, ~90 µs, and 160 ms]
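A power manager has to decide whether an idle period is long enough to justify entering SLEEP. The Python sketch below estimates the break-even idle time from the state powers in the figure (400 mW, 50 mW, 0.16 mW). Several points are assumptions rather than facts from the slides: the transition labels are read as about 90 µs to enter SLEEP and 160 ms to wake up, the system is assumed to draw RUN power during transitions, and all function names are hypothetical.

```python
# State powers from the power state machine figure, in watts.
P_RUN, P_IDLE, P_SLEEP = 0.400, 0.050, 0.00016

# Assumed reading of the transition labels, in seconds.
T_TO_SLEEP = 90e-6   # time to enter SLEEP
T_WAKE_UP = 160e-3   # time to return from SLEEP to RUN


def energy_idle(t):
    """Energy (J) spent if the system waits in IDLE for t seconds."""
    return P_IDLE * t


def energy_sleep(t, p_transition=P_RUN):
    """Energy (J) if the system sleeps during an idle period of t seconds.

    Assumes both transitions fit inside the idle period and that the
    system draws p_transition while changing state.
    """
    t_tr = T_TO_SLEEP + T_WAKE_UP
    if t < t_tr:
        raise ValueError("idle period shorter than the transition time")
    return p_transition * t_tr + P_SLEEP * (t - t_tr)


def break_even_time(p_transition=P_RUN):
    """Idle duration at which sleeping and idling cost the same energy."""
    t_tr = T_TO_SLEEP + T_WAKE_UP
    return t_tr * (p_transition - P_SLEEP) / (P_IDLE - P_SLEEP)


def choose_state(expected_idle_time):
    """Pick the lower-energy state for a predicted idle period."""
    return "SLEEP" if expected_idle_time > break_even_time() else "IDLE"


print(f"break-even idle time: {break_even_time():.2f} s")
print("0.5 s idle ->", choose_state(0.5))
print("5.0 s idle ->", choose_state(5.0))
```

Under these assumptions SLEEP only pays off for idle periods longer than roughly a second, because the long wake-up time dominates the transition cost.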
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure, power breakdown: pie chart with slices for Clock, Memory, Control, I/O, and Datapath (slice labels 12, 16, 25, and 43)]
[Figure, dynamic instruction statistics: pie chart with slices for Compare op, Logical op, Others, Data Move, Control Flow, and Arithmetic op (slice values not reliably recoverable)]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline – Microprocessors' Generations
• First Generation, 1971-78
– Behind the power curve
• Second Generation, 1979-85
– Becoming “real” computers
• Third Generation, 1985-89
– Challenging the “establishment”
• Fourth Generation, 1990-
– Architectural and performance leadership
The microprocessor today
When we say “microprocessor” today, we generally mean the shaded area of the figure.
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
• Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
– 3500 transistors
– First microprocessor-based computer (Micral)
  · Targeted at laboratory instrumentation
  · Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
– Delivered in 1974
– 4800 transistors
– Performance < 0.2 MIPS
• Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
  · $297, or $395 with case
  · 256 bytes of memory, expandable to 64K
  · Keyboard and floppy
  · 100-line bus becomes S-100: first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
  · Homebrew Computer Club
Intel 8086
• Introduced in 1978
– Performance < 0.5 MIPS
• New 16-bit architecture
– “Assembly language” compatible with 8080
– 29000 transistors
– Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
– Based on 8088: 8-bit bus version of 8086
Second Generation 1979-85
• Becoming “real” computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs, and PCs based on microprocessors
• Transistors > 50000
• Performance <= 1 MIPS
• Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
– First 32-bit architecture (initial 16-bit implementation)
– First flat 32-bit address; support for paging
– General-purpose register architecture (loosely based on PDP-11)
• First implementation in 1979
– 68000 transistors
– < 1 MIPS
• Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the “establishment”
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
• Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
– 125000 transistors
– 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5 – same basic approach, but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
– True from 1985-present
• Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to high clock rates
– More transistors: architectural ideas turn transistors into performance
  · Responsible for about half the yearly performance growth
• Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
– CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching
  · First multilevel cache: 1986
  · Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
  · Early restart after cache miss (1992)
  · Nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
  · Prefetching: instruction to bring data into cache early
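The benefit of adding a second cache level can be illustrated with the standard average memory access time (AMAT) calculation. In the Python sketch below the latencies (2, 5, and 50 cycles) and especially the miss rates (5% and 20%) are illustrative assumptions, not measurements; the function names are hypothetical.

```python
def amat_two_level(l1_time, l1_miss_rate, l2_time, l2_miss_rate, mem_time):
    """Average memory access time (cycles) with L1 and L2 caches.

    Every access pays the L1 time; L1 misses additionally pay the L2
    time; L2 misses additionally pay the main-memory time.
    """
    return l1_time + l1_miss_rate * (l2_time + l2_miss_rate * mem_time)


def amat_single_level(l1_time, l1_miss_rate, mem_time):
    """Average memory access time (cycles) with only an L1 cache."""
    return l1_time + l1_miss_rate * mem_time


# Assumed latencies in cycles and hypothetical miss rates.
with_l2 = amat_two_level(l1_time=2, l1_miss_rate=0.05,
                         l2_time=5, l2_miss_rate=0.20, mem_time=50)
without_l2 = amat_single_level(l1_time=2, l1_miss_rate=0.05, mem_time=50)

print(f"AMAT with L2:    {with_l2:.2f} cycles")
print(f"AMAT without L2: {without_l2:.2f} cycles")
```

Even with these modest assumed miss rates, the secondary cache hides most of the DRAM latency, which is why multilevel caches became standard as the CPU-DRAM gap grew.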
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
  · Superscalar: uses dynamic issue decision (HW driven)
  · VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple issue microprocessors
• 1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out-of-order
– speculative execution depending on branch prediction
• “Off-the-shelf” ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
– On-chip
– Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
• First multiple issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
• Implemented in 1991
– 1.3M transistors
– 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
– Instructions scheduled and executed out-of-order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in-flight)
– Maintains precise state by completing instructions in order
• Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
– Uses a compiler-centric approach while avoiding its disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
• Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques
– Static scheduling
– Static issue (i.e. VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
– Trade-offs: lower hardware complexity, more longer-range analysis, more machine dependence
• Hardware techniques
– Dynamic scheduling
– Dynamic issue (i.e. superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
– Trade-offs: more stable performance, higher complexity, potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture – ILP and Memory Systems
[Figure: two “mountains” of techniques]
• ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation
• Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers, and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1000 to 100000000) versus year, 1970 to 2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000; eras labeled bit-level parallelism, instruction-level parallelism, and thread-level parallelism]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
• A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
• It is hard to achieve this level of parallelism from a single program
• Can we run multiple programs (threads) on a single processor without much effort?
• Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
• Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
• Support for 2-4 threads, but expect to get only 1.3X improvement in throughput
Chip Multiprocessor
• Several processor cores in one die
• Shared L2 caches
• Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
• CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner
• The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
• The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model – continued
• CMP is an attractive option to use when moving to a new process technology such as SoC
• Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
• MPSoCs are usually implemented as heterogeneous systems
• CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
• 4 – 8 general-purpose processing engines on chip, used to execute
– independent programs
– explicitly parallel programs (when possible)
– speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar simultaneous Characteristics of superscalar simultaneous multithreading and chip multiprocessor architecturesmultithreading and chip multiprocessor architectures
Characteristic SuperscalarSimultaneousmultithreading
Chipmultiprocessor
Number of CPUs 1 1 8CPU issue width 12 12 2 per CPUNumber of threads 1 8 1 per CPUArchitecture registers (for integer and floating point) 32 32 per thread 32 per CPUPhysical registers (for integer and floating point 32 + 256 256 + 256 32 + 32 per CPUInstruction window size 256 256 32 per CPUBranch predictor table size (entries) 32768 32768 8x4096Return stack size 64 entries 64 entries 8x8 entriesInstruction (I) and data (D) cache organization 1x8 banks 1x8 banks 1 bankI and D cache sizes 128 kbytes 128 kbytes 16 kbytes per CPUI and D cache associativities 4-way 4-way 4-wayI and 0 cache line sizes (bytes) 32 32 32I and P cache access times (cycles) 2 2 1Secondary cache organization (Mbytes) 1x8 banks 1x8 banks 1x8 banksSecondary cache size (bytes) 8 8 8Secondary cache associativity 4-way 4-way 4-waySecondary cache line size (bytes) 32 32 32Secondary cache access time (cycles) 5 5 7Secondary cache occupancy per access (cycles) 1 1 1Memory organization (no of banks) 4 4 4Memory access time (cycles) 50 50 50Memory occupancy per access (cycles) 13 13 13
The microprocessor tomorrowThe microprocessor tomorrow When we say ldquomicroprocessorrdquo tomorrow we generally mean the shaded area of the figure
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Challenges in EducationChallenges in Education
bullChanges in curriculaChanges in curriculabullFundamentalsFundamentalsbullA sort of the challenge we should acceptA sort of the challenge we should accept
ChalengeChalengess in Education in Education
It has often said that Where you stand depends on where you sit
In this context starting from our positions an experiences this is our view concerning the theme How shall we satisfy the long-term educational needs of engineers
How to organize a training of new How to organize a training of new engineers engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then
Is the whole system of engineering education ndash not just the undergraduate curriculum ndash organized to support todayrsquos graduate for the next 40 years
We think not on both counts
Our view amp our experienceOur view amp our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academy and industry) which admittedly has changed more rapidly than same other fields
Changes in curriculaChanges in curricula
Magnetic RAM (MRAM) or Nanotech RAM (NRAM)
Based on nanoscale semiconductor technology.
The nanotechnology RAM device consists of tiny carbon nanotubes.
Differing electrical charges swing the tubes into one of two positions, representing the ones and zeroes necessary for digital storage. Moreover, the tubes stay in position until a new signal resets them.
MRAM Capacity
The 10 Gbit device consists of 10 billion carbon nanotubes on a silicon wafer, each 1 nm (just a few thousand atoms) in diameter.
MRAM is nonvolatile, and it has the fast read-and-write performance of static RAM (SRAM).
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] (axis 0 to 120) and density [Gb] (log axis 0.1 to 100) versus year (2003, 2005, 2010) for DRAM and FLASH; design-rule points are labelled 100 nm, 90 nm, 70 nm and 55 nm, and FLASH densities are marked SLC, MLC (2 bits/cell) and MLC (3 bits/cell).]
Surpassing the Prediction from Moore's Law: DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
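The gap between the two doubling periods compounds quickly. The following is a minimal sketch of that arithmetic; the function name, the 1 Gb starting density and the 6-year horizon are illustrative assumptions, not values taken from the slides.

```python
def density_after(initial_gb, years, doubling_period_years):
    """Density after `years`, assuming one doubling every `doubling_period_years`."""
    return initial_gb * 2 ** (years / doubling_period_years)

# Hypothetical starting point: 1 Gb, projected 6 years ahead.
moore_gb = density_after(1.0, 6, 1.5)   # doubling every 1.5 years
flash_gb = density_after(1.0, 6, 1.0)   # doubling every year

print(f"Moore's Law pace: {moore_gb:.0f} Gb, yearly doubling: {flash_gb:.0f} Gb")
```

Under these assumptions the yearly-doubling curve ends up four times higher after six years, which is the kind of divergence the roadmap slide points to.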
[Figure: density [Gb] (log axis 0.1 to 100) versus year (2000, 2005, 2010) for FLASH and DRAM against the Moore's Law line; the FLASH curve is annotated "2-fold density per year" and marked MLC (2 bits/cell) and MLC (3 bits/cell).]
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep leading the overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore's Law.
Overall memory prediction roadmap (continued)
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figures: computer food-chain diagrams; surviving labels: Mainframe, Supercomputer.]
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• Viewpoint on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai '95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: power (W), log axis from 0.01 to 100, versus year (1980 to 1995); annotated trend: x4 every 3 years.]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage [V] (axis 0 to 2.5), power per chip [W] and VDD current [A] versus year (1998 to 2014), with one curve each for voltage, power and current.]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA) and the Taiwan Semiconductor Industry Association (TSIA). (Taken from Sakurai's ISSCC 2001 presentation.)
Power Delivery Problem (not just California)
Your car starter
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are

$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$

where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
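To see how the terms compare, the expression can be evaluated numerically. This is a minimal sketch; the function name and all operating-point values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p_static=0.0):
    """Return (P1, P2, P3, P4) in watts for the CMOS power expression.

    p_t      switching activity (probability of a power-consuming transition)
    c_load   load capacitance C_L in farads
    v_dd     supply voltage in volts
    f_clk    clock frequency in hertz
    i_sc     short-circuit current in amperes
    i_leak   leakage current in amperes
    p_static minor static dissipation P4 in watts
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leak * v_dd                      # leakage power
    return p1, p2, p3, p_static

# Hypothetical operating point, not taken from the slides.
p1, p2, p3, p4 = average_power(p_t=0.5, c_load=1e-9, v_dd=2.0,
                               f_clk=100e6, i_sc=0.01, i_leak=0.001)
total = p1 + p2 + p3 + p4
print(f"P1={p1:.3f} W  P2={p2:.3f} W  P3={p3:.3f} W  total={total:.3f} W")
```

With these assumed numbers the switching term P1 dominates, which matches the "dynamic, dominant" label attached to it.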
Research Efforts in Low-Power Design

$$P_{sw} = p_t C_L V_{dd}^2 f_{clk}$$

Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
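The quadratic effect of the supply voltage can be checked with a few lines of arithmetic. The sketch below assumes the switching-power expression given earlier, with clock frequency held constant; the function name and the 5 V to 3 V example are illustrative assumptions.

```python
def switching_power_ratio(v_new, v_old, c_ratio=1.0, activity_ratio=1.0):
    """Ratio of new to old switching power at a fixed clock frequency.

    c_ratio and activity_ratio are the new/old ratios of load capacitance
    and switching activity; both default to "unchanged".
    """
    return activity_ratio * c_ratio * (v_new / v_old) ** 2

# Lowering the supply from 5 V to 3 V alone.
voltage_only = switching_power_ratio(3.0, 5.0)

# Same voltage step combined with a hypothetical 20% cut in load capacitance.
voltage_and_cap = switching_power_ratio(3.0, 5.0, c_ratio=0.8)

print(f"voltage only: {voltage_only:.2f}, with capacitance cut: {voltage_and_cap:.3f}")
```

The voltage term enters squared while capacitance and activity enter linearly, which is why supply-voltage reduction is usually the first lever considered.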
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] plotted against supply voltage [V], with 0.6 V, 3.0 V and 5.0 V marked on the voltage axis.]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is to decrease the activity of some parts within the VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product pt·CL is called the average switched capacitance per cycle, and the main directions for reducing this capacitance are at the system, architectural, RTL, circuit or technology level
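The trade-off in point b) can be made concrete with a simple delay model. The sketch below uses the alpha-power-law approximation, delay ∝ Vdd / (Vdd - Vt)^alpha, which is an assumption brought in for illustration and does not appear in the slides; the threshold voltage, the exponent and the 5 V reference are likewise hypothetical.

```python
def relative_delay(v_dd, v_t=0.7, alpha=1.3):
    """Unnormalized gate delay under the alpha-power-law model (assumed)."""
    if v_dd <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return v_dd / (v_dd - v_t) ** alpha

def tradeoff(v_new, v_ref=5.0, v_t=0.7, alpha=1.3):
    """Return (delay ratio, switching-power ratio) relative to v_ref.

    The power ratio assumes an unchanged clock frequency, so only the
    quadratic Vdd term of the switching power changes.
    """
    delay = relative_delay(v_new, v_t, alpha) / relative_delay(v_ref, v_t, alpha)
    power = (v_new / v_ref) ** 2
    return delay, power

for v in (5.0, 3.0, 2.0, 1.5):
    delay, power = tradeoff(v)
    print(f"Vdd={v:.1f} V  delay x{delay:.2f}  power x{power:.2f}")
```

Under this model the power falls with the square of Vdd while the delay grows slowly at first and then sharply as Vdd approaches the threshold, which is the tension the list describes.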
Low Power and Low Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology. Higher levels offer higher impact and more options.
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention.
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four local clock generators (PLL 1/DLL 1 to PLL 4/DLL 4), each producing its own local frequencies (f11, f12, f21, f31, f32, f41).]
Energy Minimization Using Multiple Frequency
[Figure: two clock-generation schemes feeding a digital system. PLL based: phase detector, current pump (up/down), loop filter, VCO and divider by N, comparing CLKFB with CLKREF, with regulated voltage and clock distribution & frequency multiplier logic. DLL based: phase detector, current pump (up/down), loop filter and a voltage-controlled delay line (VCDL, delay TVCDL) built from delay cells DC1, DC2, ..., DCn with control logic, producing f1, 2f1, ..., nf1.]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating and clock distribution. Clock gating: a latch combines clock and enable into a gated clock that drives the target D flip-flops, which are thereby activated or deactivated. Clock distribution: a PLL clock generator distributes Clk to blocks A, B, C and D through Enable_A, Enable_B and Enable_C.]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
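The behaviour of a gated clock can be sketched in a few lines. The model below assumes the common latch-based gating cell (the enable is sampled only while the clock is low, then ANDed with the clock); the slide's figure shows a latch, clock, enable and gated clock, but the exact circuit is not recoverable, so this arrangement and the sample waveforms are assumptions.

```python
def gated_clock(clock, enable):
    """Latch-based clock gating over sampled 0/1 waveforms.

    The latch is transparent while the clock is low, so a change of
    `enable` during the high phase cannot cut a clock pulse short.
    """
    latched = 0
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en          # latch follows enable only while clock is low
        out.append(clk & latched)
    return out

def rising_edges(signal):
    """Count 0 -> 1 transitions, a proxy for clocked switching activity."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)

clock  = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
enable = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0]
gclk = gated_clock(clock, enable)

print("gated clock:", gclk)
print("edges seen by the module:", rising_edges(gclk), "of", rising_edges(clock))
```

In this hypothetical trace the module sees only the clock edges that fall inside its enabled windows, so its flip-flops and clock wiring switch less often.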
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
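The idea can be illustrated with a small energy estimate. This is a sketch under stated assumptions: switching energy per cycle is taken as C·V², level-shifter overhead is ignored, and the module names, capacitances and the 3.3 V / 2.0 V pair are hypothetical.

```python
def assign_voltages(modules, v_high=3.3, v_low=2.0):
    """Give critical-path modules the high supply and all others the low one.

    `modules` is a list of (name, switched_capacitance_farads, on_critical_path).
    Returns a dict mapping module name to its supply voltage.
    """
    return {name: (v_high if critical else v_low) for name, _, critical in modules}

def energy_per_cycle(modules, voltages):
    """Sum of C * V^2 over all modules (switching energy only)."""
    return sum(cap * voltages[name] ** 2 for name, cap, _ in modules)

# Hypothetical design: one critical datapath block, two non-critical blocks.
modules = [("alu", 4e-12, True), ("ctrl", 2e-12, False), ("io", 2e-12, False)]

single = energy_per_cycle(modules, {name: 3.3 for name, _, _ in modules})
multi = energy_per_cycle(modules, assign_voltages(modules))

print(f"single supply: {single:.3e} J, multiple supplies: {multi:.3e} J")
print(f"saving: {100 * (1 - multi / single):.1f}%")
```

The critical block keeps its full voltage, so timing is unaffected in this model, while the saving comes entirely from the non-critical blocks.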
System Level Dynamic Power Management as another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure: power manager. An observer and a controller form the power manager, which takes workload information and observations from the system and issues commands to it.]
[Figure: power state machine. States RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt) and SLEEP (P = 0.16 mW, wait for wake-up event); transition times of about 10 µs, about 90 µs and 160 ms are marked on the arcs.]
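A power manager has to decide when a deeper state pays off. The sketch below computes a break-even idle time from the state powers shown in the power state machine figure; it assumes that the 160 ms transition is the SLEEP-to-RUN wake-up and that the wake-up is spent at RUN power. Both are modelling assumptions, as are the function names.

```python
P_RUN_W = 0.400      # RUN state power, from the figure
P_IDLE_W = 0.050     # IDLE state power, from the figure
P_SLEEP_W = 0.00016  # SLEEP state power, from the figure
T_WAKE_S = 0.160     # assumed SLEEP -> RUN wake-up time

def energy_idle(t_idle_s):
    """Energy spent waiting in IDLE for the whole idle period."""
    return P_IDLE_W * t_idle_s

def energy_sleep(t_idle_s):
    """Energy spent sleeping, with the wake-up charged at RUN power (assumed)."""
    return P_SLEEP_W * (t_idle_s - T_WAKE_S) + P_RUN_W * T_WAKE_S

def break_even_s():
    """Idle time at which IDLE and SLEEP cost the same energy."""
    return T_WAKE_S * (P_RUN_W - P_SLEEP_W) / (P_IDLE_W - P_SLEEP_W)

def choose_state(expected_idle_s):
    """Pick the cheaper state for an expected idle period."""
    return "SLEEP" if expected_idle_s > break_even_s() else "IDLE"

print(f"break-even idle time: {break_even_s():.2f} s")
for t in (0.5, 5.0):
    print(f"idle for {t} s -> {choose_state(t)}")
```

In this simplified model SLEEP only wins for idle periods longer than the break-even time, which is why the observer's workload information matters to the controller.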
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown: clock, memory, control, I/O and datapath. Dynamic instruction statistics: arithmetic, logical, compare, data move, control flow and other operations.]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First generation, 1971-78: behind the power curve
• Second generation, 1979-85: becoming "real" computers
• Third generation, 1985-89: challenging the "establishment"
• Fourth generation, 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation, 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3500 transistors
  - First microprocessor-based computer (Micral)
    · Targeted at laboratory instrumentation
    · Mostly sold in Europe
Intel 8080
• 8-bit architecture with 16-bit addressing
  - Delivered in 1974
  - 4800 transistors
  - Performance < 0.2 MIPS
• Used in the Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    · $297, or $395 with case
    · 256 bytes of memory, expandable to 64K
    · Keyboard and floppy
    · 100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
    · Homebrew Computer Club
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
  - Based on the 8088, the 8-bit bus version of the 8086
Second Generation, 1979-85
• Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture
    · initial 16-bit implementation
  - First flat 32-bit address
    · Support for paging
  - General-purpose register architecture
    · Loosely based on PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation, 1985-89
• Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation, 1990-
• Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium; MIPS R4000, MIPS R10000; DEC Alpha; Sun UltraSPARC; HP PA-RISC; PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Performance increases at 1.6x per year
  - True from 1985 to the present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors: architectural ideas turn transistors into performance
    · Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction-level parallelism
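A yearly factor of 1.6 compounds into very large gains. The following minimal sketch shows the arithmetic; the helper names and the 15-year span are illustrative assumptions rather than figures from the slides.

```python
import math

def cumulative_growth(rate_per_year, years):
    """Total performance multiplier after compounding for `years`."""
    return rate_per_year ** years

def doubling_time_years(rate_per_year):
    """Years needed for performance to double at the given yearly factor."""
    return math.log(2) / math.log(rate_per_year)

growth_15 = cumulative_growth(1.6, 15)
doubling = doubling_time_years(1.6)

print(f"15 years at 1.6x per year: about {growth_15:.0f}x")
print(f"performance doubles roughly every {doubling:.2f} years")
```

At this rate performance doubles in well under two years, so fifteen years of sustained growth amounts to three orders of magnitude.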
Memory Hierarchies
• Caches hide the latency of DRAM and increase bandwidth
  - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches add another level of caching
    · First multilevel cache: 1986
    · Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
    · early restart after cache miss (1992)
    · nonblocking caches: continue during a cache miss (1994)
  - Cache-aware combos: computers, compilers, code writers
    · prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock
    · superscalar: uses dynamic issue decision (HW driven)
    · VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  - determine parallelism dynamically
  - execute instructions out of order
  - speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded a 20-year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  - On-chip
  - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as an attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out of order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in flight)
  - Maintains precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Uses a compiler-centric approach while avoiding its disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques
  - Static scheduling
  - Static issue (i.e. VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
  Characteristics: lower hardware complexity, more longer-range analysis, more machine dependence
• Hardware techniques
  - Dynamic scheduling
  - Dynamic issue (i.e. superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
  Characteristics: more stable performance, higher complexity, potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log axis, 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log axis, 1,000 to 100,000,000) versus year (1970 to 2005), spanning the eras of bit-level, instruction-level and thread-level (?) parallelism; data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000.]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a (single) processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4).
They support 2-4 threads, but expect to get only a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa-2010 Microprocessor
• 4-8 general-purpose processing engines on chip, used to execute
  - independent programs
  - explicitly parallel programs (when possible)
  - speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we will generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit".
In this context, starting from our positions and experiences, this is our view concerning the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn.
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
MRAM CapacityMRAM Capacity
The 10 Gbit devices consists of carbon nanotubes that are 1 nm ndash just a few thaunsand atoms ndash in diameter on a silicon wafer
MRAM is nonvolatile it has the fast read-and-write perfotrmance of static RAM (SRAM)
The 10 Gbit devices consists of 10 Billions carbon nanotubes that are 1 nm ndash just a few thaunsand atoms ndash in diameter on a silicon wafer
MRAM as Universal MemoryMRAM as Universal MemoryMRAM can replace many ofthers types of memory including SRAM DRAM ROM EEPROM Flash EEPROM and feroelectric RAM (FRAM) Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and Capacity of DRAM and FLASH - MRAMFLASH - MRAM
100nm
90nm
55nm
70nm
0512Mb
2Gb
4GbMLC(2bitscell)
SLC
1Gb
4Gb
8Gb
16Gb
MLC(3bitssell)
0
20
40
60
80
100
120
2003 2005 2010
Des
ign
Ru
le [
nm
]
01
1
10
100
Den
sity
[G
b]
FLASH
DRAM
Surpassing the Prediction from Surpassing the Prediction from Moores Low ndash DRAM vs MRAMMoores Low ndash DRAM vs MRAM
The famous Moores Low predicts that the memory density will be doubled in 15 years while the new growth model clearly indicates the doubling of NAND Flash memory density every year
100
4Gb
2Gb
1Gb
0512Mb
12Gb
4Gb
2Gb
1Gb
01
1
10
100
2000 2005 2010
De
ns
ity
[G
b]
2 fold density per year FLASH
DRAM
Moores Law
MLC (3bitscell)
MLC (2bitscell)
Overall memory prediction roadmapOverall memory prediction roadmap
Even though the density growth of DRAM will slow down DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moores Low
Overall memory prediction roadmap -contOverall memory prediction roadmap -cont
1988 Computer Food Chain1988 Computer Food Chain
1997 Computer Food Chain1997 Computer Food Chain
2003 Computer Food Chain2003 Computer Food Chain
M a infra m e
Sup e rc o m p ute r
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline - Low Power DesignOutline - Low Power Design
bullPower trends in VLSIPower trends in VLSIbullView Point on PowerView Point on PowerbullResearch Efforts in Low Power DesignResearch Efforts in Low Power DesignbullIs there an Optimal Design PointIs there an Optimal Design Point
During 1995 energy consumption of all PC machines installed in USA was 60 106 MWh
bullDuring 2000 energy consumption of all PC machines installed in USA was 10 of the total energy production
bullDuring 2015 on except that the energy consumption of all PC machines will be 15 greater then 1995 or 69106 MWh
Power consumption
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
“CMOS circuits dissipate little power by nature. So believed circuit designers.” (Kuroda-Sakurai '95)
“By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced.”
Power dissipation in time
[Figure: power (W), log scale from 0.01 to 100, versus year, 1980-1995; power grows x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage [V] (0-2.5), power per chip [W] (0-200) and VDD current [A] (0-500) versus year, 1998-2014]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), Electronic Industries Association of Japan (EIAJ), Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA).
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: Your car starter]
Power Consumption - New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = P_1 + P_2 + P_3 + P_4 = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4$$
where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
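The terms of the average-power expression can be evaluated separately to see which one dominates. The sketch below is only illustrative: the function name, the dictionary keys and every numeric value are assumptions chosen for the example, not data from the slides.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p_static=0.0):
    """Return the components of average CMOS power in watts.

    p_t      : switching activity factor (probability of a transition per cycle)
    c_load   : load capacitance in farads
    v_dd     : supply voltage in volts
    f_clk    : clock frequency in hertz
    i_sc     : average short-circuit current in amperes
    i_leak   : leakage current in amperes
    p_static : other static dissipation in watts (the minor P4 term)
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk  # capacitive switching power
    p2 = i_sc * v_dd                       # short-circuit power
    p3 = i_leak * v_dd                     # leakage power
    return {
        "P1_switching": p1,
        "P2_short_circuit": p2,
        "P3_leakage": p3,
        "P4_static": p_static,
        "total": p1 + p2 + p3 + p_static,
    }


if __name__ == "__main__":
    # Hypothetical operating point: 1 nF switched load, 2 V supply, 100 MHz clock.
    parts = average_power(p_t=0.5, c_load=1e-9, v_dd=2.0, f_clk=1e8,
                          i_sc=5e-3, i_leak=1e-4)
    for name, watts in parts.items():
        print(f"{name}: {watts:.4f} W")
```

With these made-up numbers the switching term is by far the largest, which matches the slide's remark that P1 is dominant.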
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], 0.6 to 5.0]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is to decrease the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product ptCL is called the average switched capacitance per cycle. It can be reduced at the system, architectural, RTL, circuit or technology level.
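The trade-off in point b) can be made concrete with a small sketch. Switching power is taken as proportional to Vdd squared, as stated above. For delay, the sketch assumes the alpha-power-law model, delay proportional to Vdd / (Vdd - Vt)^alpha; that model, the threshold voltage of 0.7 V and alpha = 1.5 are assumptions introduced for illustration and are not taken from the slides.

```python
def relative_power(v_dd: float, v_ref: float) -> float:
    """Switching power at v_dd relative to v_ref (same capacitance and clock)."""
    return (v_dd / v_ref) ** 2


def relative_delay(v_dd: float, v_ref: float, v_t: float = 0.7, alpha: float = 1.5) -> float:
    """Gate delay at v_dd relative to v_ref under the assumed alpha-power-law model."""
    if v_dd <= v_t or v_ref <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")

    def delay(v: float) -> float:
        return v / (v - v_t) ** alpha

    return delay(v_dd) / delay(v_ref)


if __name__ == "__main__":
    v_ref = 5.0
    for v in (5.0, 3.3, 2.5, 1.5):
        print(f"Vdd={v:.1f} V  power x{relative_power(v, v_ref):.2f}  "
              f"delay x{relative_delay(v, v_ref):.2f}")
```

Halving the supply cuts switching power to a quarter, while the modelled delay grows; how much it grows depends entirely on the assumed Vt and alpha.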
Low Power and Low Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology. Higher levels offer higher impact and more options.
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
A less aggressive approach, which is attracting more attention, is the use of multiple clock frequencies on the chip.
This technique is routinely used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL distributes fCLK to four local clock generators (PLL 1/DLL 1 to PLL 4/DLL 4), which produce the local frequencies f11, f12, f21, f31, f32 and f41]
Energy Minimization Using Multiple Frequency
[Figure, PLL based: phase detector, current pump (up/down), loop filter, VCO and divider by N, with CLKREF and CLKFB inputs; the regulated voltage and the clock distribution & frequency multiplier logic feed the digital system]
[Figure, DLL based: phase detector, current pump (up/down), loop filter and control of a voltage-controlled delay line (VCDL, total delay TVCDL) built from delay cells DC1, DC2, ..., DCn, producing f1, 2f1, ..., nf1 for the digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure, clock gating: a latch samples enable and is combined with clock to produce gated-clock, which drives the target flip-flops (DFF with inputs D, C and output Q)]
[Figure, clock distribution: a PLL (clock generator) distributes Clk to modules A, B, C and D through Enable_A, Enable_B and Enable_C; modules are shown as activated or deactivated]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
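The effect of gating can be sketched with a small cycle-level simulation. The model below assumes a latch-based gate (the latch is transparent while the clock is low, and the gated clock is the AND of the clock and the latched enable); the function names and the sample waveforms are hypothetical.

```python
def gated_clock(clock, enable):
    """Return the gated clock for sampled `clock` and `enable` waveforms (lists of 0/1)."""
    latched = 0
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en          # latch is transparent only while the clock is low
        out.append(clk & latched)  # AND gate: clock passes only when enabled
    return out


def rising_edges(signal):
    """Count 0 -> 1 transitions, assuming the signal starts from 0."""
    previous = [0] + signal[:-1]
    return sum(1 for prev, cur in zip(previous, signal) if prev == 0 and cur == 1)


if __name__ == "__main__":
    clock = [0, 1] * 4
    enable = [1, 1, 0, 0, 0, 0, 1, 1]  # module is idle in the middle cycles
    gated = gated_clock(clock, enable)
    print("clock edges delivered without gating:", rising_edges(clock))
    print("clock edges delivered with gating:   ", rising_edges(gated))
```

Every suppressed edge is a cycle in which the flip-flops of the idle module do not switch, which is where the energy saving comes from.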
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
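A rough sketch of the idea: modules on the critical path keep the high supply, all others get the low supply, and switching energy per cycle is estimated as C·V². The module names, capacitances and voltages are hypothetical, and the sketch ignores the level converters that a real multi-voltage design needs between domains.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    switched_capacitance: float  # farads switched per cycle (illustrative value)
    on_critical_path: bool


def energy_per_cycle(modules, v_high: float, v_low: float) -> float:
    """Switching energy per cycle in joules with a two-level supply assignment."""
    total = 0.0
    for module in modules:
        v = v_high if module.on_critical_path else v_low
        total += module.switched_capacitance * v ** 2
    return total


if __name__ == "__main__":
    design = [
        Module("alu", 1e-12, True),
        Module("control", 1e-12, False),
        Module("io", 2e-12, False),
    ]
    single = energy_per_cycle(design, v_high=2.0, v_low=2.0)  # one supply everywhere
    dual = energy_per_cycle(design, v_high=2.0, v_low=1.0)    # low supply off the critical path
    print(f"single supply: {single:.2e} J, dual supply: {dual:.2e} J")
```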
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure, power state machine: RUN (P = 400 mW), IDLE (P = 50 mW) and SLEEP (P = 0.16 mW). RUN enters IDLE on "wait for interrupt" and SLEEP on "wait for wake-up event"; transition times are about 10 µs between RUN and IDLE, about 90 µs into SLEEP, and 160 ms from SLEEP back to RUN]
[Figure, power manager: an OBSERVER collects workload information (observations) from the SYSTEM and a CONTROLLER issues commands to it]
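A power manager of this kind can be sketched as a policy that picks a state for each idle period. The power figures below are the ones shown in the state machine (400 mW, 50 mW, 0.16 mW); the fixed-timeout policy, the function names and the idle durations are assumptions for illustration, and transition latencies and transition energy are ignored.

```python
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def energy_mj(schedule):
    """Energy in millijoules for a list of (state, seconds) pairs (mW x s = mJ)."""
    return sum(POWER_MW[state] * seconds for state, seconds in schedule)


def timeout_policy(idle_seconds: float, timeout: float):
    """Hypothetical policy: wait in IDLE for `timeout` seconds, then drop to SLEEP."""
    if idle_seconds <= timeout:
        return [("IDLE", idle_seconds)]
    return [("IDLE", timeout), ("SLEEP", idle_seconds - timeout)]


if __name__ == "__main__":
    idle_period = 10.0  # seconds without work (illustrative)
    always_idle = energy_mj([("IDLE", idle_period)])
    with_sleep = energy_mj(timeout_policy(idle_period, timeout=2.0))
    print(f"stay in IDLE: {always_idle:.2f} mJ, timeout then SLEEP: {with_sleep:.2f} mJ")
```

A real policy would also weigh the long wake-up time from SLEEP against the expected length of the idle period.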
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure, power breakdown: pie chart over clock, memory, control, I/O and datapath, with slice values 12, 16, 25 and 43]
[Figure, dynamic instruction statistics: pie chart over compare op, logical op, data move, control flow, arithmetic op and others]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline - Microprocessors' Generations
• First generation, 1971-78
– Behind the power curve
• Second generation, 1979-85
– Becoming “real” computers
• Third generation, 1985-89
– Challenging the “establishment”
• Fourth generation, 1990-
– Architectural and performance leadership
The microprocessor today
When we say “microprocessor” today, we generally mean the shaded area of the figure.
The First Generation 1971-78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
• Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
– 3,500 transistors
– First microprocessor-based computer (Micral)
  Targeted at laboratory instrumentation
  Mostly sold in Europe
Intel 8080
• 8-bit architecture with 16-bit addressing
– Delivered in 1974
– 4,800 transistors
– Performance < 0.2 MIPS
• Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
  $297, or $395 with case
  256 bytes of memory, expandable to 64K
  Keyboard and floppy
  100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
• Homebrew Computer Club
Intel 8086
• Introduced in 1978
– Performance < 0.5 MIPS
• New 16-bit architecture
– “Assembly language” compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
– Based on the 8088, the 8-bit bus version of the 8086
Second Generation 1979-85
Becoming “real” computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
– First 32-bit architecture, initial 16-bit implementation
– First flat 32-bit address
  Support for paging
– General-purpose register architecture
  Loosely based on PDP-11
• First implementation in 1979
– 68,000 transistors
– < 1 MIPS
• Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the “establishment”
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
• Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation 1990-
Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
– True from 1985 to the present
• Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to high clock rates
– More transistors: architectural ideas turn transistors into performance
– Responsible for about half the yearly performance growth
• Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase bandwidth
– CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching
  First multilevel cache: 1986
  Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
  Early restart after cache miss (1992)
  Nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
  Prefetching: instruction to bring data into cache early
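The benefit of a second cache level can be estimated with the standard average memory access time (AMAT) relation, AMAT = hit time + miss rate x miss penalty, applied once per level. The slides do not give this formula; it is added here as an illustrative sketch, and the access times and miss rates in the example are hypothetical.

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time in cycles for one cache level."""
    return hit_time + miss_rate * miss_penalty


def two_level_amat(l1_hit: float, l1_miss_rate: float,
                   l2_hit: float, l2_local_miss_rate: float,
                   memory_time: float) -> float:
    """AMAT with a secondary cache: the L1 miss penalty is the AMAT of the L2 level."""
    l1_miss_penalty = amat(l2_hit, l2_local_miss_rate, memory_time)
    return amat(l1_hit, l1_miss_rate, l1_miss_penalty)


if __name__ == "__main__":
    # Hypothetical machine: 2-cycle L1, 5-cycle L2, 50-cycle main memory.
    without_l2 = amat(hit_time=2, miss_rate=0.05, miss_penalty=50)
    with_l2 = two_level_amat(l1_hit=2, l1_miss_rate=0.05,
                             l2_hit=5, l2_local_miss_rate=0.2, memory_time=50)
    print(f"AMAT without L2: {without_l2:.2f} cycles, with L2: {with_l2:.2f} cycles")
```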
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
  Superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out of order
– speculative execution depending on branch prediction
• “Off-the-shelf” ILP techniques yielded a 20-year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
– On-chip
– Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
• First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
• Implemented in 1991
– 1.3M transistors
– 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
– Instructions scheduled and executed out of order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in flight)
– Maintains precise state by completing instructions in order
• Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
• Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques
– Static scheduling
– Static issue (i.e. VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
• Hardware techniques
– Dynamic scheduling
– Dynamic issue (i.e. superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
Wide variety of approaches, both hardware and compiler intensive
• Software: lower hardware complexity, more longer-range analysis, more machine dependence
• Hardware: more stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present
Big Picture - ILP and Memory Systems
[Figure, ILP Mountain (in order of ascent): simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation]
[Figure, Cache Mountain (in order of ascent): simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (0.1 to 100, log scale) versus year, 1965-1995, for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistor count (1,000 to 100,000,000, log scale) versus year, 1970-2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000; eras marked as bit-level parallelism, instruction-level and thread-level]
Microprocessors: where they go
Intel: More Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4).
They support 2-4 threads, but one can expect only about a 1.3x improvement in throughput.
Chip Multiprocessor
• Several processor cores in one die
• Shared L2 caches
• Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model - cont.
CMP is an attractive option when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
• 4-8 general-purpose processing engines on chip, used to execute
– independent programs
– explicitly parallel programs (when possible)
– speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures

Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say “microprocessor” tomorrow, we generally mean the shaded area of the figure.
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that “Where you stand depends on where you sit”.
In this context, starting from our positions and experiences, this is our view on the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now - some examples
• Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
• Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
• Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
• Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
MRAM as Universal Memory
MRAM can replace many other types of memory, including SRAM, DRAM, ROM, EEPROM, Flash EEPROM and ferroelectric RAM (FRAM). Prediction are crystalline structures that users grow on silicon
Capacity of DRAM and FLASH - MRAM
[Figure: design rules of 100 nm, 90 nm, 70 nm and 55 nm, with capacities from 512 Mb to 16 Gb for SLC and MLC (2 bits/cell) devices]
MLC(3bitssell)
0
20
40
60
80
100
120
2003 2005 2010
Des
ign
Ru
le [
nm
]
01
1
10
100
Den
sity
[G
b]
FLASH
DRAM
Surpassing the Prediction from Surpassing the Prediction from Moores Low ndash DRAM vs MRAMMoores Low ndash DRAM vs MRAM
The famous Moores Low predicts that the memory density will be doubled in 15 years while the new growth model clearly indicates the doubling of NAND Flash memory density every year
100
4Gb
2Gb
1Gb
0512Mb
12Gb
4Gb
2Gb
1Gb
01
1
10
100
2000 2005 2010
De
ns
ity
[G
b]
2 fold density per year FLASH
DRAM
Moores Law
MLC (3bitscell)
MLC (2bitscell)
Overall memory prediction roadmapOverall memory prediction roadmap
Even though the density growth of DRAM will slow down DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moores Low
Overall memory prediction roadmap -contOverall memory prediction roadmap -cont
1988 Computer Food Chain1988 Computer Food Chain
1997 Computer Food Chain1997 Computer Food Chain
2003 Computer Food Chain2003 Computer Food Chain
M a infra m e
Sup e rc o m p ute r
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline - Low Power DesignOutline - Low Power Design
bullPower trends in VLSIPower trends in VLSIbullView Point on PowerView Point on PowerbullResearch Efforts in Low Power DesignResearch Efforts in Low Power DesignbullIs there an Optimal Design PointIs there an Optimal Design Point
During 1995 energy consumption of all PC machines installed in USA was 60 106 MWh
bullDuring 2000 energy consumption of all PC machines installed in USA was 10 of the total energy production
bullDuring 2015 on except that the energy consumption of all PC machines will be 15 greater then 1995 or 69106 MWh
Power consumptionPower consumption
bullbattery operated equipmentsbullmobile communication equipmentsbullwireless communication equipmentsbullinstrumentation bullconsumer electronics bullbiomedical technologies bullindustry bullprocess controls
Typical Low-Power Applications
ldquoCMOS Circuits dissipate little power by nature So believed circuit designersrdquo(Kuroda-Sakurai 95)
ldquoBy the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages even if the supply voltage can be feasibly reducedrdquo
95908580001
01
1
10
100P
ow
er (
W)
x4 3years
Power dissipation in timePower dissipation in time
Gloom and Doom predictionsGloom and Doom predictions
Power density will increasePower density will increase
Year
Vo
ltag
e [V
]
Po
wer
per
ch
ip [
W]
VD
D c
urr
ent
[A]
VDD Power and Current TrendVDD Power and Current Trend
1998 2002 2006 2010 20140
05
1
15
2
25
0 0
200 500
Current
Power
Voltage
International Technology Roadmap for Semiconductors 1999 update sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA) Electronic Industries Association of Japan (EIAJ) Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
( Taken from Sakurairsquos ISSCC 2001 presentation)
Power Delivery Problem (not just Power Delivery Problem (not just California)California)
Your carstarter
Power Consumption New Power Consumption New Dimension in DesignDimension in Design
Sources of Power Sources of Power ConsumptionConsumption
The three major sources of power consumption in digital CMOS circuits are
21 2 3avg t L dd clk sc dd leakage ddP p C V f I V I V P P P
where
P1 ndash capacitive switching power (dynamic - dominant)
P2 ndash short circuit power (dynamic)
P3 ndash leakage current power (static)
P4 ndash static power dissipation (minor)
+ P4
Research Efforts in Low-Power DesignResearch Efforts in Low-Power Design
Psw = pt CL V2
dd fCLK
Reduce Switching ActivitybullConditional clockbullConditional prechargebullSwitching-off inactive blocksbullConditional execution
Run it slowerbullUse parallelismbullLess pipeline stagesbullUse double-edge flip-flop
Technology scalingbullThe highest winbullThresholds should scalebullLeakage starts to bytebullDynamic voltage scaling
Reduce the active loadbullMinimize the circuitsbullUse more efficient designbullCharge recycling bullMore efficient layout
Reducing the Power Reducing the Power DissipationDissipation
The power dissipation can be minimized by reducing
supply voltageload capacitanceswitching activity
ndash Reducing the supply voltage brings a quadratic improvement
ndash Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Amount of Reducing the Power DissipationDissipation
Gate Delay and Power Dissipation Gate Delay and Power Dissipation in Term of Supply Voltagein Term of Supply Voltage
06 30 50
Supply voltage [ V ]
Ga
te d
elay
[n
s]
(no
rma
lize
d)
Po
we
r d
issi
pat
ion
[ W
](n
orm
ali
zed
)
1
1 02 5
1
bull Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
bull The main interest of many researches is now oriented towards lowering the energy dissipation of these systems while still maintaining the high-throughput in real time processing
Needs for Low-Power
Low-Power Design Low-Power Design Techniques Techniques
The basic idea is
Decreasing activity of the some parts within VLSI IC
The term power manager refer to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low active conditions for various parts of the circuit and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-active components
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product pt·CL is called the average switched capacitance per cycle. This capacitance can be reduced at the system, architectural, RTL, circuit or technology level
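The difference between approaches a) and b) can be illustrated with a small sketch based on the switching-power expression P = pt·CL·Vdd²·fCLK. All numeric values below are hypothetical and chosen only for illustration. The sketch suggests that halving fCLK halves the power but leaves the energy per clock cycle unchanged, whereas halving Vdd reduces both to one quarter.

```python
def dynamic_power(p_t, c_load, v_dd, f_clk):
    """Capacitive switching power P = p_t * C_L * Vdd^2 * f_clk, in watts."""
    return p_t * c_load * v_dd ** 2 * f_clk


def energy_per_cycle(p_t, c_load, v_dd):
    """Switching energy spent in one clock cycle, in joules (independent of f_clk)."""
    return p_t * c_load * v_dd ** 2


# Hypothetical operating point (assumed values, not taken from the slides).
P_T = 0.2        # switching activity factor
C_LOAD = 1e-9    # total load capacitance in farads
V_DD = 3.3       # supply voltage in volts
F_CLK = 100e6    # clock frequency in hertz

base_power = dynamic_power(P_T, C_LOAD, V_DD, F_CLK)
half_freq_power = dynamic_power(P_T, C_LOAD, V_DD, F_CLK / 2)
half_vdd_power = dynamic_power(P_T, C_LOAD, V_DD / 2, F_CLK)

base_energy = energy_per_cycle(P_T, C_LOAD, V_DD)
half_vdd_energy = energy_per_cycle(P_T, C_LOAD, V_DD / 2)

print(f"power ratio, f_clk halved: {half_freq_power / base_power:.2f}")
print(f"power ratio, Vdd halved:   {half_vdd_power / base_power:.2f}")
print(f"energy/cycle ratio, Vdd halved: {half_vdd_energy / base_energy:.2f}")
```

The sketch ignores the delay increase that accompanies a lower Vdd, so it shows only the power side of the trade-off.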
Low Power and Low Energy System Design
The design of low power circuits can be tackled at different levels, from system to technology (higher levels offer higher impact and more options):
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
This is a less aggressive approach, which is attracting more attention
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), each producing its own local clock frequencies (f11, f12, ..., f41)]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation (phase detector, current pump, loop filter, VCO, divider by N, CLKREF/CLKFB inputs, regulated voltage, clock distribution & frequency multiplier feeding the digital system) and DLL-based clock generation (phase detector, current pump, loop filter, control, VCDL with delay cells DC1 to DCn, outputs f1, 2f1, ..., nf1)]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating (enable and clock combined through a latch into a gated clock that drives the target flip-flops) and clock distribution (a PLL clock generator feeding blocks A, B, C and D, with Enable_A, Enable_B and Enable_C activating or deactivating blocks)]
• The use of gated clocks is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
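As a rough illustration of why suppressing the clock saves energy, the following sketch counts how many clock cycles actually reach a module when its clock is gated by an enable signal, and estimates the clock-network energy as C·Vdd² per delivered cycle. The function names, the enable trace and the capacitance and voltage values are hypothetical.

```python
def delivered_clock_cycles(enable_trace):
    """Number of clock cycles that reach the module (one boolean per cycle)."""
    return sum(1 for enabled in enable_trace if enabled)


def clock_energy(cycles, c_clock, v_dd):
    """Energy spent charging and discharging the clock load over `cycles` cycles."""
    return cycles * c_clock * v_dd ** 2


# Hypothetical workload: the module is needed in 25 of 100 cycles.
ENABLE_TRACE = [True] * 25 + [False] * 75
C_CLOCK = 5e-12   # assumed clock load of the module in farads
V_DD = 2.5        # assumed supply voltage in volts

ungated = clock_energy(len(ENABLE_TRACE), C_CLOCK, V_DD)
gated = clock_energy(delivered_clock_cycles(ENABLE_TRACE), C_CLOCK, V_DD)
saving = 1.0 - gated / ungated

print(f"clock energy saved by gating: {saving:.0%}")
```

The estimate leaves out the energy of the gating latch and enable logic themselves, which a real design would have to account for.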
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip is a less aggressive approach that is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
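A minimal sketch of the multiple-supply-voltage idea, assuming each module's switching energy per cycle is its switched capacitance times Vdd². The module names, capacitances and the two voltage levels are hypothetical, and the cost of level converters between voltage domains is ignored.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    switched_cap: float   # average switched capacitance per cycle, in farads
    critical: bool        # True if the module lies on a critical path


def energy_per_cycle(modules, v_high, v_low):
    """Critical modules run at v_high, all other modules at v_low."""
    total = 0.0
    for module in modules:
        v_dd = v_high if module.critical else v_low
        total += module.switched_cap * v_dd ** 2
    return total


# Hypothetical design with one critical and two noncritical modules.
MODULES = [
    Module("multiplier", 40e-12, critical=True),
    Module("control", 10e-12, critical=False),
    Module("io_buffer", 30e-12, critical=False),
]

single_supply = energy_per_cycle(MODULES, v_high=3.3, v_low=3.3)
dual_supply = energy_per_cycle(MODULES, v_high=3.3, v_low=2.0)
saving = 1.0 - dual_supply / single_supply

print(f"energy saved with two supply voltages: {saving:.1%}")
```

Only the noncritical modules contribute to the saving, so the benefit depends on how much of the switched capacitance sits off the critical paths.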
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
[Figure: Power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW) and SLEEP (P = 0.16 mW); transitions labelled "Wait for interrupt" and "Wait for wake-up event", with transition times of about 10 µs, about 90 µs and 160 ms]
[Figure: Power manager consisting of an OBSERVER and a CONTROLLER; it receives workload information and observations from the SYSTEM and issues commands to it]
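The power state machine suggests a simple question for the power manager: how long must an idle period be before entering SLEEP pays off compared with staying in IDLE? The sketch below estimates that break-even time. It reads the state powers as 400 mW, 50 mW and 0.16 mW, and assumes about 90 µs to enter SLEEP and 160 ms to wake up; it further assumes (this is not stated in the slides) that the system draws RUN power during both transitions.

```python
# State powers in watts, as read from the power state machine.
P_RUN, P_IDLE, P_SLEEP = 0.400, 0.050, 0.00016

# Transition times in seconds (assumed assignment to edges).
T_ENTER_SLEEP = 90e-6
T_WAKE_UP = 160e-3


def energy_idle(period):
    """Energy spent if the system simply stays in IDLE for `period` seconds."""
    return P_IDLE * period


def energy_sleep(period, p_transition=P_RUN):
    """Energy spent if the system sleeps; transitions are charged at p_transition."""
    t_transition = T_ENTER_SLEEP + T_WAKE_UP
    if period < t_transition:
        raise ValueError("idle period shorter than the transition overhead")
    return p_transition * t_transition + P_SLEEP * (period - t_transition)


def break_even_time(p_transition=P_RUN):
    """Idle period for which sleeping and idling cost the same energy."""
    t_transition = T_ENTER_SLEEP + T_WAKE_UP
    return (p_transition - P_SLEEP) * t_transition / (P_IDLE - P_SLEEP)


t_be = break_even_time()
print(f"break-even idle period: {t_be:.2f} s")
```

Under these assumptions, idle periods shorter than the break-even time are better spent in IDLE, which is why the observer needs workload information to predict how long the system will remain unused.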
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown by Clock, Memory, Control, I/O and Datapath; dynamic instruction statistics by Compare op, Logical op, Others, Data Move, Control Flow and Arithmetic op]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline – Microprocessors’ Generations
• First Generation 1971-78 – Behind the power curve
• Second Generation 1979-85 – Becoming “real” computers
• Third Generation 1985-89 – Challenging the “establishment”
• Fourth Generation 1990- – Architectural and performance leadership
The microprocessor today
When we say “microprocessor” today we generally mean the shaded area of the figure
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
• Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
– 3500 transistors
– First microprocessor-based computer (Micral): targeted at laboratory instrumentation, mostly sold in Europe
Intel 8080
• 8-bit architecture with a 16-bit address space
– Delivered in 1974
– 4800 transistors
– Performance < 0.2 MIPS
• Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
– $297, or $395 with case
– 256 bytes of memory, expandable to 64K
– Keyboard and floppy
– 100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one (Homebrew Computer Club)
Intel 8086
• Introduced in 1978
– Performance < 0.5 MIPS
• New 16-bit architecture
– “Assembly language” compatible with 8080
– 29000 transistors
– Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
– Based on the 8088, the 8-bit bus version of the 8086
Second Generation 1979-85
• Becoming “real” computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs and PCs based on microprocessors
• Transistors > 50000
• Performance <= 1 MIPS
• Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
– First 32-bit architecture (initial 16-bit implementation)
– First flat 32-bit address; support for paging
– General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
– 68000 transistors
– < 1 MIPS
• Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the “establishment”
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
• Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
– 125000 transistors
– 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
– True from 1985 to the present
• Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to higher clock rates
– More transistors: architectural ideas turn transistors into performance
– Responsible for about half the yearly performance growth
• Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction level parallelism
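To get a feeling for what such a growth rate means, the following sketch compounds it over several years and computes the implied doubling time. It reads the growth figure as a factor of 1.6 per year; the function names are illustrative only.

```python
import math


def performance_after(years, yearly_factor=1.6):
    """Relative performance after `years` of compound growth."""
    return yearly_factor ** years


def doubling_time(yearly_factor=1.6):
    """Number of years needed for performance to double."""
    return math.log(2) / math.log(yearly_factor)


for years in (1, 2, 5, 10):
    print(f"after {years:2d} years: {performance_after(years):8.1f}x")
print(f"doubling time: {doubling_time():.2f} years")
```

At this rate performance doubles in roughly a year and a half, which is comparable to the doubling periods usually quoted for transistor counts.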
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
– CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching: first multilevel cache in 1986; secondary cache sizes today from 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies: early restart after cache miss (1992); nonblocking caches, which continue during a cache miss (1994)
– Cache-aware combos (computers, compilers, code writers): prefetching instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock: superscalar uses dynamic issue decision (HW driven), VLIW uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out-of-order
– speculative execution depending on branch prediction
• “Off-the-shelf” ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
– On-chip
– Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
• First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
• Implemented in 1991
– 1.3M transistors
– 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
– Instructions scheduled and executed out-of-order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in-flight)
– Maintains precise state by completing instructions in order
• Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
– Uses a compiler-centric approach while avoiding its disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
• Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today’s Uniprocessor ILP Menu
• Software Techniques
– Static scheduling
– Static issue (i.e. VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
– Characteristics: lower hardware complexity, more longer-range analysis, more machine dependence
• Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e. superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
– Characteristics: more stable performance, higher complexity, potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture – ILP and Memory Systems
[Figure: "ILP Mountain" (simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation) and "Cache Mountain" (simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching)]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (logarithmic scale, 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistor count (logarithmic scale, 1000 to 100000000) versus year (1970 to 2005) for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000, with eras labelled bit-level parallelism, instruction-level parallelism and thread-level parallelism (?)]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel’s processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are:
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
• A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
• It is hard to achieve this level of parallelism from a single program
• Can we run multiple programs (threads) on a single processor without much effort?
• Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today’s processors
• Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
• Support for 2-4 threads, but expect to get only about a 1.3X improvement in throughput
Chip Multiprocessor
• Several processor cores in one die
• Shared L2 caches
• Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
• CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner
• The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
• The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model – continued
• CMP is an attractive option to use when moving to a new process technology such as SoC
• Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
• MPSoCs are usually implemented as heterogeneous systems
• CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
• 4–8 general-purpose processing engines on chip, used to execute
– independent programs
– explicitly parallel programs (when possible)
– speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures

Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say “microprocessor” tomorrow we generally mean the shaded area of the figure
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept

Challenges in Education
It has often been said that “Where you stand depends on where you sit”
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education – not just the undergraduate curriculum – organized to support today’s graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields

Changes in curricula
It is almost a cliché to talk about change – so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
But, as we said earlier, engineering is changing
What kinds of fundamentals we need now – some examples
• Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental
• Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering

Kinds of Fundamentals
• Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental
• Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These, too, are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Capacity of DRAM and FLASH - MRAM
[Figure: design rule [nm] (100 nm, 90 nm, 70 nm, 55 nm) and density [Gb] of DRAM and FLASH (SLC, MLC with 2 bits/cell, MLC with 3 bits/cell) for the years 2003, 2005 and 2010]
Surpassing the Prediction from Moore's Law – DRAM vs MRAM
The famous Moore's Law predicts that memory density doubles every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year
[Figure: density [Gb] (logarithmic scale, 0.1 to 100) versus year (2000 to 2010) for DRAM and FLASH (MLC with 2 bits/cell, MLC with 3 bits/cell), comparing Moore's Law with a two-fold density increase per year]
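The gap between the two growth models widens quickly. The following sketch compares a density that doubles every 1.5 years with one that doubles every year, starting from the same (arbitrary) initial density; the function name and the six-year horizon are illustrative choices.

```python
def density_after(initial_density, years, doubling_period):
    """Density after `years` if it doubles every `doubling_period` years."""
    return initial_density * 2 ** (years / doubling_period)


YEARS = 6
moore = density_after(1.0, YEARS, doubling_period=1.5)
flash = density_after(1.0, YEARS, doubling_period=1.0)

print(f"doubling every 1.5 years: {moore:.0f}x after {YEARS} years")
print(f"doubling every year:      {flash:.0f}x after {YEARS} years")
print(f"ratio between the two models: {flash / moore:.0f}x")
```

Over six years the yearly-doubling model ends up a factor of four ahead of the 1.5-year model, which is the sense in which Flash density growth surpasses the Moore's Law prediction.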
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will still keep on leading the overall memory technology and will be able to reach 8 Gb density in ten years
High-density memory growth will surpass the prediction from Moore's Law
Overall memory prediction roadmap – cont.
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline – Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
“CMOS circuits dissipate little power by nature. So believed circuit designers.” (Kuroda-Sakurai 95)
“By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced.”
Power dissipation in time
[Figure: power (W, logarithmic scale from 0.01 to 100) versus year (1980 to 1995), growing by a factor of 4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage [V] (0 to 2.5), power per chip [W] (0 to 200) and VDD current [A] (0 to 500) versus year (1998 to 2014)]
Source: International Technology Roadmap for Semiconductors 1999 update, sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA), Electronic Industries Association of Japan (EIAJ), Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai’s ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: Your car starter]
Power Consumption – New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are

$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} = P_1 + P_2 + P_3 \; (+ P_4)$$

where
P1 – capacitive switching power (dynamic, dominant)
P2 – short circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
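The three major terms can be evaluated with a short sketch. The operating point below is entirely hypothetical and only meant to show the relative size of the switching, short-circuit and leakage components; the minor static term P4 is left out.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leakage):
    """Return (P1, P2, P3, total) for the three major CMOS power components."""
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short circuit power
    p3 = i_leakage * v_dd                   # leakage current power
    return p1, p2, p3, p1 + p2 + p3


# Hypothetical values (assumptions for illustration only).
p1, p2, p3, total = average_power(
    p_t=0.1,          # switching activity factor
    c_load=2e-9,      # load capacitance in farads
    v_dd=2.5,         # supply voltage in volts
    f_clk=200e6,      # clock frequency in hertz
    i_sc=10e-3,       # average short-circuit current in amperes
    i_leakage=1e-3,   # leakage current in amperes
)

for name, value in (("switching", p1), ("short circuit", p2), ("leakage", p3)):
    print(f"{name:14s}: {value * 1e3:7.2f} mW ({value / total:.1%})")
```

With these assumed numbers the switching term accounts for about 90% of the total, in line with its description as the dominant component.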
Research Efforts in Low-Power Design
$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$

Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution

Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops

Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling

Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Reducing the Power DissipationDissipation
The power dissipation can be minimized by reducing
supply voltageload capacitanceswitching activity
ndash Reducing the supply voltage brings a quadratic improvement
ndash Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Amount of Reducing the Power DissipationDissipation
Gate Delay and Power Dissipation Gate Delay and Power Dissipation in Term of Supply Voltagein Term of Supply Voltage
06 30 50
Supply voltage [ V ]
Ga
te d
elay
[n
s]
(no
rma
lize
d)
Po
we
r d
issi
pat
ion
[ W
](n
orm
ali
zed
)
1
1 02 5
1
bull Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
bull The main interest of many researches is now oriented towards lowering the energy dissipation of these systems while still maintaining the high-throughput in real time processing
Needs for Low-Power
Low-Power Design Low-Power Design Techniques Techniques
The basic idea is
Decreasing activity of the some parts within VLSI IC
The term power manager refer to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low active conditions for various parts of the circuit and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-active components
a) Reduction in fCLK is an option acceptable when
some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way for
power reduction since the power is proportional to the square of Vdd The problem with reducing
Vdd is that it leads to an increase in circuit delay
c) The product ptCL is called the average switched
capacitance per cycle and the main directions for reducing this capacitance are done at system- architectural- RTL- circuit- or technology level
General Approaches to Reduce PowerGeneral Approaches to Reduce Power
Low Power and Low Energy System DesignLow Power and Low Energy System Design
higher impactmore options
AlgorithmLevel
ArchitectureLevel
CircuitLevel
Process DeviceLevel
SystemLevel Design partitioning Power Down
Complexity Concurrency LocalityRegularity Data representation
Voltage scaling ParallelismInstruction set Signal correlations
Transistor sizing Logic optimizationActivity Driven Power Down Low-swing logic Adiabatic switching
Threshold Reduction Multi-threshold
The design of low power circuits can be tackled at different levels from system to technology
Multiple Frequency on the Chip as Multiple Frequency on the Chip as Technique to Reduce PowerTechnique to Reduce Power
Less aggressive approach is which attracts more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
f11 f12
PLL 1 DLL 1
PLL
fCLK
f21 f21
PLL 2 DLL 2
f31 f32
PLL 3 DLL 3
f41 f41
PLL 4 DLL 4
f11 f12
PLL 1 DLL 1
PLL
fCLK
f21 f21
PLL 2 DLL 2
f31 f32
PLL 3 DLL 3
f41 f41
PLL 4 DLL 4
Energy Minimization Using Energy Minimization Using Multiple FrequencyMultiple Frequency
phasedetector
curentpump
divider by N
up
down
CLKREF
CLKFB
VCO
loopfilter
digital system
regulated voltage
clock distribution ampfrequency multiplier
logic
phasedetector
curentpump
up
down
CLKREF
CLKFBloopfilter
digital system
f1 2f1 nf1
control
VCDL
TVCDL
DC1 DC2 DCn
in out
PLL based
DLL based
Clock GatingClock Gating amp Clock Distribution as amp Clock Distribution as Techniques to Reduce PowerTechniques to Reduce Power
DFF
D
C
Q
enable
clock
gated-clock
target flip-flops
latch
clock
enable
gated-clock
activated deactivated
D C
BA
Enable_BEnable_A
Enable_C
PLL(Clk generator)
Clk
Clock distributionClock gating
- The use of gated clock is the most common approach to reduce energy Unused modules are turned off by suppressing the clock to the module
Energy Minimazation Using Energy Minimazation Using Multiple Supply VoltageMultiple Supply Voltage
bull Multiple supply voltage on the chip as less aggressive approach is attracting attention
bull This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
bull This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management System Level Dynamic Power Management as another Techniques to Reduce Poweras another Techniques to Reduce Power
Dynamic power management is design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
RUN
SLEEPIDLE~90s
~10s 160ms
Wait for interrupt Wait for wake-up event
P=400mW
~90s~10s
P=50mW P=016mW
OBSERVER CONTROLLER
Workloadinformation
Power Manager
SYSTEM
Observations Commands
Power Manager Power State Machine
Power Breakdown in High-Performance Power Breakdown in High-Performance CPU and Dynamic Instruction StatisticsCPU and Dynamic Instruction Statistics
12
16
25
43
Clock
Memory
Control IODatapath
1513
51
23
43
Compare op
Logical op
Others
Data Move
Control Flow
Arithmetic op
Power breakdown Dynamic instruction statistics
Architecture Trade-offs ndash Architecture Trade-offs ndash Reference DatapathReference Datapath
Parallel DatapathParallel Datapath
The More Parallel the BetterThe More Parallel the Better
Pipeline DatapathPipeline Datapath
Architecture Summary for a SimpleArchitecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Microprocessors' Generations
• First Generation 1971-78
– Behind the power curve
• Second Generation 1979-85
– Becoming "real" computers
• Third Generation 1985-89
– Challenging the "establishment"
• Fourth Generation 1990-
– Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation 1971-78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
• Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
– 3,500 transistors
– First microprocessor-based computer (Micral)
– Targeted at laboratory instrumentation
– Mostly sold in Europe
Intel 8080
• 8-bit architecture with 16-bit addressing
– Delivered in 1974
– 4,800 transistors
– Performance < 0.2 MIPS
• Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
– $297, or $395 with case
– 256 bytes of memory, expandable to 64K
– Keyboard and floppy
– 100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
• Homebrew Computer Club
Intel 8086
• Introduced in 1978
– Performance < 0.5 MIPS
• New 16-bit architecture
– "Assembly language" compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
– Based on the 8088, the 8-bit bus version of the 8086
Second Generation 1979-85
Becoming "real" computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
– First 32-bit architecture, initial 16-bit implementation
– First flat 32-bit address, support for paging
– General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
– 68,000 transistors
– < 1 MIPS
• Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the "establishment"
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
• Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation 1990-
Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5 – same basic approach, but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
– True from 1985 to the present
• Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to higher clock rates
– More transistors
• Architectural ideas turn transistors into performance
– Responsible for about half the yearly performance growth
• Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide the latency of DRAM and increase bandwidth
– CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching
– First multilevel cache: 1986
– Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
– Early restart after cache miss (1992)
– Nonblocking caches continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
– Prefetching: instruction to bring data into cache early
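How much a second cache level hides DRAM latency can be sketched with the usual average-memory-access-time (AMAT) calculation. The hit times and miss rates below are illustrative assumptions, not measurements from the slides, and the function names are hypothetical.

```python
def amat_one_level(l1_hit, l1_miss_rate, mem_cycles):
    """Average access time in cycles with a single cache in front of DRAM."""
    return l1_hit + l1_miss_rate * mem_cycles


def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_cycles):
    """Average access time in cycles when L1 misses go to L2 before DRAM."""
    l1_miss_penalty = l2_hit + l2_miss_rate * mem_cycles
    return l1_hit + l1_miss_rate * l1_miss_penalty


if __name__ == "__main__":
    # Assumed: 2-cycle L1, 5-cycle L2, 50-cycle DRAM,
    # 5 % L1 miss rate, 20 % local L2 miss rate.
    print(amat_one_level(2, 0.05, 50))
    print(amat_two_level(2, 0.05, 5, 0.20, 50))
```

With these assumed numbers the secondary cache cuts the average access time from 4.5 to 2.75 cycles, which is the sense in which multilevel caches "hide" most of the DRAM latency.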
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
– Superscalar: uses dynamic issue decision (HW driven)
– VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
– Determine parallelism dynamically
– Execute instructions out of order
– Speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded a 20-year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
– On-chip
– Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
• First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
• Implemented in 1991
– 1.3M transistors
– 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
– Instructions scheduled and executed out of order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in flight)
– Maintains precise state by completing instructions in order
• Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
• Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software Techniques
– Static scheduling
– Static issue (i.e. VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
• Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e. superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
Wide variety of approaches, both hardware and compiler intensive
• Software approaches
– Lower hardware complexity
– More longer-range analysis
– More machine dependence
• Hardware approaches
– More stable performance
– Higher complexity
– Potential clock rate impact
No clear-cut winners at present
Big Picture – ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year (1970 to 2005), divided into eras of bit-level parallelism, instruction-level parallelism and thread-level parallelism (?). Data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000.]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4).
Support for 2-4 threads, but expect to get only a 1.3x improvement in throughput.
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model – continued
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view on the question: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and in industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Surpassing the Prediction from Moore's Law – DRAM vs MRAM
The famous Moore's Law predicts that memory density will double every 1.5 years, while the new growth model clearly indicates a doubling of NAND Flash memory density every year.
[Figure: memory density [Gb] (log scale, 0.1 to 100) versus year (2000 to 2010) for DRAM, FLASH with MLC (2 bits/cell) and MLC (3 bits/cell), and the Moore's Law trend line; FLASH follows a 2-fold density increase per year. Data points are labeled from 512 Mb to 4 Gb.]
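The gap between the two growth models can be made concrete with a small calculation. This is a sketch only: the 1 Gb starting density and the six-year horizon are arbitrary assumptions chosen to keep the arithmetic simple, and `density_gb` is a hypothetical helper.

```python
def density_gb(start_gb, years, doubling_period_years):
    """Density after `years` if it doubles every `doubling_period_years`."""
    return start_gb * 2 ** (years / doubling_period_years)


if __name__ == "__main__":
    start_gb, years = 1.0, 6
    moore = density_gb(start_gb, years, 1.5)   # doubling every 1.5 years
    flash = density_gb(start_gb, years, 1.0)   # doubling every year
    print(f"Moore's Law trend after {years} years: {moore:g} Gb")
    print(f"Yearly doubling after {years} years:   {flash:g} Gb")
    print(f"Ratio: {flash / moore:g}x")
```

Over six years, doubling every 1.5 years gives four doublings (16x), while yearly doubling gives six (64x), so the faster model ends up four times ahead.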
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will continue to lead overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore's Law.
Overall memory prediction roadmap – continued
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figure labels: Mainframe, Supercomputer]
Outline – Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60·10^6 MWh.
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production.
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69·10^6 MWh.
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
Gloom and Doom predictions
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai 95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
[Figure: Power dissipation in time. Power (W), log scale from 0.01 to 100, versus year (80 to 95); trend of x4 every 3 years.]
Power density will increase
[Figure: VDD, Power and Current Trend. Voltage [V] (0 to 2.5), power per chip [W] and VDD current [A] (secondary axes up to 200 and 500) versus year (1998 to 2014).]
International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA) and the Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: Your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are

$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$

where
$P_1$ – capacitive switching power (dynamic, dominant)
$P_2$ – short-circuit power (dynamic)
$P_3$ – leakage current power (static)
$P_4$ – static power dissipation (minor)
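The terms of the average-power expression can be evaluated directly. The sketch below is illustrative only: the function name and all numeric values (switching probability, load capacitance, currents) are assumptions picked for round numbers, not data from the slides.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p_static=0.0):
    """Return the power terms in watts.

    p_t      switching probability per clock cycle
    c_load   load capacitance in farads
    v_dd     supply voltage in volts
    f_clk    clock frequency in hertz
    i_sc     average short-circuit current in amperes
    i_leak   leakage current in amperes
    p_static other static dissipation in watts (the minor P4 term)
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leak * v_dd                      # leakage power
    return {"P1": p1, "P2": p2, "P3": p3, "P4": p_static,
            "total": p1 + p2 + p3 + p_static}


if __name__ == "__main__":
    # Assumed example: 1 nF of switched load, 2.5 V supply, 100 MHz clock.
    terms = average_power(p_t=0.2, c_load=1e-9, v_dd=2.5, f_clk=100e6,
                          i_sc=5e-3, i_leak=1e-3)
    for name, watts in terms.items():
        print(f"{name}: {watts * 1e3:.1f} mW")
```

With these assumed values the switching term P1 accounts for 125 mW of the 140 mW total, which illustrates why it is labeled dominant.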
Research Efforts in Low-Power Design
$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], with the voltage axis running from 0.6 V through 3.0 V to 5.0 V.]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is
decreasing the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
General Approaches to Reduce Power
a) Reduction in $f_{CLK}$ is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in $V_{dd}$ is the most effective way to reduce power, since the power is proportional to the square of $V_{dd}$. The problem with reducing $V_{dd}$ is that it leads to an increase in circuit delay.
c) The product $p_t C_L$ is called the average switched capacitance per cycle. It can be reduced at the system, architectural, RTL, circuit or technology level.
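The tension between supply voltage and delay can be sketched numerically. The slides give no delay model, so the code below assumes a simple alpha-power law, delay proportional to Vdd / (Vdd - Vt)^alpha, with hypothetical values Vt = 0.5 V, alpha = 2 and a 2.5 V reference supply. It asks how far the supply can drop if n parallel units each run n times slower, ignoring the extra capacitance of the duplicated hardware.

```python
def gate_delay(v_dd, v_t=0.5, alpha=2.0):
    """Relative gate delay under the assumed alpha-power model."""
    return v_dd / (v_dd - v_t) ** alpha


def voltage_for_slowdown(slowdown, v_ref=2.5, v_t=0.5, alpha=2.0):
    """Lowest supply whose delay is `slowdown` times the delay at v_ref.

    Delay falls monotonically as v_dd rises above v_t, so bisection works.
    """
    target = slowdown * gate_delay(v_ref, v_t, alpha)
    lo, hi = v_t + 1e-9, v_ref
    for _ in range(100):
        mid = (lo + hi) / 2
        if gate_delay(mid, v_t, alpha) > target:
            lo = mid          # too slow at this voltage: go higher
        else:
            hi = mid          # fast enough: try a lower voltage
    return hi


def parallel_power_ratio(n_units, v_ref=2.5):
    """Switching power of n units at f/n and reduced Vdd, relative to one
    unit at full speed. Total switched capacitance per second is unchanged,
    so only the V^2 factor remains (overheads ignored)."""
    v_low = voltage_for_slowdown(n_units, v_ref)
    return (v_low / v_ref) ** 2


if __name__ == "__main__":
    print(f"Vdd for 2x slower gates: {voltage_for_slowdown(2.0):.3f} V")
    print(f"Power ratio with 2 parallel units: {parallel_power_ratio(2):.3f}")
```

Under these assumptions, two parallel units at half the clock rate can run at about 1.65 V instead of 2.5 V and consume roughly 43% of the original switching power at the same throughput. The real saving is smaller once the added area, routing and multiplexing are counted.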
Low Power and Low Energy System Design
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
(Figure annotation: higher impact, more options)
The design of low-power circuits can be tackled at different levels, from system to technology.
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention.
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating …
[Figure: a central PLL driven by fCLK feeds four local clock domains, each with its own PLL and DLL (PLL 1/DLL 1 to PLL 4/DLL 4), producing the local frequencies f11, f12, f21, f31, f32 and f41.]
Energy Minimization Using Multiple Frequency
[Figure: two clock-generation schemes feeding a digital system. PLL based: phase detector, current pump (up/down), loop filter, VCO and divider by N, with CLKREF and CLKFB inputs, a regulated voltage, and clock distribution & frequency multiplier logic. DLL based: phase detector, current pump (up/down), loop filter and control of a voltage-controlled delay line (VCDL, delay TVCDL) with delay cells DC1 to DCn between in and out, producing f1, 2f1, ..., nf1.]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating and clock distribution. Clock gating: a latch samples enable and gates clock to produce gated-clock for the target flip-flops (DFF with D, C, Q), shown in activated and deactivated states. Clock distribution: a PLL (Clk generator) drives Clk to modules A, B, C and D through gates controlled by Enable_A, Enable_B and Enable_C.]
– The use of gated clocks is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
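Latch-based clock gating as described above can be imitated with a small cycle-level model. This is a behavioral sketch, not a circuit description: signals are lists of 0/1 samples, one per half clock period, and the function names are hypothetical.

```python
def gated_clock(clock, enable):
    """Return the gated clock for sampled `clock` and `enable` waveforms.

    The enable is captured by a latch that is transparent while the clock
    is low, then ANDed with the clock. Changes of `enable` while the clock
    is high are therefore ignored, which avoids glitches on the gated clock.
    """
    latched = 0
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en            # latch is transparent in the low phase
        out.append(clk & latched)
    return out


def rising_edges(wave):
    """Count 0 -> 1 transitions, i.e. clock pulses that reach the module."""
    return sum(1 for a, b in zip(wave, wave[1:]) if (a, b) == (0, 1))


if __name__ == "__main__":
    clock  = [0, 1, 0, 1, 0, 1, 0, 1]
    enable = [1, 1, 0, 0, 0, 1, 1, 1]
    gated = gated_clock(clock, enable)
    print(gated)
    print(rising_edges(clock), "pulses generated,",
          rising_edges(gated), "delivered")
```

In this example only two of the four clock pulses reach the module. The clock-tree and flip-flop switching energy of the gated module falls roughly in proportion, at the cost of the extra latch and gate.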
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip is a less aggressive approach that is attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
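The assignment of supply voltages to critical and noncritical modules can be sketched as a simple slack-driven selection. Everything concrete here is an assumption made for illustration: the three supply rails, the alpha-power delay model (Vt = 0.5 V, alpha = 2), the module delays and the function names. Level shifters between voltage domains are ignored.

```python
SUPPLIES_V = (1.2, 1.8, 2.5)    # hypothetical on-chip supply rails
V_T, ALPHA = 0.5, 2.0           # assumed delay-model parameters


def relative_delay(v_dd):
    """Gate delay under the assumed alpha-power model (arbitrary units)."""
    return v_dd / (v_dd - V_T) ** ALPHA


def delay_scale(v_dd, v_high=2.5):
    """How much slower a module runs at v_dd than at the highest supply."""
    return relative_delay(v_dd) / relative_delay(v_high)


def assign_supply(delay_at_high_ns, budget_ns):
    """Pick the lowest supply at which the module still meets its budget."""
    for v in sorted(SUPPLIES_V):
        if delay_at_high_ns * delay_scale(v) <= budget_ns:
            return v
    raise ValueError("module misses timing even at the highest supply")


if __name__ == "__main__":
    # (module name, delay at 2.5 V in ns); all must finish within 10 ns.
    modules = [("critical", 10.0), ("medium", 5.0), ("relaxed", 2.0)]
    for name, delay in modules:
        v = assign_supply(delay, budget_ns=10.0)
        saving = 1 - (v / 2.5) ** 2      # switching energy scales with V^2
        print(f"{name:8s} -> {v} V, switching energy saved {saving:.0%}")
```

With these numbers the critical module stays at 2.5 V, while the modules with timing slack drop to 1.8 V and 1.2 V, saving about 48% and 77% of their switching energy respectively.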
System Level Dynamic Power Management System Level Dynamic Power Management as another Techniques to Reduce Poweras another Techniques to Reduce Power
Dynamic power management is design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
RUN
SLEEPIDLE~90s
~10s 160ms
Wait for interrupt Wait for wake-up event
P=400mW
~90s~10s
P=50mW P=016mW
OBSERVER CONTROLLER
Workloadinformation
Power Manager
SYSTEM
Observations Commands
Power Manager Power State Machine
Power Breakdown in High-Performance Power Breakdown in High-Performance CPU and Dynamic Instruction StatisticsCPU and Dynamic Instruction Statistics
12
16
25
43
Clock
Memory
Control IODatapath
1513
51
23
43
Compare op
Logical op
Others
Data Move
Control Flow
Arithmetic op
Power breakdown Dynamic instruction statistics
Architecture Trade-offs ndash Architecture Trade-offs ndash Reference DatapathReference Datapath
Parallel DatapathParallel Datapath
The More Parallel the BetterThe More Parallel the Better
Pipeline DatapathPipeline Datapath
Architecture Summary for a SimpleArchitecture Summary for a Simple
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Microprocessorsrsquo Microprocessorsrsquo GenerationsGenerations
bullFirst generation 1971-78
bullSecond Generation 1979-85
bullThird Generation 1985-89
bullFourth Generation 1990-
ndashBehind the power curve
ndashBecoming ldquorealrdquo computers
ndashChallenging the ldquoestablishmentrdquo
ndashArchitectural and performance leadership
The microprocessor todayThe microprocessor today When we say ldquomicroprocessorrdquo today we generally mean the shaded area of the figure
The First Generation 1971-78The First Generation 1971-78
Getting enough bits and transistors Transistor counts lt 50000 Performance lt 05 MIPS Architecture 8-16 bits
ndash Narrow datapaths (= slow performance)ndash Awkward architecturesndash Assembly language + some BASIC
Processorsndash Intel 4004 8008 8080 8086ndash Zilog Z-80ndash Motorola 6800 6502
Intel 4004Intel 4004 First general-purpose
single-chip microprocessor Shipped in 1971 8-bit architecture 4-bit
implementation 2300 transistors Performance lt 01 MIPS 8008 8-bit implementation
in 1972ndash 3500 transistorsndash First microprocessor-based
computer (Micral) Targeted at laboratory
instrumentation Mostly sold in Europe
Intel 8080Intel 8080 Intelrsquos first 16-bit architecture
ndash Delivered in 1974ndash 4800 transistorsndash Performance lt 02 MIPS
Used in Altair 8800 system ndash Kit form (advertised in Popular
Electronics) in 1975 $297 or $395 with case 256 bytes of memory
expandable to 64K Keyboard and floppy 100-line bus becomes S-100
first microcomputer busndash Gates amp Allen write BASICndash Wozniak builds one
Homebrew Computer Club
Intel 8086Intel 8086
Introduced in 1978ndash Performance lt 05 MIPS
New 16-bit architecturendash ldquoAssembly languagerdquo
compatible with 8080ndash 29000 transistorsndash Includes memory protection
support for FP coprocessor In 1981 IBM introduces
PC ndash Based on 8088--8-bit bus
version of 8086
Second Generation 1979-85Second Generation 1979-85 Becoming ldquorealrdquo computers
ndash First 32-bit architecture (68000)ndash First virtual memory supportndash Workstations Macs and PCs based on microprocessors
Transistors gt50000 Performance lt= 1 MIPS Processors
ndash Motorola 68000 68020ndash Intel 80286 80386
Motorola 68000Motorola 68000 Major architectural step in
microprocessorsndash First 32-bit architecture
initial 16-bit implementation
ndash First flat 32-bit address Support for paging
ndash General-purpose register architecture
Loosely based on PDP-11
First implementation in 1979ndash 68000 transistorsndash lt 1 MIPS
Used inndash Apple Macndash Sun Silicon Graphics amp Apollo
workstations
Third Generation 1985-89Third Generation 1985-89 Challenging the ldquoestablishmentrdquo
ndash Microprocessors surpass minicomputers in performance rival mainframes
ndash Implementation technology of choice all new architectures are microprocessors
ndash RISC architecture techniques take hold Transistors lt 500K Performance gt 5 MIPS Processors
ndash MIPS R2000 R3000ndash Sun SPARCndash HP PA-RISC
MIPS R2000MIPS R2000 Several firsts
ndash First RISC microprocessor
ndash First microprocessor to provide integrated support for instruction amp data cache
ndash First pipelined microprocessor (sustains 1 instructionclock)
Implemented in 1985ndash 125000 transistorsndash 5-8 MIPS
Fourth Generation 1990-Fourth Generation 1990- Architectural and performance leadership
ndash First 64-bit architecturendash First multiple-issue machinendash First multilevel caches
Transistors gt1M Clock ratesgt 100MHz Performance gt 50 MIPS Processors
ndash Intel i860 Pentium MIPS R4000 MIPS R1000 DEC Alpha Sun UltraSPARC HP PA-RISC PowerPC
Generation 45 ndash same basic approach but faster clock rates amp wider issuendash Alpha 21264 Pentium III amp 4 Intel Itanium
Key Architectural TrendsKey Architectural Trends Increase performance at 16x per year
ndash True from 1985-present Combination of technology and architectural
enhancementsndash Technology provides faster transistors and more of themndash Faster transistors leads to high clock ratesndash More transistors
Architectural ideas turn transistors into performancendash Responsible for about half the yearly performance growth
Two key architectural directionsndash Sophisticated memory hierarchiesndash Exploiting instruction level parallelism
Memory HierarchiesMemory Hierarchies Caches hide latency of DRAM and increase BW
ndash CPU-DRAM access gap has grown by a factor of 30-50 Trend 1 Increasingly large caches
ndash On-chip from 128 bytes (1984) to 100K+ bytesndash Multilevel caches add another level of caching
First multilevel cache1986 Secondary cache sizes today 128KB to 4-16 MB
Trend 2 Advances in caching techniquesndash Reduce or hide cache miss latencies
early restart after cache miss (1992) nonblocking caches continue during a cache miss (1994)
ndash Cache aware combos computers compilers code writers prefetching instruction to bring data into cache early
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
  Superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
– Determine parallelism dynamically
– Execute instructions out of order
– Speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
First 64-bit architecture
Integrated caches
– On-chip
– Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
Implemented in 1991
– 1.3M transistors
– 50 MIPS
Used primarily as attached processor (e.g., graphics)
MIPS R10000
First speculative processor
– Instructions scheduled and executed out of order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in flight)
– Maintains precise state by completing instructions in order
Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
EPIC architecture
– Uses a compiler-centric approach while avoiding its disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software techniques
– Static scheduling
– Static issue (i.e., VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
– Trade-offs: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware techniques
– Dynamic scheduling
– Dynamic issue (i.e., superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
– Trade-offs: more stable performance, higher complexity, potential clock rate impact
Wide variety of approaches, both hardware and compiler intensive
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: "ILP Mountain" with stages simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation; "Cache Mountain" with stages simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year (1965-1995) for supercomputers, mainframes, minicomputers, and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year (1970-2005), spanning bit-level, instruction-level, and thread-level parallelism; data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to achieve higher performance (throughput) at better energy efficiency.
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyper-Threading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
They support 2-4 threads, but expect to get only a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8 x 4096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view on the question: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now: some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current, cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Overall memory prediction roadmap
Even though the density growth of DRAM will slow down, DRAM will continue to lead overall memory technology and will be able to reach 8 Gb density in ten years.
High-density memory growth will surpass the prediction from Moore's Law.
Overall memory prediction roadmap (cont.)
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figures: computer food chain diagrams; recoverable labels: Mainframe, Supercomputer]
Outline: Low Power Design
• Power trends in VLSI
• Viewpoint on power
• Research efforts in low-power design
• Is there an optimal design point?
• In 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh.
• In 2000, the energy consumption of all PCs installed in the USA was 10% of total energy production.
• By 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh.
Power consumption
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai '95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: power (W), log scale, versus year (1980-1995); trend of x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage (V), power per chip (W), and VDD current (A) versus year, 1998-2014]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), Electronic Industries Association of Japan (EIAJ), Korea Semiconductor Industry Association (KSIA), and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
Your car starter
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 – capacitive switching power (dynamic, dominant)
P2 – short-circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
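The power expression for digital CMOS circuits can be evaluated term by term. The sketch below is illustrative only: the function name, the dictionary keys, and all numeric values in the usage example are assumptions chosen for easy arithmetic, not data from the slides.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p_static=0.0):
    """Evaluate the CMOS power terms (all quantities in SI units).

    p_t      -- switching activity factor (dimensionless)
    c_load   -- load capacitance C_L in farads
    v_dd     -- supply voltage in volts
    f_clk    -- clock frequency in hertz
    i_sc     -- average short-circuit current in amperes
    i_leak   -- leakage current in amperes
    p_static -- remaining static dissipation P4 in watts
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk  # capacitive switching power
    p2 = i_sc * v_dd                       # short-circuit power
    p3 = i_leak * v_dd                     # leakage power
    return {
        "P1_switching": p1,
        "P2_short_circuit": p2,
        "P3_leakage": p3,
        "P4_static": p_static,
        "P_avg": p1 + p2 + p3 + p_static,
    }


if __name__ == "__main__":
    # Hypothetical operating point, for illustration only.
    result = average_power(p_t=0.2, c_load=1e-9, v_dd=2.5, f_clk=100e6,
                           i_sc=1e-3, i_leak=1e-6)
    for name, watts in result.items():
        print(f"{name:18s} {watts:.6f} W")
```

With these assumed values the switching term dominates the total, which matches the slide's remark that P1 is the dominant component.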
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement
– Reducing the load capacitance improves both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V)]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is to decrease the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product pt·CL is called the average switched capacitance per cycle. It can be reduced at the system, architectural, RTL, circuit, or technology level.
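The trade-off in item b) can be sketched numerically. Switching power scales with the square of Vdd, as stated above. For the delay, the sketch assumes the alpha-power-law model, delay proportional to Vdd / (Vdd - Vt)^alpha, which is not given on the slides. The threshold voltage of 0.7 V and alpha = 2 are illustrative values.

```python
def relative_switching_power(v_dd: float, v_ref: float) -> float:
    """Switching power at v_dd relative to v_ref (same p_t, C_L, f_CLK)."""
    return (v_dd / v_ref) ** 2


def relative_gate_delay(v_dd: float, v_ref: float,
                        v_t: float = 0.7, alpha: float = 2.0) -> float:
    """Gate delay at v_dd relative to v_ref.

    Assumed model (not from the slides): delay ~ V / (V - Vt)**alpha.
    """
    if v_dd <= v_t or v_ref <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")

    def delay(v: float) -> float:
        return v / (v - v_t) ** alpha

    return delay(v_dd) / delay(v_ref)


if __name__ == "__main__":
    v_ref = 5.0
    for v in (5.0, 3.3, 2.5, 1.5):
        print(f"Vdd={v:3.1f} V  power x{relative_switching_power(v, v_ref):.2f}"
              f"  delay x{relative_gate_delay(v, v_ref):.2f}")
```

Under these assumptions, halving the supply from 5.0 V to 2.5 V cuts switching power to one quarter while the gate delay grows by well over a factor of two. This is why voltage reduction is usually paired with parallelism or pipelining to recover throughput.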
General Approaches to Reduce Power
Low Power and Low Energy System Design
(Higher levels have higher impact and offer more options)
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology.
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple clock frequencies on the chip is a less aggressive approach that is attracting more attention.
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL generating fCLK feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), each supplying local clock frequencies to its region of the chip]
Energy Minimization Using Multiple Frequency
[Figure, PLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO, and divider by N in the feedback path; the regulated voltage and the clock distribution & frequency multiplier drive the digital system logic]
[Figure, DLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, and a control block driving a voltage-controlled delay line (VCDL) with delay cells DC1 to DCn; outputs f1, 2f1, ..., nf1 feed the digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure, clock gating: the enable signal is captured by a latch and combined with the clock to produce a gated clock that drives the target flip-flops (DFF with inputs D and C and output Q)]
[Figure, clock distribution: a PLL (clock generator) distributes Clk to modules A, B, C, and D; signals Enable_A, Enable_B, and Enable_C activate or deactivate individual modules]
The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
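A small behavioral model shows how latch-based clock gating suppresses clock activity. This is a sketch under stated assumptions: the enable is captured by a latch that is transparent while the clock is low, and the gated clock is the AND of the clock and the latched enable. The function names and the sample waveforms are hypothetical.

```python
def gated_clock(clock, enable):
    """Return the gated-clock waveform for sampled clock/enable waveforms.

    The enable is held by a latch that is transparent only while the clock
    is low, so enable changes during the high phase cannot cause glitches.
    """
    latched = 0
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en           # latch is transparent in the low phase
        out.append(clk & latched)  # AND gate produces the gated clock
    return out


def rising_edges(signal):
    """Count 0 -> 1 transitions, i.e. the clock events a module sees."""
    return sum(1 for prev, cur in zip(signal, signal[1:])
               if prev == 0 and cur == 1)


if __name__ == "__main__":
    clock = [0, 1, 0, 1, 0, 1, 0, 1]
    enable = [1, 1, 0, 0, 0, 1, 1, 1]  # module idle in the middle cycles
    gclk = gated_clock(clock, enable)
    print("gated clock:", gclk)
    print("edges without gating:", rising_edges(clock))
    print("edges with gating:   ", rising_edges(gclk))
```

In this example the module receives two clock edges instead of four. The flip-flops behind the gate do not toggle during the suppressed cycles, which is where the energy saving comes from.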
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip is a less aggressive approach that is attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
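The assignment idea behind multiple supply voltages can be sketched as follows: each module gets the lowest supply voltage that still meets its timing budget. The voltage levels, the delay multipliers, and the function names below are hypothetical and serve only to illustrate the selection rule. Real flows also have to account for level shifters between voltage domains.

```python
# Hypothetical supply options: voltage (V) -> delay multiplier relative to
# the delay at the highest voltage. Illustrative values only.
VOLTAGE_OPTIONS = {3.3: 1.0, 2.5: 1.5, 1.8: 2.4}


def assign_supply(delay_at_high_ns, budget_ns, options=VOLTAGE_OPTIONS):
    """Pick the lowest supply voltage whose scaled delay fits the budget."""
    feasible = [v for v, scale in options.items()
                if delay_at_high_ns * scale <= budget_ns]
    if not feasible:
        raise ValueError("module misses timing even at the highest voltage")
    return min(feasible)


def relative_energy(v, v_high):
    """Switching energy at v relative to v_high (energy ~ V**2)."""
    return (v / v_high) ** 2


if __name__ == "__main__":
    budget_ns = 10.0
    modules = {"critical_path": 10.0, "side_logic": 6.0, "slow_control": 4.0}
    for name, delay in modules.items():
        v = assign_supply(delay, budget_ns)
        print(f"{name:14s} -> {v} V, energy x{relative_energy(v, 3.3):.2f}")
```

The module on the critical path keeps the highest voltage, while the modules with timing slack drop to lower voltages and save energy quadratically.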
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure, power state machine: states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt), and SLEEP (P = 0.16 mW, wait for wake-up event); transitions labeled ~10 µs, ~90 µs, and 160 ms]
[Figure, power manager: an observer and a controller; the power manager receives workload information and observations from the system and issues commands to it]
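A power manager has to decide when entering a deeper low-power state pays off. The sketch below uses the state powers from the power state machine (400 mW, 50 mW, 0.16 mW) and makes two labeled assumptions: the 160 ms transition is the wake-up from SLEEP, and the system draws RUN power during that wake-up. The function names and the simple break-even policy are illustrative, not the slides' method.

```python
P_RUN_MW = 400.0    # RUN state power, from the power state machine
P_IDLE_MW = 50.0    # IDLE state power
P_SLEEP_MW = 0.16   # SLEEP state power
T_WAKE_S = 0.160    # assumed: the 160 ms transition is SLEEP -> RUN


def break_even_time_s():
    """Idle period above which SLEEP costs less energy than IDLE.

    Energy in IDLE for T seconds:  P_idle * T
    Energy via SLEEP (assumed):    P_sleep * (T - t_wake) + P_run * t_wake
    Setting the two equal and solving for T gives the break-even time.
    """
    return T_WAKE_S * (P_RUN_MW - P_SLEEP_MW) / (P_IDLE_MW - P_SLEEP_MW)


def choose_state(expected_idle_s):
    """Select the low-power state for a predicted idle period."""
    return "SLEEP" if expected_idle_s > break_even_time_s() else "IDLE"


if __name__ == "__main__":
    print(f"break-even idle time: {break_even_time_s():.3f} s")
    for idle in (0.1, 1.0, 2.0, 10.0):
        print(f"idle {idle:5.1f} s -> {choose_state(idle)}")
```

Under these assumptions the break-even point is a little over one second, so short idle periods are better served by IDLE, while long ones justify the expensive wake-up from SLEEP.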
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure, power breakdown: pie chart with slices for clock, memory, control, I/O, and datapath]
[Figure, dynamic instruction statistics: pie chart with slices for arithmetic ops, logical ops, compare ops, data move, control flow, and others]
Architecture Trade-offs: Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline: Microprocessors' Generations
• First Generation 1971-78
  – Behind the power curve
• Second Generation 1979-85
  – Becoming "real" computers
• Third Generation 1985-89
  – Challenging the "establishment"
• Fourth Generation 1990-
  – Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation 1971-78
Getting enough bits and transistors
Transistor counts < 50,000
Performance < 0.5 MIPS
Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
First general-purpose single-chip microprocessor
Shipped in 1971
8-bit architecture, 4-bit implementation
2300 transistors
Performance < 0.1 MIPS
8008: 8-bit implementation in 1972
– 3500 transistors
– First microprocessor-based computer (Micral)
  Targeted at laboratory instrumentation
  Mostly sold in Europe
Intel 8080
8-bit architecture with 16-bit addressing
– Delivered in 1974
– 4800 transistors
– Performance < 0.2 MIPS
Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
  $297, or $395 with case
  256 bytes of memory, expandable to 64K
  Keyboard and floppy
  100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
Homebrew Computer Club
Intel 8086
Introduced in 1978
– Performance < 0.5 MIPS
New 16-bit architecture
– "Assembly language" compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
In 1981, IBM introduces the PC
– Based on the 8088, the 8-bit bus version of the 8086
Second Generation 1979-85
Becoming "real" computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs, and PCs based on microprocessors
Transistors > 50,000
Performance <= 1 MIPS
Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
Major architectural step in microprocessors
– First 32-bit architecture, initial 16-bit implementation
– First flat 32-bit address
  Support for paging
– General-purpose register architecture
  Loosely based on PDP-11
First implementation in 1979
– 68,000 transistors
– < 1 MIPS
Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the "establishment"
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
Transistors < 500K
Performance > 5 MIPS
Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000MIPS R2000 Several firsts
ndash First RISC microprocessor
ndash First microprocessor to provide integrated support for instruction amp data cache
ndash First pipelined microprocessor (sustains 1 instructionclock)
Implemented in 1985ndash 125000 transistorsndash 5-8 MIPS
Fourth Generation 1990-Fourth Generation 1990- Architectural and performance leadership
ndash First 64-bit architecturendash First multiple-issue machinendash First multilevel caches
Transistors gt1M Clock ratesgt 100MHz Performance gt 50 MIPS Processors
ndash Intel i860 Pentium MIPS R4000 MIPS R1000 DEC Alpha Sun UltraSPARC HP PA-RISC PowerPC
Generation 45 ndash same basic approach but faster clock rates amp wider issuendash Alpha 21264 Pentium III amp 4 Intel Itanium
Key Architectural TrendsKey Architectural Trends Increase performance at 16x per year
ndash True from 1985-present Combination of technology and architectural
enhancementsndash Technology provides faster transistors and more of themndash Faster transistors leads to high clock ratesndash More transistors
Architectural ideas turn transistors into performancendash Responsible for about half the yearly performance growth
Two key architectural directionsndash Sophisticated memory hierarchiesndash Exploiting instruction level parallelism
Memory HierarchiesMemory Hierarchies Caches hide latency of DRAM and increase BW
ndash CPU-DRAM access gap has grown by a factor of 30-50 Trend 1 Increasingly large caches
ndash On-chip from 128 bytes (1984) to 100K+ bytesndash Multilevel caches add another level of caching
First multilevel cache1986 Secondary cache sizes today 128KB to 4-16 MB
Trend 2 Advances in caching techniquesndash Reduce or hide cache miss latencies
early restart after cache miss (1992) nonblocking caches continue during a cache miss (1994)
ndash Cache aware combos computers compilers code writers prefetching instruction to bring data into cache early
Exploiting ILPExploiting ILP ILP is the implicit parallelism among instructions Exploited by
ndash Overlapping execution in a pipeline ndash Issuing multiple instruction per clock
superscalar uses dynamic issue decision (HW driven) VLIW uses static issue decision (SW driven)
1985 simple microprocessor pipeline (1 instrclock) 1990 first static multiple issue microprocessors 1995 sophisticated dynamic schemes
ndash determine parallelism dynamicallyndash execute instructions out-of-orderndash speculative execution depending on branch prediction
ldquoOff-the-shelfrdquo ILP techniques yielded 20 year path
MIPS R4000MIPS R4000 First 64-bit architecture Integrated caches
ndash On-chipndash Support for off-chip secondary
cache Integrated floating point Implemented in 1991
ndash Deep pipelinendash 14M transistorsndash Initially 100MHzndash gt 50 MIPS
Intel i860Intel i860
First multiple issue microprocessor ndash 2 instructionsclockndash Dual issue mode ndash Novel push pipelinendash Novel cache bypass
Implemented in 1991ndash 13M transistorsndash 50 mips
Used primarily as attached processor (eg graphics)
MIPS R10000MIPS R10000
First speculative processorndash Instruction scheduled and
executed out-of-orderndash Up to 4 instructions can
complete per clockndash Window of 32 instructions
(up to 32 in-flight)ndash Maintain precise state by
completing instructions in order
Implemented in 1996ndash 68M transistorsndash 200 MHz
Intel IA-64 and ItaniumIntel IA-64 and Itanium
EPIC architecturendash Use compiler centric approach
while avoiding disadvantagesndash Parallelism demarcated by the
compilerndash Many special instruction amp
features for exploiting ILP in the compiler
Itaniumndash First implementation (2001) ndash 25 M transistorsndash 800 MHzndash 130 Watts
Breakdown of tasks between Breakdown of tasks between compiler and runtime hardwarecompiler and runtime hardware
Todayrsquos Uniprocessor ILP Todayrsquos Uniprocessor ILP MenuMenu
Software Techniquesndash Static schedulingndash Static issue (ie VLIW)ndash Static branch predictionndash Aliaspointer analysisndash Static speculation
Hardware Techniquesndash Dynamic schedulingndash Dynamic issue (ie superscalar)ndash Dynamic branch predictionndash Dynamic disambiguationndash Dynamic speculation
Wide variety of approaches both hardware and compiler intensive
Lower hardware complexity
More longer range analysis
More machine dependence
Lower hardware complexity
More longer range analysis
More machine dependence
More stable performance
Higher complexity
Potential clock rate impact
More stable performance
Higher complexity
Potential clock rate impact
No clear cut winners at the present
Big Picture--ILP and Memory Big Picture--ILP and Memory SystemsSystems
Simplepipelining
Scheduledpipelines
Multipleissue
Dynamicschedulin
g
Speculation
My view
bullNo performance wall but steeper slopes ahead
bullEasier territory is behind us
bullIndustry-research gap vanished
bullEnergy efficiency may be key limit
ILPMountai
n
Multilevelcaches amp buffers
Critical word amp early restart
Compilerprefetchin
g
Multipathprefetchin
g
Simplecaches
CacheMountai
n
Per
form
ance
01
1
10
100
1965 1970 1975 1980 1985 1990 1995
Supercomputers
Minicomputers
Mainframes
Microprocessors
Microprocessors today where Microprocessors today where they are and what can dothey are and what can do
Tran
sist
ors
1000
10000
100000
1000000
10000000
100000000
1970 1975 1980 1985 1990 1995 2000 2005
Bit-level parallelism Instruction-level Thread-level ()
i4004
i8008i8080
i8086
i80286
i80386
R2000
Pentium
R10000
R3000
Microprocessors where they goMicroprocessors where they go
Intel more TransistorIntel more Transistor
Intel Faster DevicesIntel Faster Devices
Number of Transistors in Number of Transistors in Intelrsquos processorsIntelrsquos processors
Higher level parallelismHigher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The more prononuced are
a) simultaneons multithreaded (SMT) processor and
b) chip multiprocessors (CMT)
MultithreadingMultithreading
Microprocessor can execute multiple operations at a time4 or 6 operations per cycle
Hard to achieve this level of parallelism from single program
Can we run multiple programs (threads) on (single) processor without much effort
Simultaneous multithreading (SMT) or Hyperthreading is a solution
Parallel Thread Sequencing ModelParallel Thread Sequencing Model
Principles of SMTPrinciples of SMT
Multithreading in todayrsquos Multithreading in todayrsquos processorsprocessors
Today many high-end microprocessors are multithreaded (eg Intel Pentium 4)
Support for 2-4 threads but expect to get only 13X improvement in throughput
Chip MultiprocessorChip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa-2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute:
- independent programs
- explicitly parallel programs (when possible)
- speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view on the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and in industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
- Web-based teaching,
- distance learning,
- electronic books, and
- interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Overall memory prediction roadmap (continued)
1988 Computer Food Chain
1997 Computer Food Chain
2003 Computer Food Chain
[Figures: computer food chain diagrams; surviving labels: Mainframe, Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
• During 1995 the energy consumption of all PCs installed in the USA was 60·10^6 MWh.
• During 2000 the energy consumption of all PCs installed in the USA was 10% of the total energy production.
• For 2015 one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69·10^6 MWh.
Power consumption
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai '95)
"By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: power (W), log scale from 0.01 to 100, versus year (1980-95); trend: x4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage (V, 0 to 2.5), power per chip (W) and VDD current (A) versus year, 1998-2014]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA) and the Taiwan Semiconductor Industry Association (TSIA). (Taken from Sakurai's ISSCC 2001 presentation.)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits, together with a minor fourth term, are

$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$

where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
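The four terms can be evaluated numerically to see why P1 is called dominant. The following is a minimal sketch, not a calibrated model: the function name and every operating-point value (activity factor, load capacitance, currents) are hypothetical and chosen only for illustration.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p_static=0.0):
    """Return (P1, P2, P3, P4, total) in watts for the CMOS power model above."""
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leak * v_dd                      # leakage current power
    p4 = p_static                           # other static dissipation
    return p1, p2, p3, p4, p1 + p2 + p3 + p4


# Hypothetical operating point: 1 nF total load, 20% activity, 2.5 V, 100 MHz
p1, p2, p3, p4, total = average_power(
    p_t=0.2, c_load=1e-9, v_dd=2.5, f_clk=100e6, i_sc=2e-3, i_leak=1e-4
)
print(f"P1={p1:.4f} W  P2={p2:.4f} W  P3={p3:.5f} W  total={total:.5f} W")
print(f"switching share = {p1 / total:.1%}")
```

With these assumed numbers the switching term accounts for most of the total, which is the situation the slide describes; a different technology point could shift the balance toward leakage.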
Research Efforts in Low-Power Design

$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$

Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
- the supply voltage
- the load capacitance
- the switching activity
Reducing the supply voltage brings a quadratic improvement.
Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed.
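The quadratic versus linear effect can be checked with a short calculation. This is a sketch that assumes a fixed clock frequency and uses only the switching-power term; the helper name and the 5 V to 3.3 V example are illustrative assumptions.

```python
def switching_power_ratio(v_new, v_old, c_new=1.0, c_old=1.0, a_new=1.0, a_old=1.0):
    """Ratio of new to old switching power (activity * C * V^2) at a fixed clock frequency."""
    return (a_new / a_old) * (c_new / c_old) * (v_new / v_old) ** 2


# Lowering the supply from 5 V to 3.3 V (a 34% reduction) ...
ratio_voltage = switching_power_ratio(3.3, 5.0)
# ... compared with lowering the load capacitance by the same 34%
ratio_capacitance = switching_power_ratio(5.0, 5.0, c_new=0.66, c_old=1.0)

print(f"voltage scaling:     {ratio_voltage:.4f} of the original power")
print(f"capacitance scaling: {ratio_capacitance:.4f} of the original power")
```

The same relative reduction removes more than half of the switching power when applied to the supply voltage, but only about a third when applied to the capacitance or the switching activity.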
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V)]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is decreasing the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
General Approaches to Reduce Power
a) Reduction in fCLK is an option that is acceptable when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product pt·CL is called the average switched capacitance per cycle. This capacitance can be reduced at the system, architectural, RTL, circuit or technology level.
Low Power and Low Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology. Higher levels offer higher impact and more options.
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as a Technique to Reduce Power
Using multiple clock frequencies on the chip is a less aggressive approach that is attracting more attention.
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating ...
[Figure: a central PLL distributes fCLK to four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), which generate the local clocks f11, f12, f21, f31, f32 and f41]
Energy Minimization Using Multiple Frequency
[Figure, PLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO and divider by N; the regulated voltage and the clock distribution & frequency multiplier feed the logic of the digital system]
[Figure, DLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter and a voltage-controlled delay line (VCDL, delay TVCDL, stages DC1 to DCn) under a control block, producing f1, 2f1, ..., nf1 for the digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure, clock gating: a latch and a gate combine enable and clock into a gated clock that drives the target flip-flops (DFF with inputs D and C and output Q); the waveforms show activated and deactivated intervals]
[Figure, clock distribution: a PLL (clock generator) supplies Clk to modules A, B, C and D, gated by Enable_A, Enable_B and Enable_C]
The use of gated clocks is the most common approach to reducing energy: unused modules are turned off by suppressing the clock to the module.
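The effect of suppressing the clock can be illustrated by counting the clock pulses that actually reach each module. This is a behavioral sketch only: the module names follow the figure labels, but the enable traces are invented for illustration and clock-tree power is taken as proportional to the number of delivered pulses, which is a simplifying assumption.

```python
def clock_pulses(enable_trace, gated=True):
    """Count clock pulses delivered to one module over a trace of per-cycle enable bits."""
    if not gated:
        return len(enable_trace)                 # free-running clock: one pulse every cycle
    return sum(1 for en in enable_trace if en)   # gated clock: pulses only while enabled


# Hypothetical 8-cycle enable traces for four modules
traces = {
    "A": [1, 1, 1, 1, 1, 1, 1, 1],   # always busy
    "B": [1, 0, 0, 1, 0, 0, 1, 0],   # occasionally busy
    "C": [0, 0, 0, 0, 1, 1, 0, 0],   # busy in one short burst
    "D": [0, 0, 0, 0, 0, 0, 0, 0],   # unused in this interval
}

pulses_free = sum(clock_pulses(t, gated=False) for t in traces.values())
pulses_gated = sum(clock_pulses(t, gated=True) for t in traces.values())
saving = 1 - pulses_gated / pulses_free

print(f"free-running: {pulses_free} pulses, gated: {pulses_gated} pulses")
print(f"clock activity removed: {saving:.0%}")
```

The saving depends entirely on how often modules are idle; a module that is always enabled, like A here, gains nothing from gating.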
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip is a less aggressive approach that is attracting attention.
• It has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in a smaller area overhead than parallel architectures.
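The assignment of voltages to critical and noncritical modules can be sketched as follows. All names and numbers are hypothetical: the two supply levels, the module capacitances and the delays at each voltage are assumed inputs, and the overhead of level converters between voltage domains is ignored.

```python
V_HIGH, V_LOW = 3.3, 2.0   # hypothetical supply levels in volts


def assign_voltages(modules, clock_period_ns):
    """Give each module the lowest supply at which it still meets the clock period.

    modules maps a name to (switched capacitance in pF,
                            delay at V_HIGH in ns, delay at V_LOW in ns).
    """
    plan = {}
    for name, (_cap, _delay_high, delay_low) in modules.items():
        plan[name] = V_LOW if delay_low <= clock_period_ns else V_HIGH
    return plan


def switching_energy_pj(modules, plan):
    """Sum C * V^2 over all modules (pF * V^2 gives pJ per switching event)."""
    return sum(cap * plan[name] ** 2 for name, (cap, _dh, _dl) in modules.items())


modules = {
    "multiplier": (40.0, 9.5, 16.0),   # critical path: too slow at V_LOW
    "adder": (10.0, 4.0, 7.0),         # has slack at V_LOW
    "control": (5.0, 3.0, 5.5),        # has slack at V_LOW
}

plan = assign_voltages(modules, clock_period_ns=10.0)
single = switching_energy_pj(modules, {name: V_HIGH for name in modules})
multi = switching_energy_pj(modules, plan)
print(plan)
print(f"single supply: {single:.1f} pJ, multiple supplies: {multi:.1f} pJ")
```

In this example only the multiplier stays at the high voltage, and the timing constraint is still met by every module.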
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure, power manager: an observer collects workload information (observations) from the system and a controller issues commands to it]
[Figure, power state machine: states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt) and SLEEP (P = 0.16 mW, wait for wake-up event), with transition times of ~10 µs, ~90 µs and 160 ms]
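A power manager has to decide, for each predicted idle period, whether the deeper SLEEP state pays off. The sketch below uses the state powers from the power state machine above. Two points are assumptions and not statements from the slides: the 160 ms figure is taken to be the SLEEP-to-RUN wake-up time, and the system is assumed to draw RUN power while waking up. The function names are illustrative.

```python
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}   # state powers from the figure
WAKEUP_FROM_SLEEP_S = 0.160                               # assumed SLEEP-to-RUN wake-up time


def idle_energy_mj(gap_s):
    """Energy (mJ) spent waiting in IDLE for gap_s seconds."""
    return POWER_MW["IDLE"] * gap_s


def sleep_energy_mj(gap_s):
    """Energy (mJ) for sleeping through the gap, waking up at RUN power (assumption)."""
    asleep_s = gap_s - WAKEUP_FROM_SLEEP_S
    return POWER_MW["SLEEP"] * asleep_s + POWER_MW["RUN"] * WAKEUP_FROM_SLEEP_S


def break_even_s():
    """Gap length at which IDLE and SLEEP cost the same energy."""
    numerator = (POWER_MW["RUN"] - POWER_MW["SLEEP"]) * WAKEUP_FROM_SLEEP_S
    return numerator / (POWER_MW["IDLE"] - POWER_MW["SLEEP"])


def choose_state(predicted_gap_s):
    """Controller policy: sleep only if the predicted gap exceeds the break-even time."""
    return "SLEEP" if predicted_gap_s > break_even_s() else "IDLE"


print(f"break-even gap: {break_even_s():.3f} s")
for gap in (0.5, 5.0):
    print(f"gap {gap} s -> {choose_state(gap)}")
```

Under these assumptions short gaps are best spent in IDLE, because the expensive wake-up outweighs the saving; the observer's workload prediction is what makes the choice possible.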
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure, power breakdown: pie chart with sectors Clock, Memory, Control, I/O and Datapath (extracted values: 12, 16, 25 and 43)]
[Figure, dynamic instruction statistics: pie chart with sectors Compare op, Logical op, Data move, Control flow, Arithmetic op and Others]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Microprocessors' Generations
• First generation, 1971-78: behind the power curve
• Second generation, 1979-85: becoming "real" computers
• Third generation, 1985-89: challenging the "establishment"
• Fourth generation, 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation, 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3500 transistors
  - First microprocessor-based computer (Micral)
    - Targeted at laboratory instrumentation
    - Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
  - Delivered in 1974
  - 4800 transistors
  - Performance < 0.2 MIPS
• Used in the Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    - $297, or $395 with case
    - 256 bytes of memory, expandable to 64K
    - Keyboard and floppy
    - 100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
    - Homebrew Computer Club
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
  - Based on the 8088, an 8-bit bus version of the 8086
Second Generation, 1979-85
• Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture (initial 16-bit implementation)
  - First flat 32-bit address
    - Support for paging
  - General-purpose register architecture
    - Loosely based on PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation, 1985-89
• Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance and rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation, 1990-
• Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985 to the present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to higher clock rates
  - More transistors
• Architectural ideas turn transistors into performance
  - Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction-level parallelism
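To get a feel for what a sustained yearly growth factor means, the compounding can be worked out directly. This is a small arithmetic sketch; the function names are illustrative and the 10-year horizon is an arbitrary example, not a figure from the slides.

```python
import math


def cumulative_gain(rate_per_year, years):
    """Total performance multiplier after compounding a yearly growth factor."""
    return rate_per_year ** years


def doubling_time_years(rate_per_year):
    """Years needed for performance to double at the given yearly growth factor."""
    return math.log(2) / math.log(rate_per_year)


RATE = 1.6   # yearly growth factor quoted above

print(f"gain over 10 years: {cumulative_gain(RATE, 10):.1f}x")
print(f"doubling time: {doubling_time_years(RATE):.2f} years")
```

At this rate performance doubles roughly every year and a half and grows by about two orders of magnitude in a decade.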
Memory Hierarchies
• Caches hide the latency of DRAM and increase bandwidth
  - The CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches: add another level of caching
    - First multilevel cache: 1986
    - Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: advances in caching techniques
  - Reduce or hide cache miss latencies
    - early restart after cache miss (1992)
    - nonblocking caches: continue during a cache miss (1994)
  - Cache-aware combos: computers, compilers, code writers
    - prefetching: instruction to bring data into cache early
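How much a second cache level hides DRAM latency can be estimated with the usual average memory access time (AMAT) recurrence. The sketch below is illustrative: the 2-, 5- and 50-cycle access times echo the architecture comparison table in this deck, while the miss rates are hypothetical values chosen only for the example.

```python
def amat(levels, memory_cycles):
    """Average memory access time in cycles.

    levels is a list of (hit_time_cycles, local_miss_rate) pairs, ordered from
    the cache closest to the CPU outward; memory_cycles is the DRAM access time.
    """
    penalty = memory_cycles
    for hit_time, miss_rate in reversed(levels):
        # Every access pays the hit time; misses additionally pay the next level's cost
        penalty = hit_time + miss_rate * penalty
    return penalty


L1 = (2, 0.05)   # 2-cycle primary cache, assumed 5% miss rate
L2 = (5, 0.20)   # 5-cycle secondary cache, assumed 20% local miss rate
DRAM_CYCLES = 50

without_l2 = amat([L1], DRAM_CYCLES)
with_l2 = amat([L1, L2], DRAM_CYCLES)
print(f"AMAT without secondary cache: {without_l2:.2f} cycles")
print(f"AMAT with secondary cache:    {with_l2:.2f} cycles")
```

With these assumed miss rates the secondary cache cuts the average access time from 4.5 to 2.75 cycles; the benefit grows as the CPU-DRAM gap widens.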
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock
    - superscalar: uses dynamic issue decision (HW driven)
    - VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  - determine parallelism dynamically
  - execute instructions out of order
  - speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded a 20-year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  - On-chip
  - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as an attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out of order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in flight)
  - Maintains precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Uses a compiler-centric approach while avoiding its disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques
  - Static scheduling
  - Static issue (i.e. VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
• Hardware techniques
  - Dynamic scheduling
  - Dynamic issue (i.e. superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
Wide variety of approaches, both hardware and compiler intensive:
• Software-intensive: lower hardware complexity, more longer-range analysis, more machine dependence
• Hardware-intensive: more stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present.
Big Picture: ILP and Memory Systems
[Figure, ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation]
[Figure, Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be the key limit
[Figure: performance (log scale, 0.1 to 100) versus year (1965-1995) for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today where Microprocessors today where they are and what can dothey are and what can do
Tran
sist
ors
1000
10000
100000
1000000
10000000
100000000
1970 1975 1980 1985 1990 1995 2000 2005
Bit-level parallelism Instruction-level Thread-level ()
i4004
i8008i8080
i8086
i80286
i80386
R2000
Pentium
R10000
R3000
Microprocessors where they goMicroprocessors where they go
Intel more TransistorIntel more Transistor
Intel Faster DevicesIntel Faster Devices
Number of Transistors in Number of Transistors in Intelrsquos processorsIntelrsquos processors
Higher level parallelismHigher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The more prononuced are
a) simultaneons multithreaded (SMT) processor and
b) chip multiprocessors (CMT)
MultithreadingMultithreading
Microprocessor can execute multiple operations at a time4 or 6 operations per cycle
Hard to achieve this level of parallelism from single program
Can we run multiple programs (threads) on (single) processor without much effort
Simultaneous multithreading (SMT) or Hyperthreading is a solution
Parallel Thread Sequencing ModelParallel Thread Sequencing Model
Principles of SMTPrinciples of SMT
Multithreading in todayrsquos Multithreading in todayrsquos processorsprocessors
Today many high-end microprocessors are multithreaded (eg Intel Pentium 4)
Support for 2-4 threads but expect to get only 13X improvement in throughput
Chip MultiprocessorChip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model
CMP is a simple very powerful techiniques to obtain more performance in a power-effecient manner
The idea is to put several microprocessors on a single die This type of architecture is reffered also as Multiprocessor System-on-Chip (MPSoC)
The performance of small-scale CMP scales close to linear with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model - continue - continue
CMP is an atractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications we meet in network processors multimedia hubs signal processors etc
MPSoCs are usually implemented as heterogenous systems
CMT and SMT can coexist-a CMP die can integrate several SMT microprocessors
Generic circa 2010 Generic circa 2010 MicroprocessorMicroprocessor
4 ndash 8 general-purpose processing engines on chip used to execute independent programs
Explicitly parallel programs (when possible) Speculatively parallel threads
Special-purpose processing units (eg DSP functionality)
Elaborate memory hierarchyElaborate inter-chip communication facilities
Characteristics of superscalar simultaneous Characteristics of superscalar simultaneous multithreading and chip multiprocessor architecturesmultithreading and chip multiprocessor architectures
Characteristic SuperscalarSimultaneousmultithreading
Chipmultiprocessor
Number of CPUs 1 1 8CPU issue width 12 12 2 per CPUNumber of threads 1 8 1 per CPUArchitecture registers (for integer and floating point) 32 32 per thread 32 per CPUPhysical registers (for integer and floating point 32 + 256 256 + 256 32 + 32 per CPUInstruction window size 256 256 32 per CPUBranch predictor table size (entries) 32768 32768 8x4096Return stack size 64 entries 64 entries 8x8 entriesInstruction (I) and data (D) cache organization 1x8 banks 1x8 banks 1 bankI and D cache sizes 128 kbytes 128 kbytes 16 kbytes per CPUI and D cache associativities 4-way 4-way 4-wayI and 0 cache line sizes (bytes) 32 32 32I and P cache access times (cycles) 2 2 1Secondary cache organization (Mbytes) 1x8 banks 1x8 banks 1x8 banksSecondary cache size (bytes) 8 8 8Secondary cache associativity 4-way 4-way 4-waySecondary cache line size (bytes) 32 32 32Secondary cache access time (cycles) 5 5 7Secondary cache occupancy per access (cycles) 1 1 1Memory organization (no of banks) 4 4 4Memory access time (cycles) 50 50 50Memory occupancy per access (cycles) 13 13 13
The microprocessor tomorrowThe microprocessor tomorrow When we say ldquomicroprocessorrdquo tomorrow we generally mean the shaded area of the figure
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Challenges in EducationChallenges in Education
bullChanges in curriculaChanges in curriculabullFundamentalsFundamentalsbullA sort of the challenge we should acceptA sort of the challenge we should accept
ChalengeChalengess in Education in Education
It has often said that Where you stand depends on where you sit
In this context starting from our positions an experiences this is our view concerning the theme How shall we satisfy the long-term educational needs of engineers
How to organize a training of new How to organize a training of new engineers engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then
Is the whole system of engineering education ndash not just the undergraduate curriculum ndash organized to support todayrsquos graduate for the next 40 years
We think not on both counts
Our view amp our experienceOur view amp our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academy and industry) which admittedly has changed more rapidly than same other fields
Changes in curriculaChanges in curricula
It is almost a clicheacute to talk about change ndash so mach so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentalsWhat are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we What kinds of fundamentals we need now ndash some examplesneed now ndash some examples
Information technology (IT) will be embedded in virtually engineered product and process in the future ndash ie the design space for all engineers will include ITDiscrete mathematics not continuous math is the underpinning of ITIt is a new fundamental
Biological materials and process are a bit behind IT in their impact on engineering but they a closing fastThus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of FundamentalsKinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fieldsMore knowledge of the full spectrum will be the fundamental
Engineering is global and is performed in a holistic business contextThe engineer must design under constraints that include global cultural and business contexts and so must understand themThey two are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching,
• distance learning,
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he had never seen a process that could not be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
[Figures: Computer Food Chain in 1988, 1997 and 2003; labels include Mainframe and Supercomputer]
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point?
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, i.e. 69 × 10^6 MWh
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
“CMOS circuits dissipate little power by nature. So believed circuit designers.” (Kuroda-Sakurai 95)
“By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced.”
Power dissipation in time
[Figure: power (W), log scale from 0.01 to 100, versus year (80 to 95); trend line marked ×4 every 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage [V], power per chip [W] and VDD current [A] versus year, 1998 to 2014]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA) and the Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
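As a rough numerical illustration of the sum above, the sketch below evaluates each term separately. The function name and all parameter values are hypothetical, chosen only to show how the terms compare; they are not taken from the slides.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p_static=0.0):
    """Return (P1, P2, P3, P4, total) in watts for one set of circuit parameters."""
    p1 = p_t * c_load * v_dd ** 2 * f_clk  # capacitive switching power (dynamic)
    p2 = i_sc * v_dd                       # short-circuit power (dynamic)
    p3 = i_leak * v_dd                     # leakage current power (static)
    p4 = p_static                          # other static dissipation (minor)
    return p1, p2, p3, p4, p1 + p2 + p3 + p4


# Hypothetical values: activity 0.2, 1 nF switched load, 2.5 V supply, 100 MHz clock,
# 2 mA average short-circuit current, 0.1 mA leakage current.
parts = average_power(p_t=0.2, c_load=1e-9, v_dd=2.5, f_clk=100e6,
                      i_sc=2e-3, i_leak=1e-4)
for name, value in zip(("P1", "P2", "P3", "P4", "total"), parts):
    print(f"{name}: {value:.5f} W")
```

With these made-up numbers the switching term P1 is by far the largest, which is the situation the slide describes as "dominant".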
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
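The quadratic improvement can be made concrete with a short worked example. Assuming the switching activity, load capacitance and clock frequency stay fixed, and taking 5 V and 3 V purely as illustrative supply values:

```latex
\frac{P_{sw}(3\,\mathrm{V})}{P_{sw}(5\,\mathrm{V})}
  = \frac{p_t C_L (3)^2 f_{CLK}}{p_t C_L (5)^2 f_{CLK}}
  = \frac{9}{25} = 0.36
```

Under these assumptions, lowering the supply by 40% removes about 64% of the switching power.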
Amount of Power Dissipation Reduction
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], with the voltage axis marked at 0.6, 3.0 and 5.0 V]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is to decrease the activity of some parts within the VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way of reducing power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product pt·CL is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit or technology level
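The trade-off between lower Vdd and longer circuit delay can be sketched numerically. The delay model below is an assumption that does not appear in the slides: it uses the widely cited alpha-power approximation, delay ∝ Vdd / (Vdd - Vt)^alpha, with a hypothetical threshold voltage of 0.7 V and alpha = 2. Power is taken as proportional to Vdd² at a fixed clock frequency.

```python
V_REF = 5.0  # reference supply voltage in volts (illustrative)


def relative_delay(v_dd, v_t=0.7, alpha=2.0):
    """Gate delay in arbitrary units under the assumed alpha-power model."""
    return v_dd / (v_dd - v_t) ** alpha


def normalized_delay(v_dd):
    """Delay relative to the delay at V_REF."""
    return relative_delay(v_dd) / relative_delay(V_REF)


def normalized_power(v_dd):
    """Switching power relative to V_REF, with activity, load and frequency fixed."""
    return (v_dd / V_REF) ** 2


for v in (5.0, 3.3, 3.0, 2.0):
    print(f"Vdd = {v:.1f} V  power x{normalized_power(v):.2f}  delay x{normalized_delay(v):.2f}")
```

Under these assumptions, each step down in voltage buys a quadratic power saving but pays for it with a delay that grows faster and faster as Vdd approaches the threshold voltage.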
Low Power and Low Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology:
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
(Figure annotation: higher impact, more options)
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention.
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating ...
[Figure: a central PLL (fCLK) and four local blocks, PLL 1/DLL 1 to PLL 4/DLL 4, supplying local frequencies such as f11, f12, f21, f31, f32 and f41]
Energy Minimization Using Multiple Frequency
[Figure, PLL based: phase detector (CLKREF, CLKFB, up/down), current pump, loop filter, VCO, divider by N, regulated voltage, clock distribution & frequency multiplier, logic, digital system]
[Figure, DLL based: phase detector (CLKREF, CLKFB, up/down), current pump, loop filter, control, VCDL with delay cells DC1, DC2, ..., DCn (in, out, TVCDL), outputs f1, 2f1, ..., nf1, digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure, clock gating: latch, enable, clock, gated clock, target flip-flops (DFF with D, C, Q); activated and deactivated intervals]
[Figure, clock distribution: PLL (clock generator), Clk, modules A, B, C and D, Enable_A, Enable_B, Enable_C]
- The use of gated clocks is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
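A minimal cycle-level sketch of the idea follows. It assumes the common latch-based scheme, in which the enable signal is captured while the clock is low and then ANDed with the clock, and it assumes that the clock energy spent in a module is proportional to the number of clock edges that reach its flip-flops. The enable pattern is hypothetical.

```python
def gated_clock_edges(enable_per_cycle):
    """Count the rising clock edges that reach the target flip-flops."""
    edges = 0
    for enable in enable_per_cycle:
        # Clock low phase: the latch is transparent and captures the enable value.
        latched = enable
        # Clock high phase: the latch holds; gated clock = clock AND latched enable.
        if latched:
            edges += 1
    return edges


# Hypothetical activity: the module is needed in 3 of 8 cycles.
enables = [1, 1, 0, 0, 0, 1, 0, 0]
delivered = gated_clock_edges(enables)
saving = 1 - delivered / len(enables)
print(f"{delivered} of {len(enables)} clock edges delivered, {saving:.1%} suppressed")
```

Holding the enable in a latch, rather than ANDing it directly with the clock, is what keeps a late-changing enable from producing glitches on the gated clock.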
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip is a less aggressive approach that is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
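The assignment rule described above can be sketched as follows. Everything concrete here is hypothetical: the two voltage levels, the module names, their delays and timing budgets, and the switched capacitances. The sketch also ignores the level converters a real multi-voltage design needs between domains.

```python
V_HIGH = 3.3  # volts, hypothetical high supply
V_LOW = 2.4   # volts, hypothetical low supply

# Hypothetical modules: timing budget, delay at each supply, switched capacitance (arbitrary units).
MODULES = [
    {"name": "alu", "budget_ns": 10.0, "delay_high_ns": 9.5, "delay_low_ns": 14.0, "cap": 4.0},
    {"name": "decoder", "budget_ns": 10.0, "delay_high_ns": 4.0, "delay_low_ns": 6.0, "cap": 2.0},
    {"name": "io", "budget_ns": 10.0, "delay_high_ns": 5.0, "delay_low_ns": 7.5, "cap": 2.0},
]


def assign_voltages(modules):
    """Give each module the low supply if it still meets its timing budget there."""
    assignment = {}
    for m in modules:
        fits_at_low = m["delay_low_ns"] <= m["budget_ns"]
        assignment[m["name"]] = V_LOW if fits_at_low else V_HIGH
    return assignment


def relative_energy(modules, assignment):
    """Switching energy (proportional to C * V^2) relative to running everything at V_HIGH."""
    multi = sum(m["cap"] * assignment[m["name"]] ** 2 for m in modules)
    single = sum(m["cap"] * V_HIGH ** 2 for m in modules)
    return multi / single


assignment = assign_voltages(MODULES)
ratio = relative_energy(MODULES, assignment)
print(assignment)
print(f"energy relative to single-supply design: {ratio:.2f}")
```

In this made-up example only the timing-critical module stays on the high supply, and the design as a whole uses roughly three quarters of the single-supply switching energy.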
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure, Power Manager: an observer receives workload information (observations) from the system and a controller sends commands to it]
[Figure, Power State Machine: RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt), SLEEP (P = 0.16 mW, wait for wake-up event); transition times labeled ~10 µs, ~90 µs and 160 ms]
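A controller policy for such a power state machine can be sketched in a few lines. The state powers are the ones on the slide; the wake-up latencies attached to IDLE and SLEEP are an assumed reading of the figure's transition labels, and the selection rule itself is a hypothetical policy, not one given in the slides.

```python
# Power per state in milliwatts, as labeled on the power state machine.
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}

# Assumed time to return to RUN from each state, in seconds.
WAKEUP_S = {"RUN": 0.0, "IDLE": 10e-6, "SLEEP": 160e-3}


def choose_state(idle_s, max_wakeup_s):
    """Pick the lowest-power state whose wake-up time fits the idle period and the latency limit."""
    candidates = [
        state for state in POWER_MW
        if WAKEUP_S[state] <= max_wakeup_s and WAKEUP_S[state] <= idle_s
    ]
    return min(candidates, key=lambda state: POWER_MW[state])


def idle_energy_mj(state, idle_s):
    """Energy in millijoules spent waiting in a state (transition energy ignored)."""
    return POWER_MW[state] * idle_s


for idle_s, limit_s in ((1.0, 0.2), (1.0, 0.001), (5e-6, 1.0)):
    state = choose_state(idle_s, limit_s)
    print(f"idle {idle_s} s, latency limit {limit_s} s -> {state}, "
          f"{idle_energy_mj(state, idle_s):.4f} mJ")
```

The observer's role is to supply the expected idle time; the long wake-up time of SLEEP is what keeps the controller from using it for short idle periods.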
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure, power breakdown: pie chart with segments Clock, Memory, Control, I/O and Datapath (values shown: 12, 16, 25, 43)]
[Figure, dynamic instruction statistics: pie chart with segments Compare op, Logical op, Data Move, Control Flow, Arithmetic op and Others]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First Generation 1971-78
 - Behind the power curve
• Second Generation 1979-85
 - Becoming “real” computers
• Third Generation 1985-89
 - Challenging the “establishment”
• Fourth Generation 1990-
 - Architectural and performance leadership
The microprocessor today
When we say “microprocessor” today, we generally mean the shaded area of the figure.
The First Generation 1971-78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
 - Narrow datapaths (= slow performance)
 - Awkward architectures
 - Assembly language + some BASIC
• Processors
 - Intel 4004, 8008, 8080, 8086
 - Zilog Z-80
 - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
 - 3,500 transistors
 - First microprocessor-based computer (Micral)
   Targeted at laboratory instrumentation
   Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
 - Delivered in 1974
 - 4,800 transistors
 - Performance < 0.2 MIPS
• Used in Altair 8800 system
 - Kit form (advertised in Popular Electronics) in 1975
   $297, or $395 with case
   256 bytes of memory, expandable to 64K
   Keyboard and floppy
   100-line bus becomes S-100, first microcomputer bus
 - Gates & Allen write BASIC
 - Wozniak builds one
   Homebrew Computer Club
Intel 8086
• Introduced in 1978
 - Performance < 0.5 MIPS
• New 16-bit architecture
 - “Assembly language” compatible with 8080
 - 29,000 transistors
 - Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
 - Based on 8088, the 8-bit bus version of the 8086
Second Generation 1979-85
Becoming “real” computers
 - First 32-bit architecture (68000)
 - First virtual memory support
 - Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
 - Motorola 68000, 68020
 - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
 - First 32-bit architecture
   initial 16-bit implementation
 - First flat 32-bit address
   Support for paging
 - General-purpose register architecture
   Loosely based on PDP-11
• First implementation in 1979
 - 68,000 transistors
 - < 1 MIPS
• Used in
 - Apple Mac
 - Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the “establishment”
 - Microprocessors surpass minicomputers in performance, rival mainframes
 - Implementation technology of choice: all new architectures are microprocessors
 - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
 - MIPS R2000, R3000
 - Sun SPARC
 - HP PA-RISC
MIPS R2000
• Several firsts
 - First RISC microprocessor
 - First microprocessor to provide integrated support for instruction & data cache
 - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
 - 125,000 transistors
 - 5-8 MIPS
Fourth Generation 1990-
Architectural and performance leadership
 - First 64-bit architecture
 - First multiple-issue machine
 - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
 - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
 - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
 - True from 1985 to present
• Combination of technology and architectural enhancements
 - Technology provides faster transistors and more of them
 - Faster transistors lead to high clock rates
 - More transistors
   Architectural ideas turn transistors into performance
   Responsible for about half the yearly performance growth
• Two key architectural directions
 - Sophisticated memory hierarchies
 - Exploiting instruction-level parallelism
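To get a feel for what this growth rate means when compounded, take it to be 1.6x per year and assume it holds steadily:

```latex
1.6^{10} \approx 110
\qquad\text{and}\qquad
t_{2\times} = \frac{\ln 2}{\ln 1.6} \approx \frac{0.693}{0.470} \approx 1.5\ \text{years}
```

That is, under this assumption performance doubles roughly every year and a half and grows about a hundredfold per decade.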
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
 - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
 - On-chip: from 128 bytes (1984) to 100K+ bytes
 - Multilevel caches add another level of caching
   First multilevel cache: 1986
   Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
 - Reduce or hide cache miss latencies
   early restart after cache miss (1992)
   nonblocking caches: continue during a cache miss (1994)
 - Cache-aware combos: computers, compilers, code writers
   prefetching: instruction to bring data into cache early
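How a second cache level hides DRAM latency can be illustrated with the standard average memory access time (AMAT) model, which is not stated in the slides and is used here as an assumption. The hit times and miss rates below are hypothetical round numbers.

```python
def amat_one_level(l1_hit, l1_miss_rate, mem_time):
    """Average access time in cycles with a single cache in front of memory."""
    return l1_hit + l1_miss_rate * mem_time


def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_time):
    """Average access time in cycles with a secondary cache between L1 and memory."""
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_time)


# Hypothetical: 2-cycle L1, 5-cycle L2, 50-cycle memory,
# 5% of accesses miss in L1 and 20% of those also miss in L2.
single = amat_one_level(2, 0.05, 50)
double = amat_two_level(2, 0.05, 5, 0.20, 50)
print(f"one level: {single:.2f} cycles, two levels: {double:.2f} cycles")
```

With these made-up numbers the secondary cache cuts the average access time from 4.5 to 2.75 cycles, because most L1 misses no longer pay the full memory latency.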
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
 - Overlapping execution in a pipeline
 - Issuing multiple instructions per clock
   superscalar: uses dynamic issue decision (HW driven)
   VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
 - determine parallelism dynamically
 - execute instructions out-of-order
 - speculative execution depending on branch prediction
• “Off-the-shelf” ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
 - On-chip
 - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
 - Deep pipeline
 - 1.4M transistors
 - Initially 100 MHz
 - > 50 MIPS
Intel i860
• First multiple-issue microprocessor
 - 2 instructions/clock
 - Dual issue mode
 - Novel push pipeline
 - Novel cache bypass
• Implemented in 1991
 - 1.3M transistors
 - 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
 - Instructions scheduled and executed out-of-order
 - Up to 4 instructions can complete per clock
 - Window of 32 instructions (up to 32 in-flight)
 - Maintains precise state by completing instructions in order
• Implemented in 1996
 - 6.8M transistors
 - 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
 - Use compiler-centric approach while avoiding disadvantages
 - Parallelism demarcated by the compiler
 - Many special instructions & features for exploiting ILP in the compiler
• Itanium
 - First implementation (2001)
 - 25M transistors
 - 800 MHz
 - 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Wide variety of approaches, both hardware and compiler intensive
• Software techniques
 - Static scheduling
 - Static issue (i.e. VLIW)
 - Static branch prediction
 - Alias/pointer analysis
 - Static speculation
 Lower hardware complexity; more longer-range analysis; more machine dependence
• Hardware techniques
 - Dynamic scheduling
 - Dynamic issue (i.e. superscalar)
 - Dynamic branch prediction
 - Dynamic disambiguation
 - Dynamic speculation
 More stable performance; higher complexity; potential clock rate impact
No clear-cut winners at present
Big Picture - ILP and Memory Systems
[Figure, ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation]
[Figure, Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year (1970 to 2005) for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000; eras marked bit-level parallelism, instruction-level, thread-level (?)]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are
a) simultaneous multithreading (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
• A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
• It is hard to achieve this level of parallelism from a single program
• Can we run multiple programs (threads) on a (single) processor without much effort?
• Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
• Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
• Support for 2-4 threads, but expect to get only 1.3X improvement in throughput
Chip Multiprocessor
• Several processor cores in one die
• Shared L2 caches
• Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
• CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner
• The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
• The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model - continued
• CMP is an attractive option to use when moving to a new process technology such as SoC
• Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
• MPSoCs are usually implemented as heterogeneous systems
• CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
• 4-8 general-purpose processing engines on chip, used to execute
 - independent programs
 - explicitly parallel programs (when possible)
 - speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures

Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say “microprocessor” tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that “where you stand depends on where you sit”.
In this context, starting from our positions and experience, this is our view on the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now: some examples
• Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
• Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
• Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
• Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching,
• distance learning,
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he had never seen a process that could not be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
1997 Computer Food Chain1997 Computer Food Chain
2003 Computer Food Chain2003 Computer Food Chain
M a infra m e
Sup e rc o m p ute r
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline - Low Power DesignOutline - Low Power Design
bullPower trends in VLSIPower trends in VLSIbullView Point on PowerView Point on PowerbullResearch Efforts in Low Power DesignResearch Efforts in Low Power DesignbullIs there an Optimal Design PointIs there an Optimal Design Point
During 1995 energy consumption of all PC machines installed in USA was 60 106 MWh
bullDuring 2000 energy consumption of all PC machines installed in USA was 10 of the total energy production
bullDuring 2015 on except that the energy consumption of all PC machines will be 15 greater then 1995 or 69106 MWh
Power consumptionPower consumption
bullbattery operated equipmentsbullmobile communication equipmentsbullwireless communication equipmentsbullinstrumentation bullconsumer electronics bullbiomedical technologies bullindustry bullprocess controls
Typical Low-Power Applications
ldquoCMOS Circuits dissipate little power by nature So believed circuit designersrdquo(Kuroda-Sakurai 95)
ldquoBy the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages even if the supply voltage can be feasibly reducedrdquo
95908580001
01
1
10
100P
ow
er (
W)
x4 3years
Power dissipation in timePower dissipation in time
Gloom and Doom predictionsGloom and Doom predictions
Power density will increasePower density will increase
Year
Vo
ltag
e [V
]
Po
wer
per
ch
ip [
W]
VD
D c
urr
ent
[A]
VDD Power and Current TrendVDD Power and Current Trend
1998 2002 2006 2010 20140
05
1
15
2
25
0 0
200 500
Current
Power
Voltage
International Technology Roadmap for Semiconductors 1999 update sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA) Electronic Industries Association of Japan (EIAJ) Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
( Taken from Sakurairsquos ISSCC 2001 presentation)
Power Delivery Problem (not just Power Delivery Problem (not just California)California)
Your carstarter
Power Consumption New Power Consumption New Dimension in DesignDimension in Design
Sources of Power Sources of Power ConsumptionConsumption
The three major sources of power consumption in digital CMOS circuits are
21 2 3avg t L dd clk sc dd leakage ddP p C V f I V I V P P P
where
P1 ndash capacitive switching power (dynamic - dominant)
P2 ndash short circuit power (dynamic)
P3 ndash leakage current power (static)
P4 ndash static power dissipation (minor)
+ P4
Research Efforts in Low-Power DesignResearch Efforts in Low-Power Design
Psw = pt CL V2
dd fCLK
Reduce Switching ActivitybullConditional clockbullConditional prechargebullSwitching-off inactive blocksbullConditional execution
Run it slowerbullUse parallelismbullLess pipeline stagesbullUse double-edge flip-flop
Technology scalingbullThe highest winbullThresholds should scalebullLeakage starts to bytebullDynamic voltage scaling
Reduce the active loadbullMinimize the circuitsbullUse more efficient designbullCharge recycling bullMore efficient layout
Reducing the Power Reducing the Power DissipationDissipation
The power dissipation can be minimized by reducing
supply voltageload capacitanceswitching activity
ndash Reducing the supply voltage brings a quadratic improvement
ndash Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Amount of Reducing the Power DissipationDissipation
Gate Delay and Power Dissipation Gate Delay and Power Dissipation in Term of Supply Voltagein Term of Supply Voltage
06 30 50
Supply voltage [ V ]
Ga
te d
elay
[n
s]
(no
rma
lize
d)
Po
we
r d
issi
pat
ion
[ W
](n
orm
ali
zed
)
1
1 02 5
1
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is: decreasing the activity of some parts within the VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
The terms of the switching-power equation can each be reduced:
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product pt·CL is called the average switched capacitance per cycle, and this capacitance can be reduced at the system, architectural, RTL, circuit, or technology level
General Approaches to Reduce Power
Low Power and Low Energy System Design
The design of low power circuits can be tackled at different levels, from system to technology (higher levels have higher impact and more options):
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
Multiple frequency on the chip is a less aggressive approach that is attracting more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL generating fCLK feeds four local clock generators (PLL 1/DLL 1 to PLL 4/DLL 4) that produce the local frequencies f11, f12, f21, f31, f32, and f41]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation: phase detector (CLKREF and CLKFB inputs, up/down outputs), current pump, loop filter, VCO, divider by N, regulated voltage, clock distribution & frequency multiplier, digital system]
[Figure: DLL-based clock generation: phase detector (CLKREF and CLKFB inputs, up/down outputs), current pump, loop filter, control, voltage-controlled delay line (VCDL, delay TVCDL) built from delay cells DC1, DC2, ..., DCn, logic producing f1, 2f1, ..., nf1, digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating: a latch captures the enable signal and combines it with the clock to produce the gated clock that drives the target D flip-flops (inputs D and C, output Q); waveforms show the gated clock activated and deactivated]
[Figure: clock distribution: a PLL (clock generator) drives Clk to modules A, B, C, and D, with Enable_A, Enable_B, and Enable_C gating individual modules]
– The use of a gated clock is the most common approach to reduce energy. Unused modules are turned off by suppressing the clock to the module
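The latch-based gating described above can be sketched as a small cycle-level simulation. This is a behavioral illustration only, assuming a level-sensitive latch that is transparent while the clock is low; the function names and the sample waveforms are hypothetical.

```python
def gated_clock(clock, enable):
    """Latch-based clock gate.

    The enable is captured only while the clock is low and is then ANDed
    with the clock, so a change of enable during a high phase cannot
    produce a partial pulse on the gated clock.
    """
    latched = 0
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:            # latch is transparent while the clock is low
            latched = en
        out.append(clk & latched)
    return out


def rising_edges(signal):
    """Count 0 -> 1 transitions, assuming the signal starts from 0."""
    previous = [0] + signal[:-1]
    return sum(1 for p, c in zip(previous, signal) if p == 0 and c == 1)


clock = [0, 1, 0, 1, 0, 1, 0, 1]
enable = [1, 1, 0, 0, 0, 1, 1, 1]   # module idle in the middle of the trace
gated = gated_clock(clock, enable)
print("edges delivered:", rising_edges(gated), "of", rising_edges(clock))
```

Every suppressed edge is a cycle in which the flip-flops of the idle module and their clock wiring do not switch, which is where the energy saving comes from.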
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
[Figure: power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt), and SLEEP (P = 0.16 mW, wait for wake-up event); transitions labeled ~10 µs, ~90 µs, and 160 ms]
[Figure: power manager: an observer collects observations and workload information from the system, and a controller issues commands to it]
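A minimal sketch of a power manager decision for such a state machine follows. The state powers are read from the slide (taking "016mW" as 0.16 mW); the assignment of the 10 µs and 160 ms wake-up latencies to IDLE and SLEEP, the selection rule, and the neglect of transition energy are all assumptions made for illustration.

```python
# State power in milliwatts, as read from the power state machine figure.
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}

# Wake-up latency in seconds (assumed mapping of the figure's labels).
WAKEUP_S = {"IDLE": 10e-6, "SLEEP": 160e-3}


def choose_state(predicted_idle_s):
    """Pick the lowest-power state whose wake-up latency fits the idle period.

    Hypothetical policy: a state is usable only if the system can wake up
    again before the predicted idle period ends; otherwise stay in RUN.
    """
    candidates = [s for s, t in WAKEUP_S.items() if t < predicted_idle_s]
    if not candidates:
        return "RUN"
    return min(candidates, key=lambda s: POWER_MW[s])


def idle_energy_mj(state, idle_s):
    """Energy in millijoules spent in `state` for `idle_s` seconds.

    Transition energy is ignored (an assumption; a real policy needs it
    to compute a break-even time).
    """
    return POWER_MW[state] * idle_s   # mW * s = mJ


for idle in (1e-6, 0.01, 1.0):
    state = choose_state(idle)
    print(idle, state, idle_energy_mj(state, idle))
```

The observer in the figure would supply `predicted_idle_s` from workload information; how well that prediction works largely determines how much of the gap between 400 mW and 0.16 mW can be exploited.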
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown: clock, memory, control, I/O, datapath. Dynamic instruction statistics: arithmetic op, control flow, data move, compare op, logical op, others]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline – Microprocessors’ Generations
• First Generation 1971-78: Behind the power curve
• Second Generation 1979-85: Becoming “real” computers
• Third Generation 1985-89: Challenging the “establishment”
• Fourth Generation 1990-: Architectural and performance leadership
The microprocessor today
When we say “microprocessor” today, we generally mean the shaded area of the figure
The First Generation 1971-78
Getting enough bits and transistors
Transistor counts < 50000
Performance < 0.5 MIPS
Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
First general-purpose single-chip microprocessor
Shipped in 1971
8-bit architecture, 4-bit implementation
2300 transistors
Performance < 0.1 MIPS
8008: 8-bit implementation in 1972
– 3500 transistors
– First microprocessor-based computer (Micral)
   Targeted at laboratory instrumentation
   Mostly sold in Europe
Intel 8080
Intel’s first 16-bit architecture
– Delivered in 1974
– 4800 transistors
– Performance < 0.2 MIPS
Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
   $297, or $395 with case
   256 bytes of memory, expandable to 64K
   Keyboard and floppy
   100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
Homebrew Computer Club
Intel 8086
Introduced in 1978
– Performance < 0.5 MIPS
New 16-bit architecture
– “Assembly language” compatible with 8080
– 29000 transistors
– Includes memory protection, support for FP coprocessor
In 1981 IBM introduces PC
– Based on 8088: 8-bit bus version of 8086
Second Generation 1979-85
Becoming “real” computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs, and PCs based on microprocessors
Transistors > 50000
Performance <= 1 MIPS
Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
Major architectural step in microprocessors
– First 32-bit architecture
   initial 16-bit implementation
– First flat 32-bit address
   Support for paging
– General-purpose register architecture
   Loosely based on PDP-11
First implementation in 1979
– 68000 transistors
– < 1 MIPS
Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the “establishment”
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
Transistors < 500K
Performance > 5 MIPS
Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
– 125000 transistors
– 5-8 MIPS
Fourth Generation 1990-
Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
Transistors > 1M
Clock rates > 100 MHz
Performance > 50 MIPS
Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
Generation 4.5: same basic approach, but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
Increase performance at 1.6x per year
– True from 1985 to present
Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to higher clock rates
– More transistors
Architectural ideas turn transistors into performance
– Responsible for about half the yearly performance growth
Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism
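To give a sense of scale, the yearly factor can be compounded over a period. The worked example below reads the slide's growth rate as 1.6x per year and takes 15 years (1985 to 2000) as an illustrative span; splitting the yearly factor evenly between technology and architecture is only one possible reading of "about half".

```latex
1.6^{15} \approx 1153
\qquad\text{and}\qquad
1.6 = \sqrt{1.6}\cdot\sqrt{1.6} \approx 1.26 \times 1.26
```

On this reading, technology alone and architecture alone would each deliver roughly 1.26x per year, or about $1.26^{15} \approx 34$ over the same 15 years, and only their product reaches the three-orders-of-magnitude total.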
Memory Hierarchies
Caches hide latency of DRAM and increase BW
– CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches: add another level of caching
   First multilevel cache: 1986
   Secondary cache sizes today: 128KB to 4-16 MB
Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
   early restart after cache miss (1992)
   nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
   prefetching: instruction to bring data into cache early
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
   superscalar: uses dynamic issue decision (HW driven)
   VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out-of-order
– speculative execution depending on branch prediction
“Off-the-shelf” ILP techniques yielded 20 year path
MIPS R4000
First 64-bit architecture
Integrated caches
– On-chip
– Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
Implemented in 1991
– 1.3M transistors
– 50 MIPS
Used primarily as attached processor (e.g., graphics)
MIPS R10000
First speculative processor
– Instructions scheduled and executed out-of-order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in-flight)
– Maintain precise state by completing instructions in order
Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today’s Uniprocessor ILP Menu
Software Techniques
– Static scheduling
– Static issue (i.e., VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e., superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
Wide variety of approaches, both hardware and compiler intensive
Software techniques: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware techniques: more stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two “mountains” of techniques in order of ascent. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers, and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistor count (log scale, 1000 to 100000000) versus year (1970 to 2005) for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000; eras labeled bit-level parallelism, instruction-level parallelism, and thread-level parallelism]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel’s processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today’s processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4)
They support 2-4 threads, but expect to get only a 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple and very powerful technique to obtain more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model – continued
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
4 – 8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
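One way to read the access times in this table is through an average memory access time (AMAT) estimate. The sketch below is illustrative only: the cycle counts come from the table, but the miss rates are hypothetical, the same rates are used for both designs (the smaller per-CPU caches of the chip multiprocessor would likely miss more often), and treating the listed access times as additive miss penalties is a simplifying assumption.

```python
def amat(l1_time, l1_miss_rate, l2_time, l2_miss_rate, mem_time):
    """Average memory access time in cycles for a two-level cache hierarchy.

    Every access pays the primary cache time; a primary miss adds the
    secondary cache time; a secondary miss adds the main memory time.
    """
    return l1_time + l1_miss_rate * (l2_time + l2_miss_rate * mem_time)


# Hypothetical miss rates, identical for both designs for simplicity.
L1_MISS = 0.05
L2_MISS = 0.20

# Cycle counts taken from the table above.
superscalar = amat(l1_time=2, l1_miss_rate=L1_MISS,
                   l2_time=5, l2_miss_rate=L2_MISS, mem_time=50)
chip_mp = amat(l1_time=1, l1_miss_rate=L1_MISS,
               l2_time=7, l2_miss_rate=L2_MISS, mem_time=50)

print(f"superscalar/SMT: {superscalar:.2f} cycles")
print(f"chip multiprocessor: {chip_mp:.2f} cycles")
```

Under these assumed rates the single-cycle primary cache outweighs the slower secondary cache; with a higher primary miss rate for the 16-kbyte caches the comparison could reverse.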
The microprocessor tomorrow
When we say “microprocessor” tomorrow, we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that “Where you stand depends on where you sit”
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education – not just the undergraduate curriculum – organized to support today’s graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change – so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT
Discrete mathematics, not continuous math, is the underpinning of IT
It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast
Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields
More knowledge of the full spectrum will be a fundamental
Engineering is global and is performed in a holistic business context
The engineer must design under constraints that include global cultural and business contexts, and so must understand them
These two are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
2003 Computer Food Chain
[Figure: computer food chain, including mainframe and supercomputer]
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline – Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh
Power consumption
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
“CMOS circuits dissipate little power by nature. So believed circuit designers” (Kuroda-Sakurai 95)
“By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced”
Power dissipation in time
[Figure: power (W, log scale, 0.01 to 100) versus year (80 to 95); trend of x4 per 3 years]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage [V], power per chip [W], and VDD current [A] versus year (1998 to 2014)]
International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA), Electronic Industries Association of Japan (EIAJ), Korea Semiconductor Industry Association (KSIA), and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai’s ISSCC 2001 presentation)
Power Delivery Problem (not just Power Delivery Problem (not just California)California)
Your carstarter
Power Consumption New Power Consumption New Dimension in DesignDimension in Design
Sources of Power Sources of Power ConsumptionConsumption
The three major sources of power consumption in digital CMOS circuits are
21 2 3avg t L dd clk sc dd leakage ddP p C V f I V I V P P P
where
P1 ndash capacitive switching power (dynamic - dominant)
P2 ndash short circuit power (dynamic)
P3 ndash leakage current power (static)
P4 ndash static power dissipation (minor)
+ P4
Research Efforts in Low-Power DesignResearch Efforts in Low-Power Design
Psw = pt CL V2
dd fCLK
Reduce Switching ActivitybullConditional clockbullConditional prechargebullSwitching-off inactive blocksbullConditional execution
Run it slowerbullUse parallelismbullLess pipeline stagesbullUse double-edge flip-flop
Technology scalingbullThe highest winbullThresholds should scalebullLeakage starts to bytebullDynamic voltage scaling
Reduce the active loadbullMinimize the circuitsbullUse more efficient designbullCharge recycling bullMore efficient layout
Reducing the Power Reducing the Power DissipationDissipation
The power dissipation can be minimized by reducing
supply voltageload capacitanceswitching activity
ndash Reducing the supply voltage brings a quadratic improvement
ndash Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Amount of Reducing the Power DissipationDissipation
Gate Delay and Power Dissipation Gate Delay and Power Dissipation in Term of Supply Voltagein Term of Supply Voltage
06 30 50
Supply voltage [ V ]
Ga
te d
elay
[n
s]
(no
rma
lize
d)
Po
we
r d
issi
pat
ion
[ W
](n
orm
ali
zed
)
1
1 02 5
1
bull Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
bull The main interest of many researches is now oriented towards lowering the energy dissipation of these systems while still maintaining the high-throughput in real time processing
Needs for Low-Power
Low-Power Design Low-Power Design Techniques Techniques
The basic idea is
Decreasing activity of the some parts within VLSI IC
The term power manager refer to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low active conditions for various parts of the circuit and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-active components
a) Reduction in fCLK is an option acceptable when
some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way for
power reduction since the power is proportional to the square of Vdd The problem with reducing
Vdd is that it leads to an increase in circuit delay
c) The product ptCL is called the average switched
capacitance per cycle and the main directions for reducing this capacitance are done at system- architectural- RTL- circuit- or technology level
General Approaches to Reduce PowerGeneral Approaches to Reduce Power
Low Power and Low Energy System DesignLow Power and Low Energy System Design
higher impactmore options
AlgorithmLevel
ArchitectureLevel
CircuitLevel
Process DeviceLevel
SystemLevel Design partitioning Power Down
Complexity Concurrency LocalityRegularity Data representation
Voltage scaling ParallelismInstruction set Signal correlations
Transistor sizing Logic optimizationActivity Driven Power Down Low-swing logic Adiabatic switching
Threshold Reduction Multi-threshold
The design of low power circuits can be tackled at different levels from system to technology
Multiple Frequency on the Chip as Multiple Frequency on the Chip as Technique to Reduce PowerTechnique to Reduce Power
Less aggressive approach is which attracts more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
f11 f12
PLL 1 DLL 1
PLL
fCLK
f21 f21
PLL 2 DLL 2
f31 f32
PLL 3 DLL 3
f41 f41
PLL 4 DLL 4
f11 f12
PLL 1 DLL 1
PLL
fCLK
f21 f21
PLL 2 DLL 2
f31 f32
PLL 3 DLL 3
f41 f41
PLL 4 DLL 4
Energy Minimization Using Energy Minimization Using Multiple FrequencyMultiple Frequency
phasedetector
curentpump
divider by N
up
down
CLKREF
CLKFB
VCO
loopfilter
digital system
regulated voltage
clock distribution ampfrequency multiplier
logic
phasedetector
curentpump
up
down
CLKREF
CLKFBloopfilter
digital system
f1 2f1 nf1
control
VCDL
TVCDL
DC1 DC2 DCn
in out
PLL based
DLL based
Clock GatingClock Gating amp Clock Distribution as amp Clock Distribution as Techniques to Reduce PowerTechniques to Reduce Power
DFF
D
C
Q
enable
clock
gated-clock
target flip-flops
latch
clock
enable
gated-clock
activated deactivated
D C
BA
Enable_BEnable_A
Enable_C
PLL(Clk generator)
Clk
Clock distributionClock gating
- The use of gated clock is the most common approach to reduce energy Unused modules are turned off by suppressing the clock to the module
Energy Minimazation Using Energy Minimazation Using Multiple Supply VoltageMultiple Supply Voltage
bull Multiple supply voltage on the chip as less aggressive approach is attracting attention
bull This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
bull This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management System Level Dynamic Power Management as another Techniques to Reduce Poweras another Techniques to Reduce Power
Dynamic power management is design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
RUN
SLEEPIDLE~90s
~10s 160ms
Wait for interrupt Wait for wake-up event
P=400mW
~90s~10s
P=50mW P=016mW
OBSERVER CONTROLLER
Workloadinformation
Power Manager
SYSTEM
Observations Commands
Power Manager Power State Machine
Power Breakdown in High-Performance Power Breakdown in High-Performance CPU and Dynamic Instruction StatisticsCPU and Dynamic Instruction Statistics
12
16
25
43
Clock
Memory
Control IODatapath
1513
51
23
43
Compare op
Logical op
Others
Data Move
Control Flow
Arithmetic op
Power breakdown Dynamic instruction statistics
Architecture Trade-offs ndash Architecture Trade-offs ndash Reference DatapathReference Datapath
Parallel DatapathParallel Datapath
The More Parallel the BetterThe More Parallel the Better
Pipeline DatapathPipeline Datapath
Architecture Summary for a SimpleArchitecture Summary for a Simple
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Microprocessorsrsquo Microprocessorsrsquo GenerationsGenerations
bullFirst generation 1971-78
bullSecond Generation 1979-85
bullThird Generation 1985-89
bullFourth Generation 1990-
ndashBehind the power curve
ndashBecoming ldquorealrdquo computers
ndashChallenging the ldquoestablishmentrdquo
ndashArchitectural and performance leadership
The microprocessor todayThe microprocessor today When we say ldquomicroprocessorrdquo today we generally mean the shaded area of the figure
The First Generation, 1971–78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8–16 bits
  – Narrow datapaths (= slow performance)
  – Awkward architectures
  – Assembly language + some BASIC
• Processors
  – Intel 4004, 8008, 8080, 8086
  – Zilog Z-80
  – Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose, single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  – 3,500 transistors
  – First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
  – Delivered in 1974
  – 4,800 transistors
  – Performance < 0.2 MIPS
• Used in the Altair 8800 system
  – Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, the first microcomputer bus
  – Gates & Allen write BASIC
  – Wozniak builds one (Homebrew Computer Club)
Intel 8086
• Introduced in 1978
  – Performance < 0.5 MIPS
• New 16-bit architecture
  – "Assembly language" compatible with 8080
  – 29,000 transistors
  – Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
  – Based on the 8088, the 8-bit bus version of the 8086
Second Generation, 1979–85
Becoming "real" computers
  – First 32-bit architecture (68000)
  – First virtual memory support
  – Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  – Motorola 68000, 68020
  – Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  – First 32-bit architecture (initial 16-bit implementation)
  – First flat 32-bit address; support for paging
  – General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
  – 68,000 transistors
  – < 1 MIPS
• Used in
  – Apple Mac
  – Sun, Silicon Graphics & Apollo workstations
Third Generation, 1985–89
Challenging the "establishment"
  – Microprocessors surpass minicomputers in performance and rival mainframes
  – Implementation technology of choice: all new architectures are microprocessors
  – RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  – MIPS R2000, R3000
  – Sun SPARC
  – HP PA-RISC
MIPS R2000
• Several firsts
  – First RISC microprocessor
  – First microprocessor to provide integrated support for instruction & data cache
  – First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  – 125,000 transistors
  – 5–8 MIPS
Fourth Generation, 1990–
Architectural and performance leadership
  – First 64-bit architecture
  – First multiple-issue machine
  – First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  – Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  – Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  – True from 1985 to the present
• Combination of technology and architectural enhancements
  – Technology provides faster transistors and more of them
  – Faster transistors lead to high clock rates
  – More transistors: architectural ideas turn transistors into performance
  – Responsible for about half the yearly performance growth
• Two key architectural directions
  – Sophisticated memory hierarchies
  – Exploiting instruction-level parallelism
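To get a feel for what a sustained yearly performance multiplier implies, the following sketch compounds the growth rate quoted above. Reading "about half the yearly performance growth" as an equal multiplicative split between technology and architecture is an assumption made here for illustration, and the helper names are hypothetical.

```python
import math

ANNUAL_GROWTH = 1.6  # yearly performance multiplier (1.6x per year)


def cumulative_gain(years, rate=ANNUAL_GROWTH):
    """Total performance gain after compounding `rate` for `years` years."""
    return rate ** years


def years_to_reach(target_gain, rate=ANNUAL_GROWTH):
    """Number of years needed to reach `target_gain` at the given rate."""
    return math.log(target_gain) / math.log(rate)


def per_source_factor(rate=ANNUAL_GROWTH):
    """Yearly factor per source, assuming technology and architecture
    contribute equally in the multiplicative sense."""
    return math.sqrt(rate)


if __name__ == "__main__":
    print(f"gain after 10 years: {cumulative_gain(10):.0f}x")
    print(f"years to reach 100x: {years_to_reach(100):.1f}")
    print(f"yearly factor per source: {per_source_factor():.3f}")
```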
Memory Hierarchies
• Caches hide the latency of DRAM and increase bandwidth
  – CPU-DRAM access gap has grown by a factor of 30–50
• Trend 1: Increasingly large caches
  – On-chip: from 128 bytes (1984) to 100K+ bytes
  – Multilevel caches add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128 KB to 4–16 MB
• Trend 2: Advances in caching techniques
  – Reduce or hide cache miss latencies
    Early restart after cache miss (1992)
    Nonblocking caches: continue during a cache miss (1994)
  – Cache-aware combos: computers, compilers, code writers
    Prefetching: instruction to bring data into cache early
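Why an extra cache level helps can be seen from the average memory access time (AMAT), a standard textbook measure that does not appear on the slides. The sketch below is a minimal illustration; all hit times and miss rates in it are hypothetical values chosen for the example.

```python
def amat(levels, memory_time):
    """Average memory access time in cycles.

    levels: list of (hit_time_cycles, miss_rate) tuples, ordered from the
            cache closest to the CPU outwards.
    memory_time: access time of main memory in cycles.
    """
    time = memory_time
    # Fold from the outermost level inwards: each level adds its own hit
    # time plus its miss rate times the cost of going one level further.
    for hit_time, miss_rate in reversed(levels):
        time = hit_time + miss_rate * time
    return time


if __name__ == "__main__":
    # Hypothetical numbers: 2-cycle L1 with 5% misses, 5-cycle L2 with
    # 20% local misses, 50-cycle main memory.
    single_level = amat([(2, 0.05)], memory_time=50)
    two_level = amat([(2, 0.05), (5, 0.20)], memory_time=50)
    print(single_level, two_level)
```

With these assumed numbers the secondary cache cuts the average access time from 4.5 to 2.75 cycles, even though it adds its own hit latency on every L1 miss.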
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  – Overlapping execution in a pipeline
  – Issuing multiple instructions per clock
    Superscalar: uses dynamic issue decision (HW driven)
    VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  – Determine parallelism dynamically
  – Execute instructions out of order
  – Speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded a 20-year path
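The amount of ILP available in a code fragment is bounded by its dependence chains. The sketch below is a simplified illustration, not a model of any real processor: it assumes single-cycle instructions, dependences only on earlier instructions, and a greedy issue policy. `EXAMPLE` is a made-up six-instruction fragment.

```python
def critical_path(deps):
    """Length (in cycles) of the longest dependence chain.

    deps maps an instruction index to the list of earlier instruction
    indices it depends on; every instruction takes one cycle.
    """
    depth = {}
    for i in sorted(deps):
        depth[i] = 1 + max((depth[d] for d in deps[i]), default=0)
    return max(depth.values(), default=0)


def available_ilp(deps):
    """Average number of instructions per cycle with unlimited issue width."""
    return len(deps) / critical_path(deps)


def schedule_cycles(deps, width):
    """Cycles needed when at most `width` ready instructions issue per cycle."""
    done_cycle = {}
    remaining = sorted(deps)
    cycle = 0
    while remaining:
        cycle += 1
        issued = []
        for i in remaining:
            if len(issued) == width:
                break
            # Ready only if every producer issued in an earlier cycle.
            if all(d in done_cycle for d in deps[i]):
                issued.append(i)
        for i in issued:
            done_cycle[i] = cycle
        remaining = [i for i in remaining if i not in done_cycle]
    return cycle


# Hypothetical fragment: 2 depends on 0 and 1, 3 on 2, 5 on 4.
EXAMPLE = {0: [], 1: [], 2: [0, 1], 3: [2], 4: [], 5: [4]}

if __name__ == "__main__":
    print("critical path:", critical_path(EXAMPLE))
    print("available ILP:", available_ilp(EXAMPLE))
    for w in (1, 2, 4):
        print("issue width", w, "->", schedule_cycles(EXAMPLE, w), "cycles")
```

In this fragment a dual-issue machine already reaches the critical-path bound, so widening the issue further brings no gain, which is one way to see why wider issue alone yields diminishing returns.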
MIPS R4000
• First 64-bit architecture
• Integrated caches
  – On-chip
  – Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  – Deep pipeline
  – 1.4M transistors
  – Initially 100 MHz
  – > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  – 2 instructions/clock
  – Dual-issue mode
  – Novel push pipeline
  – Novel cache bypass
• Implemented in 1991
  – 1.3M transistors
  – 50 MIPS
• Used primarily as an attached processor (e.g., graphics)
MIPS R10000
• First speculative processor
  – Instructions scheduled and executed out of order
  – Up to 4 instructions can complete per clock
  – Window of 32 instructions (up to 32 in flight)
  – Maintains precise state by completing instructions in order
• Implemented in 1996
  – 6.8M transistors
  – 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  – Uses a compiler-centric approach while avoiding its disadvantages
  – Parallelism demarcated by the compiler
  – Many special instructions & features for exploiting ILP in the compiler
• Itanium
  – First implementation (2001)
  – 25M transistors
  – 800 MHz
  – 130 watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques
  – Static scheduling
  – Static issue (i.e., VLIW)
  – Static branch prediction
  – Alias/pointer analysis
  – Static speculation
• Hardware techniques
  – Dynamic scheduling
  – Dynamic issue (i.e., superscalar)
  – Dynamic branch prediction
  – Dynamic disambiguation
  – Dynamic speculation
• Wide variety of approaches, both hardware- and compiler-intensive
• Software techniques
  – Lower hardware complexity
  – More longer-range analysis
  – More machine dependence
• Hardware techniques
  – More stable performance
  – Higher complexity
  – Potential clock rate impact
• No clear-cut winners at present
Big Picture – ILP and Memory Systems
[Figure: two "mountains" of progressively harder techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: Performance (log scale, 0.1 to 100) versus year (1965–1995) for supercomputers, mainframes, minicomputers, and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: Transistors per chip (log scale, 1,000 to 100,000,000) versus year (1970–2005), with the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000 marked. Eras indicated: bit-level parallelism, instruction-level parallelism, thread-level parallelism (?).]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's Processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyper-Threading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
They support 2–4 threads, but expect to get only a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model – continued
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
4–8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view on the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education – not just the undergraduate curriculum – organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change – so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. These, too, are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Outline – Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh.
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production.
• For 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69 × 10^6 MWh.
Power consumption
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature. So believed circuit designers." (Kuroda-Sakurai '95)
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: Power (W), log scale from 0.01 to 100, versus year (1980–1995); the trend line is marked x4 every 3 years.]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: Supply voltage [V] (0 to 2.5), power per chip [W], and VDD current [A] versus year, 1998–2014.]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA), and the Taiwan Semiconductor Industry Association (TSIA). (Taken from Sakurai's ISSCC 2001 presentation.)
Power Delivery Problem (not just California)
Your car starter
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits, plus a minor static term, are

$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$

where
P1 – capacitive switching power (dynamic, dominant)
P2 – short-circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
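The relative weight of the terms in the power equation can be explored numerically. The following sketch simply evaluates the four components; every numeric value in the example run is a hypothetical placeholder rather than data from the slides, and the function names are illustrative.

```python
def switching_power(p_t, c_load, v_dd, f_clk):
    """P1: capacitive switching power, p_t * C_L * Vdd^2 * f_CLK (watts)."""
    return p_t * c_load * v_dd ** 2 * f_clk


def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p_static=0.0):
    """Return the individual power terms and their sum, in watts."""
    p1 = switching_power(p_t, c_load, v_dd, f_clk)
    p2 = i_sc * v_dd      # short-circuit power
    p3 = i_leak * v_dd    # leakage power
    p4 = p_static         # remaining static dissipation
    return {"P1": p1, "P2": p2, "P3": p3, "P4": p4,
            "total": p1 + p2 + p3 + p4}


if __name__ == "__main__":
    # Hypothetical operating point: switching probability 0.1, 1 nF of
    # switched capacitance, 2.5 V supply, 100 MHz clock, 2 mA average
    # short-circuit current, 0.1 mA leakage current.
    result = average_power(p_t=0.1, c_load=1e-9, v_dd=2.5, f_clk=100e6,
                           i_sc=2e-3, i_leak=1e-4)
    for name, watts in result.items():
        print(f"{name}: {watts * 1e3:.2f} mW")
```

At this assumed operating point the switching term accounts for more than 90% of the total, consistent with P1 being labeled dominant.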
Research Efforts in Low-Power Design

$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$

Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement.
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed.
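As a worked illustration of the quadratic dependence, consider lowering the supply from 5.0 V to 3.0 V while the switching probability, load capacitance, and clock frequency are assumed to stay unchanged (the two voltages are chosen here only as an example):

```latex
\frac{P_{sw}(3.0\,\mathrm{V})}{P_{sw}(5.0\,\mathrm{V})}
  = \frac{p_t C_L (3.0)^2 f_{CLK}}{p_t C_L (5.0)^2 f_{CLK}}
  = \left(\frac{3.0}{5.0}\right)^{2}
  = 0.36
```

Under these assumptions a 40% reduction in supply voltage removes 64% of the switching power, whereas the same relative reduction in load capacitance or switching activity would remove only 40%.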
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: Normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], with axis marks at 0.6, 3.0, and 5.0 V.]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is to decrease the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product pt CL is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit, or technology level.
General Approaches to Reduce Power
Low Power and Low Energy System Design
[Figure: design levels and their low-power techniques; higher levels have higher impact and offer more options.]
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology.
Multiple Frequency on the Chip as Technique to Reduce Power
Multiple frequencies on the chip, as a less aggressive approach, are attracting more attention.
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL generates fCLK and feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), which produce the local clock frequencies f11, f12, f21, f31, f32, and f41.]
Energy Minimization Using Multiple Frequency
[Figure: two clock-generation schemes driving a digital system. PLL based: phase detector comparing CLKREF and CLKFB, up/down current pump, loop filter, VCO, and divider by N, with regulated voltage and clock distribution & frequency multiplier logic. DLL based: phase detector, up/down current pump, loop filter, and a voltage-controlled delay line (VCDL) of delay cells DC1 to DCn with control, producing f1, 2f1, ..., nf1.]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: Clock gating: a latch driven by enable and clock produces the gated clock for the target flip-flops (DFF with D, C, Q), shown in the activated and deactivated cases. Clock distribution: a PLL (clock generator) drives Clk to blocks A, B, C, and D, gated by Enable_A, Enable_B, and Enable_C.]
- The use of gated clocks is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
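The effect of gating can be sketched with a tiny half-cycle simulation of a latch-based clock gate. This is an illustrative model only: the latch is assumed to be transparent while the clock is low, signals are plain 0/1 integers, and the function names and the enable trace are made up for the example.

```python
def simulate_gated_clock(enable_per_half_cycle):
    """Return the gated clock waveform, one sample per half cycle.

    Even steps are the low clock phase, odd steps the high phase. The
    latch follows `enable` only while the clock is low, so enable changes
    during the high phase cannot glitch the gated clock.
    """
    latched = 0
    gated = []
    for step, enable in enumerate(enable_per_half_cycle):
        clk = step % 2
        if clk == 0:
            latched = enable
        gated.append(clk & latched)
    return gated


def rising_edges(wave):
    """Count 0 -> 1 transitions, starting from an initial level of 0."""
    previous = [0] + wave[:-1]
    return sum(1 for p, c in zip(previous, wave) if p == 0 and c == 1)


if __name__ == "__main__":
    enable = [1, 1, 0, 0, 0, 1, 1, 0]        # four clock cycles
    free_clock = [step % 2 for step in range(len(enable))]
    gated = simulate_gated_clock(enable)
    saved = 1 - rising_edges(gated) / rising_edges(free_clock)
    print("gated clock:", gated)
    print(f"clock edges suppressed: {saved:.0%}")
```

Since every suppressed edge avoids charging the clock load of the target flip-flops, the fraction of suppressed edges gives a first-order estimate of the clock power saved in that module.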
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
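The voltage assignment described above can be sketched as a simple per-module choice: give each module the lowest supply that still meets its timing budget. The voltage levels, the delay multipliers in `DELAY_FACTOR`, and the module data are all hypothetical values invented for this illustration; a real flow would take them from library characterization and timing analysis.

```python
# Hypothetical delay multipliers relative to the highest supply voltage.
DELAY_FACTOR = {3.3: 1.0, 2.5: 1.5, 1.8: 2.4}
V_MAX = max(DELAY_FACTOR)


def pick_voltage(delay_at_vmax, time_budget):
    """Lowest supply voltage at which the module still meets its budget."""
    for v in sorted(DELAY_FACTOR):  # try the lowest voltage first
        if delay_at_vmax * DELAY_FACTOR[v] <= time_budget:
            return v
    raise ValueError("module misses timing even at the highest voltage")


def relative_energy(v):
    """Switching energy relative to operation at V_MAX (scales with V^2)."""
    return (v / V_MAX) ** 2


def assign(modules):
    """modules: name -> (delay at V_MAX, time budget), same time unit."""
    return {name: pick_voltage(delay, budget)
            for name, (delay, budget) in modules.items()}


if __name__ == "__main__":
    # "alu" is on the critical path; "ctrl" and "io" have slack.
    modules = {"alu": (10.0, 10.0), "ctrl": (4.0, 10.0), "io": (6.0, 10.0)}
    for name, v in assign(modules).items():
        print(f"{name}: {v} V, relative energy {relative_energy(v):.2f}")
```

The critical-path module keeps the highest supply, while the modules with slack drop to lower supplies and save energy quadratically. Level shifters between voltage domains are ignored in this sketch.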
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
RUN
SLEEPIDLE~90s
~10s 160ms
Wait for interrupt Wait for wake-up event
P=400mW
~90s~10s
P=50mW P=016mW
OBSERVER CONTROLLER
Workloadinformation
Power Manager
SYSTEM
Observations Commands
Power Manager Power State Machine
Power Breakdown in High-Performance Power Breakdown in High-Performance CPU and Dynamic Instruction StatisticsCPU and Dynamic Instruction Statistics
12
16
25
43
Clock
Memory
Control IODatapath
1513
51
23
43
Compare op
Logical op
Others
Data Move
Control Flow
Arithmetic op
Power breakdown Dynamic instruction statistics
Architecture Trade-offs ndash Architecture Trade-offs ndash Reference DatapathReference Datapath
Parallel DatapathParallel Datapath
The More Parallel the BetterThe More Parallel the Better
Pipeline DatapathPipeline Datapath
Architecture Summary for a SimpleArchitecture Summary for a Simple
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Microprocessorsrsquo Microprocessorsrsquo GenerationsGenerations
bullFirst generation 1971-78
bullSecond Generation 1979-85
bullThird Generation 1985-89
bullFourth Generation 1990-
ndashBehind the power curve
ndashBecoming ldquorealrdquo computers
ndashChallenging the ldquoestablishmentrdquo
ndashArchitectural and performance leadership
The microprocessor todayThe microprocessor today When we say ldquomicroprocessorrdquo today we generally mean the shaded area of the figure
The First Generation 1971-78The First Generation 1971-78
Getting enough bits and transistors Transistor counts lt 50000 Performance lt 05 MIPS Architecture 8-16 bits
ndash Narrow datapaths (= slow performance)ndash Awkward architecturesndash Assembly language + some BASIC
Processorsndash Intel 4004 8008 8080 8086ndash Zilog Z-80ndash Motorola 6800 6502
Intel 4004Intel 4004 First general-purpose
single-chip microprocessor Shipped in 1971 8-bit architecture 4-bit
implementation 2300 transistors Performance lt 01 MIPS 8008 8-bit implementation
in 1972ndash 3500 transistorsndash First microprocessor-based
computer (Micral) Targeted at laboratory
instrumentation Mostly sold in Europe
Intel 8080Intel 8080 Intelrsquos first 16-bit architecture
ndash Delivered in 1974ndash 4800 transistorsndash Performance lt 02 MIPS
Used in Altair 8800 system ndash Kit form (advertised in Popular
Electronics) in 1975 $297 or $395 with case 256 bytes of memory
expandable to 64K Keyboard and floppy 100-line bus becomes S-100
first microcomputer busndash Gates amp Allen write BASICndash Wozniak builds one
Homebrew Computer Club
Intel 8086Intel 8086
Introduced in 1978ndash Performance lt 05 MIPS
New 16-bit architecturendash ldquoAssembly languagerdquo
compatible with 8080ndash 29000 transistorsndash Includes memory protection
support for FP coprocessor In 1981 IBM introduces
PC ndash Based on 8088--8-bit bus
version of 8086
Second Generation 1979-85Second Generation 1979-85 Becoming ldquorealrdquo computers
ndash First 32-bit architecture (68000)ndash First virtual memory supportndash Workstations Macs and PCs based on microprocessors
Transistors gt50000 Performance lt= 1 MIPS Processors
ndash Motorola 68000 68020ndash Intel 80286 80386
Motorola 68000Motorola 68000 Major architectural step in
microprocessorsndash First 32-bit architecture
initial 16-bit implementation
ndash First flat 32-bit address Support for paging
ndash General-purpose register architecture
Loosely based on PDP-11
First implementation in 1979ndash 68000 transistorsndash lt 1 MIPS
Used inndash Apple Macndash Sun Silicon Graphics amp Apollo
workstations
Third Generation 1985-89Third Generation 1985-89 Challenging the ldquoestablishmentrdquo
ndash Microprocessors surpass minicomputers in performance rival mainframes
ndash Implementation technology of choice all new architectures are microprocessors
ndash RISC architecture techniques take hold Transistors lt 500K Performance gt 5 MIPS Processors
ndash MIPS R2000 R3000ndash Sun SPARCndash HP PA-RISC
MIPS R2000MIPS R2000 Several firsts
ndash First RISC microprocessor
ndash First microprocessor to provide integrated support for instruction amp data cache
ndash First pipelined microprocessor (sustains 1 instructionclock)
Implemented in 1985ndash 125000 transistorsndash 5-8 MIPS
Fourth Generation 1990-Fourth Generation 1990- Architectural and performance leadership
ndash First 64-bit architecturendash First multiple-issue machinendash First multilevel caches
Transistors gt1M Clock ratesgt 100MHz Performance gt 50 MIPS Processors
ndash Intel i860 Pentium MIPS R4000 MIPS R1000 DEC Alpha Sun UltraSPARC HP PA-RISC PowerPC
Generation 45 ndash same basic approach but faster clock rates amp wider issuendash Alpha 21264 Pentium III amp 4 Intel Itanium
Key Architectural TrendsKey Architectural Trends Increase performance at 16x per year
ndash True from 1985-present Combination of technology and architectural
enhancementsndash Technology provides faster transistors and more of themndash Faster transistors leads to high clock ratesndash More transistors
Architectural ideas turn transistors into performancendash Responsible for about half the yearly performance growth
Two key architectural directionsndash Sophisticated memory hierarchiesndash Exploiting instruction level parallelism
Memory HierarchiesMemory Hierarchies Caches hide latency of DRAM and increase BW
ndash CPU-DRAM access gap has grown by a factor of 30-50 Trend 1 Increasingly large caches
ndash On-chip from 128 bytes (1984) to 100K+ bytesndash Multilevel caches add another level of caching
First multilevel cache1986 Secondary cache sizes today 128KB to 4-16 MB
Trend 2 Advances in caching techniquesndash Reduce or hide cache miss latencies
early restart after cache miss (1992) nonblocking caches continue during a cache miss (1994)
ndash Cache aware combos computers compilers code writers prefetching instruction to bring data into cache early
Exploiting ILPExploiting ILP ILP is the implicit parallelism among instructions Exploited by
ndash Overlapping execution in a pipeline ndash Issuing multiple instruction per clock
superscalar uses dynamic issue decision (HW driven) VLIW uses static issue decision (SW driven)
1985 simple microprocessor pipeline (1 instrclock) 1990 first static multiple issue microprocessors 1995 sophisticated dynamic schemes
ndash determine parallelism dynamicallyndash execute instructions out-of-orderndash speculative execution depending on branch prediction
ldquoOff-the-shelfrdquo ILP techniques yielded 20 year path
MIPS R4000MIPS R4000 First 64-bit architecture Integrated caches
ndash On-chipndash Support for off-chip secondary
cache Integrated floating point Implemented in 1991
ndash Deep pipelinendash 14M transistorsndash Initially 100MHzndash gt 50 MIPS
Intel i860Intel i860
First multiple issue microprocessor ndash 2 instructionsclockndash Dual issue mode ndash Novel push pipelinendash Novel cache bypass
Implemented in 1991ndash 13M transistorsndash 50 mips
Used primarily as attached processor (eg graphics)
MIPS R10000MIPS R10000
First speculative processorndash Instruction scheduled and
executed out-of-orderndash Up to 4 instructions can
complete per clockndash Window of 32 instructions
(up to 32 in-flight)ndash Maintain precise state by
completing instructions in order
Implemented in 1996ndash 68M transistorsndash 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Use compiler-centric approach while avoiding disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software Techniques
  - Static scheduling
  - Static issue (i.e., VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
• Hardware Techniques
  - Dynamic scheduling
  - Dynamic issue (i.e., superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
• Wide variety of approaches, both hardware and compiler intensive
  - Software techniques: lower hardware complexity, more longer-range analysis, more machine dependence
  - Hardware techniques: more stable performance, higher complexity, potential clock rate impact
• No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers, and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970-2005, across the bit-level, instruction-level, and thread-level (?) parallelism eras. Data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000.]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP)
Multithreading
• A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
• It is hard to achieve this level of parallelism from a single program
• Can we run multiple programs (threads) on a single processor without much effort?
• Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
• Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4)
• Support for 2-4 threads, but expect to get only a 1.3X improvement in throughput
Chip Multiprocessor
• Several processor cores on one die
• Shared L2 caches
• Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
• CMP is a simple, very powerful technique for obtaining more performance in a power-efficient manner
• The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
• The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model (continued)
• CMP is an attractive option to use when moving to a new process technology such as SoC
• Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
• MPSoCs are usually implemented as heterogeneous systems
• CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa-2010 Microprocessor
• 4-8 general-purpose processing engines on chip, used to execute
  - Independent programs
  - Explicitly parallel programs (when possible)
  - Speculatively parallel threads
• Special-purpose processing units (e.g., DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures

Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view concerning the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we need now: some examples
• Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
• Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
• Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental.
• Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. These, too, are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books, and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education
Outline: Low Power Design
• Power trends in VLSI
• View Point on Power
• Research Efforts in Low Power Design
• Is there an Optimal Design Point
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60·10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production
• By 2015, one expects that the energy consumption of all PCs will be 15% greater than in 1995, or 69·10^6 MWh
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
"CMOS circuits dissipate little power by nature." So believed circuit designers (Kuroda-Sakurai, '95).
"By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced."
Power dissipation in time
[Figure: power dissipation (W, log scale from 0.01 to 100) versus year, 1980-1995; trend annotation: x4 every 3 years.]
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage (V, 0 to 2.5), power per chip (W, 0 to 200), and VDD current (A, 0 to 500) versus year, 1998-2014.]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA), and the Taiwan Semiconductor Industry Association (TSIA). (Taken from Sakurai's ISSCC 2001 presentation.)
Power Delivery Problem (not just California)
[Figure: comparison with the current drawn by your car starter.]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are

$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$

where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
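To get a feel for the relative size of these terms, the components can be evaluated numerically, taking P1 = p_t·C_L·Vdd²·f_clk, P2 = I_sc·Vdd, and P3 = I_leakage·Vdd. The sketch below is illustrative only: the function name and every parameter value are assumptions, not measurements from the slides.

```python
def cmos_power(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p_static=0.0):
    """Return (P1, P2, P3, P4, total) in watts.

    p_t      -- switching activity factor (probability of a power-consuming transition)
    c_load   -- load capacitance in farads
    v_dd     -- supply voltage in volts
    f_clk    -- clock frequency in hertz
    i_sc     -- average short-circuit current in amperes
    i_leak   -- leakage current in amperes
    p_static -- remaining static dissipation in watts (the minor P4 term)
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leak * v_dd                      # leakage power
    return p1, p2, p3, p_static, p1 + p2 + p3 + p_static


# Hypothetical operating point (all values assumed for illustration).
p1, p2, p3, p4, total = cmos_power(p_t=0.2, c_load=1e-9, v_dd=2.5,
                                   f_clk=100e6, i_sc=5e-3, i_leak=1e-4)
print(f"P1={p1:.5f} W  P2={p2:.5f} W  P3={p3:.5f} W  total={total:.5f} W")
print(f"switching share: {p1 / total:.1%}")
```

At this assumed operating point the switching term accounts for roughly nine tenths of the total, which is consistent with calling it the dominant component.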
Research Efforts in Low-Power Design

$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$

• Reduce switching activity
  - Conditional clock
  - Conditional precharge
  - Switching off inactive blocks
  - Conditional execution
• Run it slower
  - Use parallelism
  - Fewer pipeline stages
  - Use double-edge flip-flops
• Technology scaling
  - The highest win
  - Thresholds should scale
  - Leakage starts to bite
  - Dynamic voltage scaling
• Reduce the active load
  - Minimize the circuits
  - Use more efficient design
  - Charge recycling
  - More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
  - Reducing the supply voltage brings a quadratic improvement
  - Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
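As a worked illustration of the quadratic dependence, consider lowering the supply from 5 V to 3.3 V while holding switching activity, load capacitance, and clock frequency fixed. The two voltages are assumed example values, not taken from the slides.

```latex
\frac{P_{sw}(3.3\,\mathrm{V})}{P_{sw}(5\,\mathrm{V})}
  = \frac{p_t C_L (3.3)^2 f_{clk}}{p_t C_L (5)^2 f_{clk}}
  = \left(\frac{3.3}{5}\right)^2
  \approx 0.44
```

A 34% reduction in supply voltage therefore removes about 56% of the switching power, before accounting for any change in circuit delay.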
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V), over roughly 0.6 V to 5.0 V.]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is: decrease the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or have low activity during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product pt·CL is called the average switched capacitance per cycle. It can be reduced at the system, architectural, RTL, circuit, or technology level.
Low Power and Low Energy System Design
Levels are listed from highest to lowest; higher levels offer higher impact and more options.
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology.
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL distributes fCLK to four local clock generators (PLL 1/DLL 1 to PLL 4/DLL 4), each producing its own local clock frequencies (labeled f11, f12, f21, f31, f32, f41 in the figure).]
Energy Minimization Using Multiple Frequency
[Figure: two clock generator structures. PLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO, divider by N, regulated voltage, and clock distribution & frequency multiplier feeding the logic of the digital system. DLL based: phase detector (CLKREF, CLKFB, up/down), current pump, loop filter, control, and a voltage-controlled delay line (VCDL, delay TVCDL) built from delay cells DC1, DC2, ..., DCn, producing f1, 2f1, ..., nf1 for the digital system.]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating, in which a latch and gate combine clock and enable into a gated clock that drives the target flip-flops (DFF with D, C, Q). Clock distribution: a PLL (clock generator) drives Clk to modules A, B, C, and D through Enable_A, Enable_B, and Enable_C, so that modules are individually activated or deactivated.]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
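The effect of suppressing the clock to an idle module can be approximated by counting how many clock cycles actually reach it. The following is a rough sketch under a stated assumption: clock-related energy in the module is taken as proportional to the number of delivered clock cycles, and the overhead of the gating logic itself is ignored. The function names and the enable trace are hypothetical.

```python
def clock_cycles_delivered(enable_trace, gated=True):
    """Count clock cycles that reach a module.

    enable_trace -- one boolean per clock cycle, True when the module has work.
    gated        -- if False, the free-running clock reaches the module every cycle.
    """
    if not gated:
        return len(enable_trace)
    return sum(1 for enabled in enable_trace if enabled)


def relative_clock_energy(enable_trace):
    """Clock energy with gating, as a fraction of the ungated case."""
    gated = clock_cycles_delivered(enable_trace, gated=True)
    ungated = clock_cycles_delivered(enable_trace, gated=False)
    return gated / ungated


# Hypothetical activity pattern: the module is busy in 3 of 8 cycles.
trace = [True, True, False, False, False, True, False, False]
ratio = relative_clock_energy(trace)
print(f"clock energy with gating: {ratio:.1%} of the ungated case")
```

Under this simple model the saving equals the idle fraction of the module, which is why gating pays off most for blocks that are used only occasionally.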
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip is a less aggressive approach that is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
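The assignment of voltages to critical and noncritical modules can be sketched as a simple slack check. This is a toy model, not a design flow: the module names, delays, voltage levels, and the fixed slowdown factor at the lower voltage are all assumptions, and switched capacitance is taken as equal across modules so that energy scales with the square of the supply voltage.

```python
def assign_voltages(modules, v_high, v_low, slowdown_at_low):
    """Give each module the low voltage only if it still meets its timing budget.

    modules         -- dict: name -> (delay_at_v_high, allowed_delay)
    slowdown_at_low -- assumed factor by which delay grows at v_low
    """
    plan = {}
    for name, (delay, allowed) in modules.items():
        fits_at_low = delay * slowdown_at_low <= allowed
        plan[name] = v_low if fits_at_low else v_high
    return plan


def relative_switching_energy(plan, v_high):
    """Mean of (V / v_high)^2 over modules, assuming equal switched capacitance."""
    return sum((v / v_high) ** 2 for v in plan.values()) / len(plan)


# Hypothetical design: the ALU sits on the critical path, the others have slack.
modules = {
    "alu": (10.0, 10.0),      # no slack: must stay at the high voltage
    "decoder": (4.0, 10.0),   # 8.0 at low voltage still fits in 10.0
    "io": (2.0, 10.0),        # 4.0 at low voltage still fits in 10.0
}
plan = assign_voltages(modules, v_high=3.3, v_low=1.8, slowdown_at_low=2.0)
energy = relative_switching_energy(plan, v_high=3.3)
print(plan)
print(f"switching energy relative to single-supply design: {energy:.2f}")
```

In this example timing is unchanged because the critical module keeps the high supply, while the overall switching energy drops to a little over half of the single-supply case.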
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure: power manager, in which an observer collects workload information (observations) from the system and a controller issues commands to it.]
[Figure: power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt), and SLEEP (P = 0.16 mW, wait for wake-up event); transitions take about 10 µs between RUN and IDLE, about 90 µs into SLEEP, and 160 ms from SLEEP back to RUN.]
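A power manager built around such a state machine can be sketched in a few lines. The state powers (400 mW, 50 mW, 0.16 mW) and the 160 ms wake-up time are read from the figure labels, with the decimal point in 0.16 mW being an assumption; the policy shown (sleep only when the expected idle period exceeds the wake-up latency) and the neglect of transition energy are simplifying assumptions made for illustration.

```python
# Power drawn in each state, in milliwatts (read from the figure labels).
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}

# Time needed to return from SLEEP to RUN, in seconds (160 ms in the figure).
WAKEUP_FROM_SLEEP_S = 0.160


def choose_idle_state(expected_idle_s):
    """Controller policy (assumed): use SLEEP only if the idle period outlasts the wake-up."""
    return "SLEEP" if expected_idle_s > WAKEUP_FROM_SLEEP_S else "IDLE"


def energy_mj(schedule):
    """Energy in millijoules for a list of (state, seconds); transition energy ignored."""
    return sum(POWER_MW[state] * seconds for state, seconds in schedule)


# Hypothetical workload: 1 s of work followed by 9 s with nothing to do.
always_on = energy_mj([("RUN", 1.0), ("RUN", 9.0)])
managed = energy_mj([("RUN", 1.0), (choose_idle_state(9.0), 9.0)])

print(f"always running: {always_on:.2f} mJ")
print(f"power managed:  {managed:.2f} mJ")
print(f"short pause of 50 ms -> {choose_idle_state(0.05)}")
```

For long idle periods almost all of the energy is spent in the active second; for pauses shorter than the wake-up time the controller stays in IDLE so that the 160 ms penalty is never paid.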
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown by component: clock, memory, control, I/O, datapath. Dynamic instruction statistics by type: arithmetic op, data move, control flow, compare op, logical op, others.]
Architecture Trade-offs: Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Microprocessors' Generations
• First Generation, 1971-78
  - Behind the power curve
• Second Generation, 1979-85
  - Becoming "real" computers
• Third Generation, 1985-89
  - Challenging the "establishment"
• Fourth Generation, 1990-
  - Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation: 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3,500 transistors
  - First microprocessor-based computer (Micral)
    - Targeted at laboratory instrumentation
    - Mostly sold in Europe
Intel 8080
• 8-bit architecture with 16-bit addressing
  - Delivered in 1974
  - 4,800 transistors
  - Performance < 0.2 MIPS
• Used in Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    - $297, or $395 with case
    - 256 bytes of memory, expandable to 64K
    - Keyboard and floppy
    - 100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
    - Homebrew Computer Club
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981, IBM introduces PC
  - Based on 8088: 8-bit bus version of 8086
Second Generation: 1979-85
• Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture
    - Initial 16-bit implementation
  - First flat 32-bit address
    - Support for paging
  - General-purpose register architecture
    - Loosely based on PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation: 1985-89
• Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation: 1990-
• Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985 to the present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors: architectural ideas turn transistors into performance
    - Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we What kinds of fundamentals we need now ndash some examplesneed now ndash some examples
Information technology (IT) will be embedded in virtually engineered product and process in the future ndash ie the design space for all engineers will include ITDiscrete mathematics not continuous math is the underpinning of ITIt is a new fundamental
Biological materials and process are a bit behind IT in their impact on engineering but they a closing fastThus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of FundamentalsKinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fieldsMore knowledge of the full spectrum will be the fundamental
Engineering is global and is performed in a holistic business contextThe engineer must design under constraints that include global cultural and business contexts and so must understand themThey two are new fundamentals
How to add these new fundamentalsHow to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of What will the character and essence of electrical and computer engineering electrical and computer engineering
education look like in the future education look like in the future
It is difficult to predict the future with any accuracy but it is safe to say that
Web-based teachingdistance learningelectronic books and
interactive learning environments
will play increasingly significant roles in shaping what we teach how we teach and how students learn
A sort of challenge we should acceptA sort of challenge we should accept
During one visit at our faculty Prof Krishna Shenai from University of Illinois of Chicago director of Micro Systems Research Center says to us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Power consumption
• During 1995, the energy consumption of all PCs installed in the USA was 60 × 10^6 MWh
• During 2000, the energy consumption of all PCs installed in the USA was 10% of the total energy production
• For 2015, it is expected that the energy consumption of all PCs will be 15% greater than in 1995, i.e. 69 × 10^6 MWh
Typical Low-Power Applications
• battery-operated equipment
• mobile communication equipment
• wireless communication equipment
• instrumentation
• consumer electronics
• biomedical technologies
• industry
• process controls
“CMOS circuits dissipate little power by nature. So believed circuit designers.” (Kuroda-Sakurai, 95)
“By the year 2000, power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced.”
Power dissipation in time
[Figure: power (W), on a logarithmic axis from 0.01 to 100, versus year (80 to 95); the trend is annotated "x4 / 3 years"]
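The "x4 / 3 years" annotation can be turned into an annual growth factor. The short calculation below is only a sketch and assumes the trend is a pure exponential over the plotted 1980 to 1995 range:

$$g = 4^{1/3} \approx 1.59 \ \text{(about 59\% per year)}, \qquad \frac{P(1995)}{P(1980)} = 4^{15/3} = 4^{5} = 1024$$

In other words, under this assumption the power dissipation grows by roughly three orders of magnitude in 15 years.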
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: voltage (V), power per chip (W) and VDD current (A) versus year, 1998 to 2014; curves labelled Current, Power and Voltage]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA), Electronic Industries Association of Japan (EIAJ), Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
Your car starter
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are

$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$

where
P1 - capacitive switching power (dynamic, dominant)
P2 - short circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
Research Efforts in Low-Power Design

$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$

Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
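The quadratic effect of the supply voltage can be checked numerically. The following is a minimal sketch of the switching-power term P_sw = p_t C_L V_dd^2 f_CLK; the operating point (activity factor, capacitance, voltage, frequency) is hypothetical and chosen only for illustration.

```python
def switching_power(p_t, c_load, v_dd, f_clk):
    """Capacitive switching power in watts: P_sw = p_t * C_L * V_dd^2 * f_CLK."""
    return p_t * c_load * v_dd ** 2 * f_clk


# Hypothetical operating point (not taken from the slides).
P_T = 0.2          # switching activity factor
C_LOAD = 50e-12    # switched load capacitance in farads
V_DD = 3.3         # supply voltage in volts
F_CLK = 100e6      # clock frequency in hertz

base = switching_power(P_T, C_LOAD, V_DD, F_CLK)
half_voltage = switching_power(P_T, C_LOAD, V_DD / 2, F_CLK)
half_load = switching_power(P_T, C_LOAD / 2, V_DD, F_CLK)
half_activity = switching_power(P_T / 2, C_LOAD, V_DD, F_CLK)

# Halving V_dd cuts power to one quarter; halving C_L or p_t only halves it.
print(half_voltage / base, half_load / base, half_activity / base)
```

The sketch covers only the dynamic switching term; short-circuit and leakage power are not modelled.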
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay (ns) and normalized power dissipation (W) versus supply voltage (V)]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is to decrease the activity of some parts within the VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product pt CL is called the average switched capacitance per cycle. It can be reduced at the system, architectural, RTL, circuit or technology level
Low Power and Low Energy System Design
The design of low power circuits can be tackled at different levels, from system to technology (higher levels: higher impact, more options)
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating …
[Figure: a central PLL generates fCLK and feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), each producing its own local clock frequencies (f11, f12, ...)]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation (phase detector, current pump, loop filter, VCO, divider by N; signals CLKREF, CLKFB, up, down; regulated voltage; clock distribution & frequency multiplier feeding the logic of the digital system)]
[Figure: DLL-based clock generation (phase detector, current pump, loop filter and control driving a VCDL built from delay cells DC1, DC2, ... DCn; signals CLKREF, CLKFB, up, down; outputs f1, 2f1, ... nf1 to the digital system)]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating: a latch samples the enable signal and the resulting gated clock drives the target flip-flops (DFF with inputs D and C, output Q); waveforms of clock, enable and gated-clock show activated and deactivated intervals]
[Figure: clock distribution: a PLL (clock generator) distributes Clk to modules A, B, C and D under control of Enable_A, Enable_B and Enable_C]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
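How a gated clock suppresses switching in an unused module can be illustrated with a small behavioural model. This is only a sketch, assuming the common latch-based gate (enable sampled while the clock is low, gated clock = clock AND latched enable); the function names and the sample waveforms are hypothetical.

```python
def gated_clock(clock, enable):
    """Behavioural model of a latch-based clock gate.

    The enable is captured by a latch that is transparent while the clock
    is low, so changes of enable during the high phase cannot cut a pulse.
    """
    latched = 0
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en          # latch is transparent in the low phase
        out.append(clk & latched)
    return out


def rising_edges(signal):
    """Number of 0 -> 1 transitions, i.e. clock edges seen by the flip-flops."""
    return sum(1 for a, b in zip(signal, signal[1:]) if a == 0 and b == 1)


# Eight clock cycles; the module is needed only in the first and last two.
clock = [0, 1] * 8
enable = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
gated = gated_clock(clock, enable)

print(rising_edges(clock), rising_edges(gated))
```

With these waveforms the gated module receives half of the clock edges, so the clock-related switching activity in that module is halved while it is idle.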
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip is a less aggressive approach that is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
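The saving from a second, lower supply can be estimated with the usual C·V² switching-energy model. The sketch below assumes a hypothetical set of modules, hypothetical capacitances and two hypothetical voltage levels; the cost of level shifters between voltage domains is ignored.

```python
def switching_energy(c_switched, v_dd):
    """Energy per switching event in joules: E = C * V_dd^2."""
    return c_switched * v_dd ** 2


def total_energy(modules, v_high, v_low):
    """Critical-path modules use v_high, all other modules use v_low."""
    return sum(
        switching_energy(m["c"], v_high if m["critical"] else v_low)
        for m in modules
    )


# Hypothetical design: one critical-path module, two non-critical ones.
modules = [
    {"name": "multiplier", "c": 40e-12, "critical": True},
    {"name": "control", "c": 25e-12, "critical": False},
    {"name": "io_buffer", "c": 35e-12, "critical": False},
]

single_supply = total_energy(modules, v_high=3.3, v_low=3.3)
dual_supply = total_energy(modules, v_high=3.3, v_low=2.0)

print(f"dual / single supply energy: {dual_supply / single_supply:.2f}")
```

Timing is untouched on the critical path because it keeps the high voltage; the saving comes entirely from the modules that have slack.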
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
[Figure: Power Manager: an observer and a controller use workload information to exchange observations and commands with the system]
[Figure: Power State Machine: states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt) and SLEEP (P = 0.16 mW, wait for wake-up event); transition times of ~10 µs, ~90 µs and 160 ms]
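The benefit of the power state machine can be estimated from the state powers shown in the figure. The sketch below uses those three values (400 mW, 50 mW, 0.16 mW); the time spent in each state is a hypothetical workload, and the time and energy cost of the transitions are deliberately ignored.

```python
# State powers in milliwatts, as given in the power state machine figure.
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(seconds_in_state):
    """Time-weighted average power for a given residency in each state."""
    total_time = sum(seconds_in_state.values())
    energy_mj = sum(POWER_MW[state] * t for state, t in seconds_in_state.items())
    return energy_mj / total_time


# Hypothetical workload over 100 s (assumption, not from the slides).
always_running = average_power_mw({"RUN": 100.0})
managed = average_power_mw({"RUN": 20.0, "IDLE": 30.0, "SLEEP": 50.0})

print(f"always running: {always_running:.2f} mW, managed: {managed:.2f} mW")
```

A real power manager also has to account for the wake-up latency from SLEEP, which is orders of magnitude longer than the other transitions, before deciding that a predicted idle period is long enough to sleep.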
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: pie chart "Power breakdown" with sectors Clock, Memory, Control, I/O and Datapath (values shown: 12, 16, 25, 43)]
[Figure: pie chart "Dynamic instruction statistics" with sectors Compare op, Logical op, Others, Data Move, Control Flow and Arithmetic op (values shown: 15, 13, 5, 1, 23, 43)]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First Generation 1971-78
  - Behind the power curve
• Second Generation 1979-85
  - Becoming “real” computers
• Third Generation 1985-89
  - Challenging the “establishment”
• Fourth Generation 1990-
  - Architectural and performance leadership
The microprocessor today
When we say “microprocessor” today, we generally mean the shaded area of the figure
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3500 transistors
  - First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• Intel's first architecture with 16-bit addressing (8-bit data)
  - Delivered in 1974
  - 4800 transistors
  - Performance < 0.2 MIPS
• Used in Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
• Homebrew Computer Club
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - “Assembly language” compatible with 8080
  - 29000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
  - Based on the 8088, the 8-bit bus version of the 8086
Second Generation 1979-85
• Becoming “real” computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture
    initial 16-bit implementation
  - First flat 32-bit address
    Support for paging
  - General-purpose register architecture
    Loosely based on PDP-11
• First implementation in 1979
  - 68000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the “establishment”
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125000 transistors
  - 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985 to the present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors: architectural ideas turn transistors into performance
    Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction level parallelism
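The weight of a sustained 1.6x yearly improvement is easier to appreciate as a compounded figure. The sketch below assumes a constant rate over the whole period; the 15-year span is an illustrative choice, not a number from the slides.

```python
import math


def cumulative_speedup(rate_per_year, years):
    """Total performance gain after compounding a yearly rate."""
    return rate_per_year ** years


def doubling_time_years(rate_per_year):
    """Years needed for performance to double at a constant yearly rate."""
    return math.log(2) / math.log(rate_per_year)


RATE = 1.6   # yearly performance growth quoted on the slide
YEARS = 15   # illustrative span (assumption)

speedup = cumulative_speedup(RATE, YEARS)
doubling = doubling_time_years(RATE)

print(f"{speedup:.0f}x after {YEARS} years, doubling every {doubling:.2f} years")
```

At this rate performance doubles roughly every year and a half, which compounds to about three orders of magnitude over 15 years.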
Memory Hierarchies
• Caches hide the latency of DRAM and increase bandwidth
  - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
    early restart after cache miss (1992)
    nonblocking caches: continue during a cache miss (1994)
  - Cache-aware combos: computers, compilers, code writers
    prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock
    superscalar: uses dynamic issue decision (HW driven)
    VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple issue microprocessors
• 1995: sophisticated dynamic schemes
  - determine parallelism dynamically
  - execute instructions out-of-order
  - speculative execution depending on branch prediction
• “Off-the-shelf” ILP techniques yielded a 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  - On-chip
  - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
• First multiple issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out-of-order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in-flight)
  - Maintains precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Use compiler-centric approach while avoiding disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques
  - Static scheduling
  - Static issue (i.e. VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
  Lower hardware complexity, more longer-range analysis, more machine dependence
• Hardware techniques
  - Dynamic scheduling
  - Dynamic issue (i.e. superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
  More stable performance, higher complexity, potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: "ILP Mountain" with steps Simple pipelining, Scheduled pipelines, Multiple issue, Dynamic scheduling, Speculation]
[Figure: "Cache Mountain" with steps Simple caches, Multilevel caches & buffers, Critical word & early restart, Compiler prefetching, Multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
Microprocessors today: where they are and what they can do
[Figure: performance (logarithmic axis, 0.1 to 100) versus year, 1965 to 1995, for Supercomputers, Mainframes, Minicomputers and Microprocessors]
Microprocessors: where they go
[Figure: transistors per chip (logarithmic axis, 1000 to 100000000) versus year, 1970 to 2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000; the eras are labelled Bit-level parallelism, Instruction-level and Thread-level (?)]
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a (single) processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
They support 2-4 threads, but one can expect only about a 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
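Why several slower cores can be more power-efficient than one fast core follows from the switching-power relation P ∝ C·V²·f. The sketch below is an idealized estimate only: it assumes the workload parallelizes perfectly across the cores and that the supply voltage can be lowered in proportion to the clock frequency, which real processes allow only over a limited range. The function name is hypothetical.

```python
def relative_cmp_power(n_cores):
    """Dynamic power of an n-core CMP relative to one full-speed core,
    at equal aggregate throughput (idealized model)."""
    f_scale = 1.0 / n_cores   # each core runs at f / n for the same throughput
    v_scale = f_scale         # ASSUMPTION: V_dd scales linearly with f
    # n cores, each dissipating C * (v_scale * V)^2 * (f_scale * f)
    return n_cores * v_scale ** 2 * f_scale


for n in (1, 2, 4, 8):
    print(n, relative_cmp_power(n))
```

Under these assumptions the relative power falls as 1/n². Leakage, the lower bound on supply voltage and imperfect parallel speedup all reduce the gain in practice.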
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
• 4-8 general-purpose processing engines on chip, used to execute independent programs
• Explicitly parallel programs (when possible)
• Speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say “microprocessor” tomorrow, we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that “Where you stand depends on where you sit”
In this context, starting from our positions and experiences, this is our view on the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
But, as we said earlier, engineering is changing
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These, too, are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
bullbattery operated equipmentsbullmobile communication equipmentsbullwireless communication equipmentsbullinstrumentation bullconsumer electronics bullbiomedical technologies bullindustry bullprocess controls
Typical Low-Power Applications
ldquoCMOS Circuits dissipate little power by nature So believed circuit designersrdquo(Kuroda-Sakurai 95)
ldquoBy the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages even if the supply voltage can be feasibly reducedrdquo
95908580001
01
1
10
100P
ow
er (
W)
x4 3years
Power dissipation in timePower dissipation in time
Gloom and Doom predictionsGloom and Doom predictions
Power density will increasePower density will increase
Year
Vo
ltag
e [V
]
Po
wer
per
ch
ip [
W]
VD
D c
urr
ent
[A]
VDD Power and Current TrendVDD Power and Current Trend
1998 2002 2006 2010 20140
05
1
15
2
25
0 0
200 500
Current
Power
Voltage
International Technology Roadmap for Semiconductors 1999 update sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA) Electronic Industries Association of Japan (EIAJ) Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
( Taken from Sakurairsquos ISSCC 2001 presentation)
Power Delivery Problem (not just Power Delivery Problem (not just California)California)
Your carstarter
Power Consumption New Power Consumption New Dimension in DesignDimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits, plus a minor static term, are
$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 – capacitive switching power (dynamic, dominant)
P2 – short-circuit power (dynamic)
P3 – leakage current power (static)
P4 – static power dissipation (minor)
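
To make the relative weight of the four terms concrete, the following is a minimal sketch that evaluates each term separately. All numeric values are hypothetical and chosen only for illustration; the function and parameter names are not from the presentation.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p_static=0.0):
    """Return (P1, P2, P3, P4, total) in watts.

    p_t      -- switching activity factor (probability of a transition per cycle)
    c_load   -- switched load capacitance C_L in farads
    v_dd     -- supply voltage in volts
    f_clk    -- clock frequency in hertz
    i_sc     -- average short-circuit current in amperes
    i_leak   -- leakage current in amperes
    p_static -- any remaining static dissipation (P4) in watts
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leak * v_dd                      # leakage power
    p4 = p_static                           # minor static term
    return p1, p2, p3, p4, p1 + p2 + p3 + p4


# Hypothetical operating point: 1 nF switched capacitance, 2 V, 100 MHz.
parts = average_power(p_t=0.2, c_load=1e-9, v_dd=2.0, f_clk=100e6,
                      i_sc=5e-3, i_leak=1e-4)
for name, value in zip(("P1", "P2", "P3", "P4", "total"), parts):
    print(f"{name}: {value * 1e3:.2f} mW")
```

With these assumed numbers the switching term P1 dominates, which is the situation the slide describes.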
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement
– Reducing the load capacitance improves both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], with the voltage axis marked at 0.6, 3.0 and 5.0]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is to decrease the activity of some parts of the VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
a) Reducing fCLK is an acceptable option when some components may be idle or low-active during operation
b) Reducing Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product ptCL is called the average switched capacitance per cycle. This capacitance can be reduced at the system, architectural, RTL, circuit or technology level
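
The trade-off in point b) can be sketched numerically. The quadratic power term follows directly from the switching-power formula; the delay model below is an assumption that is not in the presentation: it uses the commonly cited alpha-power approximation, delay proportional to Vdd / (Vdd - Vt)^alpha, with a hypothetical threshold voltage Vt = 0.7 V and alpha = 2.

```python
def relative_power(v_dd, v_ref):
    """Switching power relative to the reference supply (P is proportional to Vdd^2)."""
    return (v_dd / v_ref) ** 2


def relative_delay(v_dd, v_ref, v_t=0.7, alpha=2.0):
    """Gate delay relative to the reference supply.

    Assumed model (not from the slides): delay ~ Vdd / (Vdd - Vt)^alpha.
    v_t and alpha are hypothetical device parameters.
    """
    if v_dd <= v_t or v_ref <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")

    def delay(v):
        return v / (v - v_t) ** alpha

    return delay(v_dd) / delay(v_ref)


# Scaling the supply from 5.0 V down to 3.0 V:
print(f"power x{relative_power(3.0, 5.0):.2f}")
print(f"delay x{relative_delay(3.0, 5.0):.2f}")
```

Under these assumptions the power falls to about a third while the gate delay roughly doubles, which is why voltage reduction is usually combined with parallelism or pipelining to recover throughput.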
General Approaches to Reduce Power
Low Power and Low Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology (higher levels: higher impact, more options)
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple clock frequencies on the chip is a less aggressive approach that is attracting more attention
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL producing fCLK, with four local clock generators labelled PLL 1/DLL 1 (f11, f12), PLL 2/DLL 2 (f21, f21), PLL 3/DLL 3 (f31, f32) and PLL 4/DLL 4 (f41, f41)]
Energy Minimization Using Multiple Frequency
[Figure, PLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO, divider by N, regulated voltage, clock distribution & frequency multiplier, logic, digital system]
[Figure, DLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, control, VCDL (delay TVCDL, in/out) with delay cells DC1, DC2, ..., DCn, outputs f1, 2f1, ..., nf1, digital system]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure, clock gating: an enable signal is captured by a latch (or DFF with inputs D, C and output Q) and combined with the clock to form the gated clock that drives the target flip-flops]
[Figure, clock distribution: a PLL (clock generator) distributes Clk to modules A, B, C and D; Enable_A, Enable_B and Enable_C mark modules as activated or deactivated]
– The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
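
The effect of gating can be illustrated with a small behavioural model. This is a sketch under stated assumptions, not the circuit from the figure: it assumes the enable is held in a latch that is transparent while the clock is low, and that the gated clock is the AND of the clock and the latched enable. The waveforms are hypothetical and sampled once per half cycle.

```python
def gated_clock_wave(clock, enable):
    """Latch-based clock gating, one sample per half clock cycle.

    The latch passes `enable` only while the clock is low, so the gated
    clock cannot glitch when `enable` changes during the high phase.
    """
    latched = 0
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:            # latch is transparent while the clock is low
            latched = en
        out.append(clk & latched)
    return out


def rising_edges(wave):
    """Count 0 -> 1 transitions; each one clocks the target flip-flops."""
    previous = [0] + wave[:-1]
    return sum(1 for prev, cur in zip(previous, wave) if prev == 0 and cur == 1)


clock = [0, 1] * 4                      # four full clock cycles
enable = [1, 1, 0, 0, 0, 1, 1, 1]       # module idle in the middle cycles
gated = gated_clock_wave(clock, enable)

print("ungated edges:", rising_edges(clock))
print("gated edges:  ", rising_edges(gated))
```

Every suppressed edge is a cycle in which the clock network and flip-flops of the idle module do not switch, which is where the energy saving comes from.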
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip, as a less aggressive approach, is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
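
The assignment idea in the second bullet can be sketched as follows. This is a deliberately simplified illustration, not a real timing flow: each module is treated as an independent path, the two voltage levels are hypothetical, and the delay model (delay proportional to Vdd / (Vdd - Vt)^alpha with Vt = 0.7 V, alpha = 2) is an assumption not taken from the presentation.

```python
V_HIGH, V_LOW = 3.3, 2.5    # hypothetical supply levels in volts


def delay_scale(v_dd, v_ref=V_HIGH, v_t=0.7, alpha=2.0):
    """Factor by which gate delay grows when moving from v_ref to v_dd (assumed model)."""
    def delay(v):
        return v / (v - v_t) ** alpha
    return delay(v_dd) / delay(v_ref)


def assign_voltages(module_delays_ns, clock_period_ns):
    """Give V_LOW to every module that still meets timing at the lower supply.

    module_delays_ns maps a module name to its delay at V_HIGH; each module
    is treated as its own path, which is a simplification.
    """
    slowdown = delay_scale(V_LOW)
    return {
        name: V_LOW if delay * slowdown <= clock_period_ns else V_HIGH
        for name, delay in module_delays_ns.items()
    }


modules = {"alu": 9.0, "decoder": 5.0, "io": 3.0}   # hypothetical delays in ns
assignment = assign_voltages(modules, clock_period_ns=10.0)
for name, v in assignment.items():
    print(f"{name}: {v} V, switching energy x{(v / V_HIGH) ** 2:.2f}")
```

In this example the critical module keeps the high supply and the two modules with slack move to the low supply, each saving roughly 40% of its switching energy under the assumed levels.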
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
[Figure, power state machine: states RUN (P = 400 mW), IDLE (P = 50 mW, waits for interrupt) and SLEEP (P = 0.16 mW, waits for wake-up event); transition times are labelled ~10 µs, ~90 µs and 160 ms]
[Figure, power manager: an OBSERVER and a CONTROLLER; the observer collects observations and workload information from the SYSTEM, and the controller issues commands to it]
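
A power manager has to decide when an idle period is long enough to justify entering the deepest state. The sketch below computes a break-even time from the power values shown in the state machine. Several points are assumptions rather than facts from the presentation: the short transition labels are read as microseconds, entering SLEEP is taken to cost ~90 µs and leaving it 160 ms, and the system is assumed to draw RUN power during both transitions.

```python
P_RUN, P_IDLE, P_SLEEP = 0.400, 0.050, 0.00016   # watts, from the state machine
T_TO_SLEEP, T_WAKE = 90e-6, 160e-3               # seconds (assumed reading of the labels)


def break_even_time():
    """Shortest idle period for which SLEEP uses no more energy than IDLE.

    Staying idle for T costs P_IDLE * T. Sleeping costs P_RUN during the
    two transitions (an assumption) plus P_SLEEP for the remaining time.
    """
    t_transition = T_TO_SLEEP + T_WAKE
    t_be = (P_RUN - P_SLEEP) * t_transition / (P_IDLE - P_SLEEP)
    return max(t_be, t_transition)


def choose_state(predicted_idle_s):
    """Simple policy: sleep only if the predicted idle period exceeds break-even."""
    return "SLEEP" if predicted_idle_s > break_even_time() else "IDLE"


print(f"break-even idle time: {break_even_time():.2f} s")
print(choose_state(0.5), choose_state(5.0))
```

The long wake-up time dominates the result: under these assumptions, idle periods shorter than about 1.3 s are better served by the IDLE state.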
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown: clock, memory, control, I/O, datapath. Dynamic instruction statistics: compare op, logical op, arithmetic op, data move, control flow, others]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Microprocessors' Generations
• First generation, 1971-78: behind the power curve
• Second generation, 1979-85: becoming "real" computers
• Third generation, 1985-89: challenging the "establishment"
• Fourth generation, 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation 1971-78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
• Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
– 3,500 transistors
– First microprocessor-based computer (Micral): targeted at laboratory instrumentation, mostly sold in Europe
Intel 8080
• 8-bit architecture with 16-bit addressing
– Delivered in 1974
– 4,800 transistors
– Performance < 0.2 MIPS
• Used in the Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975: $297, or $395 with case; 256 bytes of memory, expandable to 64K; keyboard and floppy; 100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one (Homebrew Computer Club)
Intel 8086
• Introduced in 1978
– Performance < 0.5 MIPS
• New 16-bit architecture
– "Assembly language" compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
– Based on the 8088, the 8-bit bus version of the 8086
Second Generation 1979-85
Becoming "real" computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
– First 32-bit architecture, initial 16-bit implementation
– First flat 32-bit address; support for paging
– General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
– 68,000 transistors
– < 1 MIPS
• Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the "establishment"
– Microprocessors surpass minicomputers in performance and rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
• Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation 1990-
Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
– True from 1985 to present
• Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to high clock rates
– More transistors
• Architectural ideas turn transistors into performance
– Responsible for about half the yearly performance growth
• Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide the latency of DRAM and increase bandwidth (BW)
– CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching: first multilevel cache 1986; secondary cache sizes today 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies: early restart after cache miss (1992); nonblocking caches continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers; prefetching: instruction to bring data into cache early
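
Why a second cache level helps can be shown with the standard average memory access time (AMAT) calculation. This is a minimal sketch; the latencies and miss rates below are hypothetical and not taken from the presentation.

```python
def amat_one_level(l1_hit_cycles, l1_miss_rate, memory_cycles):
    """Average memory access time with a single cache level, in cycles."""
    return l1_hit_cycles + l1_miss_rate * memory_cycles


def amat_two_level(l1_hit_cycles, l1_miss_rate,
                   l2_hit_cycles, l2_local_miss_rate, memory_cycles):
    """Average memory access time with a secondary cache, in cycles.

    l2_local_miss_rate is the fraction of L1 misses that also miss in L2.
    """
    l1_miss_penalty = l2_hit_cycles + l2_local_miss_rate * memory_cycles
    return l1_hit_cycles + l1_miss_rate * l1_miss_penalty


# Hypothetical machine: 2-cycle L1, 5-cycle L2, 50-cycle main memory.
single = amat_one_level(2, 0.05, 50)
double = amat_two_level(2, 0.05, 5, 0.20, 50)
print(f"one level:  {single:.2f} cycles")
print(f"two levels: {double:.2f} cycles")
```

With these assumed numbers the secondary cache cuts the average access time from 4.5 to 2.75 cycles, because most L1 misses are served in 5 cycles instead of 50.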
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock: superscalar uses dynamic issue decision (HW driven); VLIW uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out of order
– speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
– On-chip
– Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
• First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
• Implemented in 1991
– 1.3M transistors
– 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
– Instructions scheduled and executed out of order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in flight)
– Maintains precise state by completing instructions in order
• Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
– Uses a compiler-centric approach while avoiding its disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
• Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques
– Static scheduling
– Static issue (i.e. VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
• Hardware techniques
– Dynamic scheduling
– Dynamic issue (i.e. superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
• Wide variety of approaches, both hardware and compiler intensive
• Software approaches: lower hardware complexity, more longer-range analysis, more machine dependence
• Hardware approaches: more stable performance, higher complexity, potential clock rate impact
• No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (logarithmic axis, 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (logarithmic axis, 1,000 to 100,000,000) versus year (1970 to 2005) for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000; eras marked as bit-level parallelism, instruction-level, thread-level (?)]
Microprocessors: where they go
Intel: More Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
They support 2-4 threads, but expect only about a 1.3x improvement in throughput
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique for obtaining more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
• 4–8 general-purpose processing engines on chip, used to execute independent programs
• Explicitly parallel programs (when possible)
• Speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "where you stand depends on where you sit"
In this context, starting from our positions and experience, this is our view on the question: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education – not just the undergraduate curriculum – organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change – so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
But, as we said earlier, engineering is changing
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These too are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he had never seen a process that could not be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
ldquoCMOS Circuits dissipate little power by nature So believed circuit designersrdquo(Kuroda-Sakurai 95)
ldquoBy the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages even if the supply voltage can be feasibly reducedrdquo
95908580001
01
1
10
100P
ow
er (
W)
x4 3years
Power dissipation in timePower dissipation in time
Gloom and Doom predictionsGloom and Doom predictions
Power density will increasePower density will increase
Year
Vo
ltag
e [V
]
Po
wer
per
ch
ip [
W]
VD
D c
urr
ent
[A]
VDD Power and Current TrendVDD Power and Current Trend
1998 2002 2006 2010 20140
05
1
15
2
25
0 0
200 500
Current
Power
Voltage
International Technology Roadmap for Semiconductors 1999 update sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA) Electronic Industries Association of Japan (EIAJ) Korea Semiconductor Industry Association (KSIA) and Taiwan Semiconductor Industry Association (TSIA)
( Taken from Sakurairsquos ISSCC 2001 presentation)
Power Delivery Problem (not just Power Delivery Problem (not just California)California)
Your carstarter
Power Consumption New Power Consumption New Dimension in DesignDimension in Design
Sources of Power Sources of Power ConsumptionConsumption
The three major sources of power consumption in digital CMOS circuits are
21 2 3avg t L dd clk sc dd leakage ddP p C V f I V I V P P P
where
P1 ndash capacitive switching power (dynamic - dominant)
P2 ndash short circuit power (dynamic)
P3 ndash leakage current power (static)
P4 ndash static power dissipation (minor)
+ P4
Research Efforts in Low-Power DesignResearch Efforts in Low-Power Design
Psw = pt CL V2
dd fCLK
Reduce Switching ActivitybullConditional clockbullConditional prechargebullSwitching-off inactive blocksbullConditional execution
Run it slowerbullUse parallelismbullLess pipeline stagesbullUse double-edge flip-flop
Technology scalingbullThe highest winbullThresholds should scalebullLeakage starts to bytebullDynamic voltage scaling
Reduce the active loadbullMinimize the circuitsbullUse more efficient designbullCharge recycling bullMore efficient layout
Reducing the Power Reducing the Power DissipationDissipation
The power dissipation can be minimized by reducing
supply voltageload capacitanceswitching activity
ndash Reducing the supply voltage brings a quadratic improvement
ndash Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Amount of Reducing the Power DissipationDissipation
Gate Delay and Power Dissipation Gate Delay and Power Dissipation in Term of Supply Voltagein Term of Supply Voltage
06 30 50
Supply voltage [ V ]
Ga
te d
elay
[n
s]
(no
rma
lize
d)
Po
we
r d
issi
pat
ion
[ W
](n
orm
ali
zed
)
1
1 02 5
1
bull Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
bull The main interest of many researches is now oriented towards lowering the energy dissipation of these systems while still maintaining the high-throughput in real time processing
Needs for Low-Power
Low-Power Design Low-Power Design Techniques Techniques
The basic idea is
Decreasing activity of the some parts within VLSI IC
The term power manager refer to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low active conditions for various parts of the circuit and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-active components
a) Reduction in fCLK is an option acceptable when
some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way for
power reduction since the power is proportional to the square of Vdd The problem with reducing
Vdd is that it leads to an increase in circuit delay
c) The product ptCL is called the average switched
capacitance per cycle and the main directions for reducing this capacitance are done at system- architectural- RTL- circuit- or technology level
General Approaches to Reduce PowerGeneral Approaches to Reduce Power
Low Power and Low Energy System DesignLow Power and Low Energy System Design
higher impactmore options
AlgorithmLevel
ArchitectureLevel
CircuitLevel
Process DeviceLevel
SystemLevel Design partitioning Power Down
Complexity Concurrency LocalityRegularity Data representation
Voltage scaling ParallelismInstruction set Signal correlations
Transistor sizing Logic optimizationActivity Driven Power Down Low-swing logic Adiabatic switching
Threshold Reduction Multi-threshold
The design of low power circuits can be tackled at different levels from system to technology
Multiple Frequency on the Chip as Multiple Frequency on the Chip as Technique to Reduce PowerTechnique to Reduce Power
Less aggressive approach is which attracts more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
f11 f12
PLL 1 DLL 1
PLL
fCLK
f21 f21
PLL 2 DLL 2
f31 f32
PLL 3 DLL 3
f41 f41
PLL 4 DLL 4
f11 f12
PLL 1 DLL 1
PLL
fCLK
f21 f21
PLL 2 DLL 2
f31 f32
PLL 3 DLL 3
f41 f41
PLL 4 DLL 4
Energy Minimization Using Energy Minimization Using Multiple FrequencyMultiple Frequency
phasedetector
curentpump
divider by N
up
down
CLKREF
CLKFB
VCO
loopfilter
digital system
regulated voltage
clock distribution ampfrequency multiplier
logic
phasedetector
curentpump
up
down
CLKREF
CLKFBloopfilter
digital system
f1 2f1 nf1
control
VCDL
TVCDL
DC1 DC2 DCn
in out
PLL based
DLL based
Clock GatingClock Gating amp Clock Distribution as amp Clock Distribution as Techniques to Reduce PowerTechniques to Reduce Power
DFF
D
C
Q
enable
clock
gated-clock
target flip-flops
latch
clock
enable
gated-clock
activated deactivated
D C
BA
Enable_BEnable_A
Enable_C
PLL(Clk generator)
Clk
Clock distributionClock gating
- The use of gated clock is the most common approach to reduce energy Unused modules are turned off by suppressing the clock to the module
Energy Minimazation Using Energy Minimazation Using Multiple Supply VoltageMultiple Supply Voltage
bull Multiple supply voltage on the chip as less aggressive approach is attracting attention
bull This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
bull This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management System Level Dynamic Power Management as another Techniques to Reduce Poweras another Techniques to Reduce Power
Dynamic power management is design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
RUN
SLEEPIDLE~90s
~10s 160ms
Wait for interrupt Wait for wake-up event
P=400mW
~90s~10s
P=50mW P=016mW
OBSERVER CONTROLLER
Workloadinformation
Power Manager
SYSTEM
Observations Commands
Power Manager Power State Machine
Power Breakdown in High-Performance Power Breakdown in High-Performance CPU and Dynamic Instruction StatisticsCPU and Dynamic Instruction Statistics
12
16
25
43
Clock
Memory
Control IODatapath
1513
51
23
43
Compare op
Logical op
Others
Data Move
Control Flow
Arithmetic op
Power breakdown Dynamic instruction statistics
Architecture Trade-offs ndash Architecture Trade-offs ndash Reference DatapathReference Datapath
Parallel DatapathParallel Datapath
The More Parallel the BetterThe More Parallel the Better
Pipeline DatapathPipeline Datapath
Architecture Summary for a SimpleArchitecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First Generation 1971-78: behind the power curve
• Second Generation 1979-85: becoming "real" computers
• Third Generation 1985-89: challenging the "establishment"
• Fourth Generation 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation 1971-78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3500 transistors
  - First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• 8-bit architecture with 16-bit addressing
  - Delivered in 1974
  - 4800 transistors
  - Performance < 0.2 MIPS
• Used in the Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
    Homebrew Computer Club
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
  - Based on the 8088, an 8-bit bus version of the 8086
Second Generation 1979-85
Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture; initial 16-bit implementation
  - First flat 32-bit address; support for paging
  - General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation 1990-
Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985 to the present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors: architectural ideas turn transistors into performance
  - Architecture is responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction level parallelism
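To get a feel for what 1.6x per year means when compounded, the arithmetic can be sketched as below. The growth factor is the one quoted on the slide; reading "about half the yearly performance growth" as two equal multiplicative shares is an assumption of this sketch, and the function names are illustrative.

```python
import math

ANNUAL_GROWTH = 1.6  # overall performance growth factor per year (from the slide)


def cumulative_speedup(years: int, annual: float = ANNUAL_GROWTH) -> float:
    """Total speedup after the given number of years of compound growth."""
    return annual ** years


def per_source_factor(annual: float = ANNUAL_GROWTH) -> float:
    """Yearly factor per source if technology and architecture contribute equally.

    Exponential contributions multiply, so two equal shares are sqrt(annual) each.
    """
    return math.sqrt(annual)


if __name__ == "__main__":
    print(f"speedup after 10 years: {cumulative_speedup(10):.0f}x")
    print(f"per-source yearly factor: {per_source_factor():.3f}x")
```

Ten years at this rate gives roughly a hundredfold speedup, with each of the two sources contributing about 1.26x per year under the equal-share assumption.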
Memory Hierarchies
• Caches hide latency of DRAM and increase bandwidth
  - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
    Early restart after cache miss (1992)
    Nonblocking caches: continue during a cache miss (1994)
  - Cache-aware combos: computers, compilers, code writers
    Prefetching: instruction to bring data into cache early
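How a second cache level hides DRAM latency can be illustrated with the standard average memory access time (AMAT) formula, hit time plus miss rate times miss penalty. The formula is textbook material rather than something stated on the slide, and the cycle counts and miss rates below are illustrative assumptions.

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time in cycles for a single cache level."""
    return hit_time + miss_rate * miss_penalty


def two_level_amat(
    l1_hit: float,
    l1_miss_rate: float,
    l2_hit: float,
    l2_local_miss_rate: float,
    memory_time: float,
) -> float:
    """AMAT with a secondary cache: an L1 miss costs the AMAT of the L2 level."""
    l1_miss_penalty = amat(l2_hit, l2_local_miss_rate, memory_time)
    return amat(l1_hit, l1_miss_rate, l1_miss_penalty)


if __name__ == "__main__":
    # Illustrative values: 2-cycle L1, 5-cycle L2, 50-cycle memory,
    # 5% L1 miss rate, 20% local L2 miss rate.
    print("L1 only:", amat(2, 0.05, 50), "cycles")
    print("L1 + L2:", two_level_amat(2, 0.05, 5, 0.20, 50), "cycles")
```

With these numbers the secondary cache lowers the average access time from 4.5 to 2.75 cycles, which is the kind of gain that motivates Trend 1.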
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock
    Superscalar: uses dynamic issue decision (HW driven)
    VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  - Determine parallelism dynamically
  - Execute instructions out-of-order
  - Speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded 20 year path
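The effect of issuing multiple instructions per clock can be shown with a toy in-order issue model. This is a deliberately simplified sketch: every instruction is assumed to take one cycle, only read-after-write dependences are tracked, and the program encoding as (destination, sources) pairs is hypothetical.

```python
def cycles_in_order(program, width):
    """Cycles needed to issue `program` in order, up to `width` per clock.

    program: list of (dest_register, [source_registers]) tuples.
    Each result becomes readable one cycle after issue (unit latency).
    """
    ready_at = {}  # register -> first cycle in which its value can be read
    cycle = 0
    i = 0
    while i < len(program):
        issued = 0
        while i < len(program) and issued < width:
            dest, sources = program[i]
            if any(ready_at.get(src, 0) > cycle for src in sources):
                break  # in-order issue: a stalled instruction blocks younger ones
            ready_at[dest] = cycle + 1
            issued += 1
            i += 1
        cycle += 1
    return cycle


# A program with independent work to overlap ...
mixed = [
    ("r1", ["a"]),
    ("r2", ["b"]),
    ("r3", ["r1", "r2"]),
    ("r4", ["c"]),
    ("r5", ["r3", "r4"]),
    ("r6", ["d"]),
]
# ... and a pure dependence chain with no ILP at all.
chain = [("r1", ["a"]), ("r2", ["r1"]), ("r3", ["r2"])]

if __name__ == "__main__":
    for name, prog in (("mixed", mixed), ("chain", chain)):
        print(name, [cycles_in_order(prog, w) for w in (1, 2)])
```

Dual issue halves the cycle count of the mixed program but does nothing for the dependence chain, which is why the parallelism has to be present among the instructions before hardware or compiler can exploit it.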
MIPS R4000
• First 64-bit architecture
• Integrated caches
  - On-chip
  - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out-of-order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in-flight)
  - Maintains precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Uses a compiler-centric approach while avoiding disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software Techniques
  - Static scheduling
  - Static issue (i.e. VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
  Lower hardware complexity; more longer-range analysis; more machine dependence
• Hardware Techniques
  - Dynamic scheduling
  - Dynamic issue (i.e. superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
  More stable performance; higher complexity; potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: "ILP Mountain" with stages labeled simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling and speculation; "Cache Mountain" with stages labeled simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching and multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970-2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000, spanning the eras of bit-level, instruction-level and thread-level parallelism]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most pronounced are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4).
Support for 2-4 threads, but expect to get only a 1.3x improvement in throughput.
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique to obtain more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education - not just the undergraduate curriculum - organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change - so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
Web-based teaching,
distance learning,
electronic books and
interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Gloom and Doom predictions
Power density will increase
VDD, Power and Current Trend
[Figure: supply voltage [V], power per chip [W] and VDD current [A] versus year, 1998-2014]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA) and the Taiwan Semiconductor Industry Association (TSIA). (Taken from Sakurai's ISSCC 2001 presentation.)
Power Delivery Problem (not just California)
[Figure label: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits, plus a minor static term, are
$$P_{avg} = p_t C_L V_{dd}^2 f_{clk} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
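The breakdown into P1 to P4 can be evaluated numerically with a small helper. This is a sketch under stated assumptions: the switching term is taken as p_t * C_L * V_dd^2 * f_clk and the short-circuit and leakage terms as current times V_dd, as listed above, while every numeric value and the function name are hypothetical.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p_static=0.0):
    """Return the CMOS power components in watts.

    p_t      switching activity factor (dimensionless)
    c_load   switched load capacitance in farads
    v_dd     supply voltage in volts
    f_clk    clock frequency in hertz
    i_sc     average short-circuit current in amperes
    i_leak   leakage current in amperes
    p_static any remaining static dissipation in watts (P4)
    """
    p1 = p_t * c_load * v_dd ** 2 * f_clk  # capacitive switching power
    p2 = i_sc * v_dd                       # short-circuit power
    p3 = i_leak * v_dd                     # leakage power
    total = p1 + p2 + p3 + p_static
    return {"P1": p1, "P2": p2, "P3": p3, "P4": p_static, "total": total}


if __name__ == "__main__":
    # Hypothetical chip: 1 nF switched capacitance, 100 MHz, 20% activity.
    nominal = average_power(0.2, 1e-9, 2.5, 100e6, 5e-3, 1e-4)
    halved = average_power(0.2, 1e-9, 1.25, 100e6, 5e-3, 1e-4)
    print(nominal)
    print("P1 ratio at half the supply voltage:", halved["P1"] / nominal["P1"])
```

Halving the supply voltage cuts the switching term to a quarter while the current-driven terms only halve, which is why P1 is the first target of low-power design.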
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reduction in Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V], with the voltage axis marked at 0.6, 3.0 and 5.0 V]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is:
Decreasing the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
General Approaches to Reduce Power
a) Reduction in fCLK is an option, acceptable when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product pt·CL is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit or technology level.
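The tension in point b), quadratic power savings against growing circuit delay, can be sketched as follows. The power relation (proportional to Vdd squared at fixed frequency) follows the text; the delay model, delay proportional to Vdd / (Vdd - Vt)^2, is a common first-order approximation that is an assumption here and not taken from the slides, as is the 0.6 V threshold voltage.

```python
def relative_power(v_dd: float, v_ref: float) -> float:
    """Switching power at v_dd relative to v_ref, at the same clock frequency."""
    return (v_dd / v_ref) ** 2


def relative_delay(v_dd: float, v_ref: float, v_t: float = 0.6) -> float:
    """Gate delay at v_dd relative to v_ref under the assumed first-order model."""
    if min(v_dd, v_ref) <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")

    def delay(v: float) -> float:
        return v / (v - v_t) ** 2  # assumed model, arbitrary units

    return delay(v_dd) / delay(v_ref)


if __name__ == "__main__":
    for v in (5.0, 3.3, 3.0, 2.0):
        print(
            f"Vdd = {v:.1f} V: power x{relative_power(v, 5.0):.2f}, "
            f"delay x{relative_delay(v, 5.0):.2f}"
        )
```

Under this model, dropping from 5.0 V to 3.0 V cuts switching power to about a third while roughly doubling gate delay, so the lost speed has to be recovered elsewhere, for example through parallelism or pipelining.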
Low Power and Low Energy System Design
(higher impact and more options towards the system level)
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology.
Multiple Frequency on the Chip as a Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention.
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL distributes fCLK to four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), each producing its own local frequencies (f11, f12, f21, f31, f32, f41)]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based scheme with phase detector, current pump (up/down), loop filter, VCO and divider by N between CLKREF and CLKFB, feeding the digital system with a regulated voltage; DLL-based scheme with phase detector, current pump, loop filter and a voltage-controlled delay line (VCDL, delay cells DC1 to DCn) whose outputs go through clock distribution and frequency multiplier logic to produce f1, 2f1, ..., nf1]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock-gating circuit in which a latch captures the enable signal and gates the clock feeding the target flip-flops (DFF with D, C and Q); clock distribution from a PLL (clock generator) to modules A, B, C and D, with gating signals Enable_A, Enable_B and Enable_C and modules marked activated or deactivated]
- The use of gated clocks is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
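A latch-based clock gate of the kind drawn in the figure can be modelled behaviourally in Python. This is a sketch with assumed details: the latch is taken to be transparent while the clock is low, the gated clock is the AND of the clock and the latched enable, and signals are sampled lists of 0/1 values with hypothetical names.

```python
def gated_clock_trace(clock, enable):
    """Return the gated clock for sampled clock and enable waveforms.

    The enable is captured by a latch that is transparent while the clock
    is low, so a change of enable during the high phase cannot glitch the
    output. The gated clock is clock AND latched enable.
    """
    latched = 0
    gated = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en  # latch follows enable only while the clock is low
        gated.append(clk & latched)
    return gated


def rising_edges(trace):
    """Count 0 -> 1 transitions, a proxy for clocked switching activity."""
    return sum(1 for prev, cur in zip(trace, trace[1:]) if prev == 0 and cur == 1)


if __name__ == "__main__":
    clock = [0, 1] * 8              # eight clock cycles
    enable = [1] * 6 + [0] * 10     # module needed for the first three cycles
    gated = gated_clock_trace(clock, enable)
    print("free-running edges:", rising_edges(clock))
    print("gated edges:       ", rising_edges(gated))
```

The flip-flops behind the gate see only the clock edges that occur while the module is enabled, so their switching activity, and with it their dynamic power, falls in proportion to the idle time.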
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
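The idea of giving critical modules the high supply and everything else a lower one can be sketched as a simple slack check. All specifics here are assumptions made for illustration: the two voltage levels, the 1.5x delay penalty at the lower level, the module names and delays, and the simplification that each module is its own timing path. Level converters between voltage domains are ignored.

```python
V_HIGH, V_LOW = 3.3, 2.5   # hypothetical supply levels in volts
SLOWDOWN_AT_LOW = 1.5      # hypothetical delay penalty factor at V_LOW


def assign_voltages(module_delay_ns, cycle_time_ns):
    """Map each module to V_LOW if it still meets timing there, else V_HIGH.

    module_delay_ns: module name -> delay in ns when run at V_HIGH.
    """
    return {
        name: V_LOW if delay * SLOWDOWN_AT_LOW <= cycle_time_ns else V_HIGH
        for name, delay in module_delay_ns.items()
    }


def relative_energy(assignment, switched_cap):
    """Switching energy relative to running every module at V_HIGH (E ~ C * V^2)."""
    baseline = sum(cap * V_HIGH ** 2 for cap in switched_cap.values())
    actual = sum(switched_cap[name] * v ** 2 for name, v in assignment.items())
    return actual / baseline


if __name__ == "__main__":
    delays = {"alu": 9.0, "decoder": 4.0, "io": 3.0}
    caps = {"alu": 1.0, "decoder": 1.0, "io": 1.0}  # equal switched capacitance
    chosen = assign_voltages(delays, cycle_time_ns=10.0)
    print(chosen)
    print(f"energy relative to single supply: {relative_energy(chosen, caps):.2f}")
```

In this example only the ALU sits on the critical path and keeps the high supply; moving the other two modules to the lower supply trims switching energy by a little under 30 percent without touching the cycle time.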
System Level Dynamic Power Management System Level Dynamic Power Management as another Techniques to Reduce Poweras another Techniques to Reduce Power
Dynamic power management is design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
RUN
SLEEPIDLE~90s
~10s 160ms
Wait for interrupt Wait for wake-up event
P=400mW
~90s~10s
P=50mW P=016mW
OBSERVER CONTROLLER
Workloadinformation
Power Manager
SYSTEM
Observations Commands
Power Manager Power State Machine
Power Breakdown in High-Performance Power Breakdown in High-Performance CPU and Dynamic Instruction StatisticsCPU and Dynamic Instruction Statistics
12
16
25
43
Clock
Memory
Control IODatapath
1513
51
23
43
Compare op
Logical op
Others
Data Move
Control Flow
Arithmetic op
Power breakdown Dynamic instruction statistics
Architecture Trade-offs ndash Architecture Trade-offs ndash Reference DatapathReference Datapath
Parallel DatapathParallel Datapath
The More Parallel the BetterThe More Parallel the Better
Pipeline DatapathPipeline Datapath
Architecture Summary for a SimpleArchitecture Summary for a Simple
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Microprocessorsrsquo Microprocessorsrsquo GenerationsGenerations
bullFirst generation 1971-78
bullSecond Generation 1979-85
bullThird Generation 1985-89
bullFourth Generation 1990-
ndashBehind the power curve
ndashBecoming ldquorealrdquo computers
ndashChallenging the ldquoestablishmentrdquo
ndashArchitectural and performance leadership
The microprocessor todayThe microprocessor today When we say ldquomicroprocessorrdquo today we generally mean the shaded area of the figure
The First Generation 1971-78The First Generation 1971-78
Getting enough bits and transistors Transistor counts lt 50000 Performance lt 05 MIPS Architecture 8-16 bits
ndash Narrow datapaths (= slow performance)ndash Awkward architecturesndash Assembly language + some BASIC
Processorsndash Intel 4004 8008 8080 8086ndash Zilog Z-80ndash Motorola 6800 6502
Intel 4004Intel 4004 First general-purpose
single-chip microprocessor Shipped in 1971 8-bit architecture 4-bit
implementation 2300 transistors Performance lt 01 MIPS 8008 8-bit implementation
in 1972ndash 3500 transistorsndash First microprocessor-based
computer (Micral) Targeted at laboratory
instrumentation Mostly sold in Europe
Intel 8080Intel 8080 Intelrsquos first 16-bit architecture
ndash Delivered in 1974ndash 4800 transistorsndash Performance lt 02 MIPS
Used in Altair 8800 system ndash Kit form (advertised in Popular
Electronics) in 1975 $297 or $395 with case 256 bytes of memory
expandable to 64K Keyboard and floppy 100-line bus becomes S-100
first microcomputer busndash Gates amp Allen write BASICndash Wozniak builds one
Homebrew Computer Club
Intel 8086Intel 8086
Introduced in 1978ndash Performance lt 05 MIPS
New 16-bit architecturendash ldquoAssembly languagerdquo
compatible with 8080ndash 29000 transistorsndash Includes memory protection
support for FP coprocessor In 1981 IBM introduces
PC ndash Based on 8088--8-bit bus
version of 8086
Second Generation 1979-85Second Generation 1979-85 Becoming ldquorealrdquo computers
ndash First 32-bit architecture (68000)ndash First virtual memory supportndash Workstations Macs and PCs based on microprocessors
Transistors gt50000 Performance lt= 1 MIPS Processors
ndash Motorola 68000 68020ndash Intel 80286 80386
Motorola 68000Motorola 68000 Major architectural step in
microprocessorsndash First 32-bit architecture
initial 16-bit implementation
ndash First flat 32-bit address Support for paging
ndash General-purpose register architecture
Loosely based on PDP-11
First implementation in 1979ndash 68000 transistorsndash lt 1 MIPS
Used inndash Apple Macndash Sun Silicon Graphics amp Apollo
workstations
Third Generation 1985-89Third Generation 1985-89 Challenging the ldquoestablishmentrdquo
ndash Microprocessors surpass minicomputers in performance rival mainframes
ndash Implementation technology of choice all new architectures are microprocessors
ndash RISC architecture techniques take hold Transistors lt 500K Performance gt 5 MIPS Processors
ndash MIPS R2000 R3000ndash Sun SPARCndash HP PA-RISC
MIPS R2000MIPS R2000 Several firsts
ndash First RISC microprocessor
ndash First microprocessor to provide integrated support for instruction amp data cache
ndash First pipelined microprocessor (sustains 1 instructionclock)
Implemented in 1985ndash 125000 transistorsndash 5-8 MIPS
Fourth Generation 1990-Fourth Generation 1990- Architectural and performance leadership
ndash First 64-bit architecturendash First multiple-issue machinendash First multilevel caches
Transistors gt1M Clock ratesgt 100MHz Performance gt 50 MIPS Processors
ndash Intel i860 Pentium MIPS R4000 MIPS R1000 DEC Alpha Sun UltraSPARC HP PA-RISC PowerPC
Generation 45 ndash same basic approach but faster clock rates amp wider issuendash Alpha 21264 Pentium III amp 4 Intel Itanium
Key Architectural TrendsKey Architectural Trends Increase performance at 16x per year
ndash True from 1985-present Combination of technology and architectural
enhancementsndash Technology provides faster transistors and more of themndash Faster transistors leads to high clock ratesndash More transistors
Architectural ideas turn transistors into performancendash Responsible for about half the yearly performance growth
Two key architectural directionsndash Sophisticated memory hierarchiesndash Exploiting instruction level parallelism
Memory HierarchiesMemory Hierarchies Caches hide latency of DRAM and increase BW
ndash CPU-DRAM access gap has grown by a factor of 30-50 Trend 1 Increasingly large caches
ndash On-chip from 128 bytes (1984) to 100K+ bytesndash Multilevel caches add another level of caching
First multilevel cache1986 Secondary cache sizes today 128KB to 4-16 MB
Trend 2 Advances in caching techniquesndash Reduce or hide cache miss latencies
early restart after cache miss (1992) nonblocking caches continue during a cache miss (1994)
ndash Cache aware combos computers compilers code writers prefetching instruction to bring data into cache early
Exploiting ILPExploiting ILP ILP is the implicit parallelism among instructions Exploited by
ndash Overlapping execution in a pipeline ndash Issuing multiple instruction per clock
superscalar uses dynamic issue decision (HW driven) VLIW uses static issue decision (SW driven)
1985 simple microprocessor pipeline (1 instrclock) 1990 first static multiple issue microprocessors 1995 sophisticated dynamic schemes
ndash determine parallelism dynamicallyndash execute instructions out-of-orderndash speculative execution depending on branch prediction
ldquoOff-the-shelfrdquo ILP techniques yielded 20 year path
MIPS R4000MIPS R4000 First 64-bit architecture Integrated caches
ndash On-chipndash Support for off-chip secondary
cache Integrated floating point Implemented in 1991
ndash Deep pipelinendash 14M transistorsndash Initially 100MHzndash gt 50 MIPS
Intel i860Intel i860
First multiple issue microprocessor ndash 2 instructionsclockndash Dual issue mode ndash Novel push pipelinendash Novel cache bypass
Implemented in 1991ndash 13M transistorsndash 50 mips
Used primarily as attached processor (eg graphics)
MIPS R10000MIPS R10000
First speculative processorndash Instruction scheduled and
executed out-of-orderndash Up to 4 instructions can
complete per clockndash Window of 32 instructions
(up to 32 in-flight)ndash Maintain precise state by
completing instructions in order
Implemented in 1996ndash 68M transistorsndash 200 MHz
Intel IA-64 and ItaniumIntel IA-64 and Itanium
EPIC architecturendash Use compiler centric approach
while avoiding disadvantagesndash Parallelism demarcated by the
compilerndash Many special instruction amp
features for exploiting ILP in the compiler
Itaniumndash First implementation (2001) ndash 25 M transistorsndash 800 MHzndash 130 Watts
Breakdown of tasks between Breakdown of tasks between compiler and runtime hardwarecompiler and runtime hardware
Today's Uniprocessor ILP Menu
Software Techniques
- Static scheduling
- Static issue (i.e., VLIW)
- Static branch prediction
- Alias/pointer analysis
- Static speculation
Hardware Techniques
- Dynamic scheduling
- Dynamic issue (i.e., superscalar)
- Dynamic branch prediction
- Dynamic disambiguation
- Dynamic speculation
Wide variety of approaches, both hardware- and compiler-intensive
Software techniques: lower hardware complexity, longer-range analysis, more machine dependence
Hardware techniques: more stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers, and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistor count (log scale, 1,000 to 100,000,000) versus year (1970 to 2005), marking the eras of bit-level, instruction-level, and thread-level parallelism. Data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000.]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to achieve higher throughput at better energy efficiency.
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyper-Threading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4)
They support 2-4 threads, but expect only about a 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build multichip modules with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option when moving to a new process technology, such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa-2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute
- Independent programs
- Explicitly parallel programs (when possible)
- Speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view on the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT.
Discrete mathematics, not continuous math, is the underpinning of IT.
It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast.
Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields.
More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context.
The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them.
They too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the currently cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Power density will increase
[Figure: VDD, Power and Current Trend. Voltage [V] (axis 0 to 2.5), power per chip [W], and VDD current [A] (secondary axes 0 to 200 and 0 to 500) versus year, 1998 to 2014.]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA), and the Taiwan Semiconductor Industry Association (TSIA)
(Taken from Sakurai's ISSCC 2001 presentation)
Power Delivery Problem (not just California)
[Figure: your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$
where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
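As a rough numerical illustration of the power expression above, the following sketch evaluates each term separately. The function name and every operating-point value in the usage note are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p_static=0.0):
    """Evaluate P_avg = p_t*C_L*Vdd^2*f_CLK + I_sc*Vdd + I_leakage*Vdd + P4.

    Units: farads, volts, hertz, amperes, watts. Returns each term and the total.
    """
    switching = p_t * c_load * v_dd ** 2 * f_clk   # P1, capacitive switching
    short_circuit = i_sc * v_dd                    # P2, short-circuit current
    leakage = i_leak * v_dd                        # P3, leakage current
    return {
        "switching": switching,
        "short_circuit": short_circuit,
        "leakage": leakage,
        "static": p_static,                        # P4, minor static term
        "total": switching + short_circuit + leakage + p_static,
    }
```

For example, with an assumed activity factor of 0.1, 1 nF of switched capacitance, a 2 V supply, and a 100 MHz clock, `average_power(0.1, 1e-9, 2.0, 100e6, 1e-3, 1e-4)` gives a switching term of about 0.04 W; halving the supply to 1 V cuts that term to about 0.01 W, while the two current terms only halve.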
Research Efforts in Low-Power Design
$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance improves both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] plotted against supply voltage [V], with axis ticks at 0.6, 3.0, and 5.0 V.]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is
decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
a) Reduction in $f_{CLK}$ is an acceptable option when some components may be idle or low-active during operation
b) Reduction in $V_{dd}$ is the most effective way to reduce power, since power is proportional to the square of $V_{dd}$. The problem with reducing $V_{dd}$ is that it leads to an increase in circuit delay
c) The product $p_t C_L$ is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit, or technology level
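The trade-off in items a) and b) can be made concrete with a small sketch. The switching-power ratio follows directly from the proportionality to the square of the supply voltage and to the clock frequency. The delay model, delay proportional to V / (V - Vt)^alpha, is an assumption brought in for illustration (an alpha-power-style approximation), as are the threshold voltage and exponent defaults; none of them come from the slides.

```python
def relative_switching_power(v_dd, v_ref, f_clk=1.0, f_ref=1.0):
    """Switching power relative to a reference point: (V/Vref)^2 * (f/fref)."""
    return (v_dd / v_ref) ** 2 * (f_clk / f_ref)


def relative_gate_delay(v_dd, v_ref, v_t=0.6, alpha=2.0):
    """Gate delay relative to the reference, using delay ~ V / (V - Vt)^alpha.

    The model, v_t, and alpha are illustrative assumptions, not slide data.
    """
    if v_dd <= v_t or v_ref <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")
    delay = v_dd / (v_dd - v_t) ** alpha
    delay_ref = v_ref / (v_ref - v_t) ** alpha
    return delay / delay_ref
```

Under these assumptions, halving the supply from 3.0 V to 1.5 V gives `relative_switching_power(1.5, 3.0)` of 0.25, while `relative_gate_delay(1.5, 3.0)` comes out well above 1, which is the delay penalty item b) warns about. Halving only the clock frequency halves the power but leaves the energy per operation unchanged.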
General Approaches to Reduce Power
Low Power and Low Energy System Design
(levels listed from the top down; higher levels have higher impact and more options)
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology
Multiple Frequency on the Chip as Technique to Reduce Power
Multiple frequencies on the chip is a less aggressive approach that is attracting more attention
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL generating fCLK feeds four local clock domains, each with its own PLL and DLL (PLL 1/DLL 1 to PLL 4/DLL 4), producing local frequencies labeled f11, f12, f21, f31, f32, and f41.]
Energy Minimization Using Multiple Frequency
[Figure: two clock-generation schemes driving a digital system. PLL based: phase detector (inputs CLKREF and CLKFB, outputs up and down), current pump, loop filter, VCO with regulated voltage, divider by N, and clock distribution & frequency multiplier logic. DLL based: phase detector, current pump, loop filter, and a voltage-controlled delay line (VCDL, total delay TVCDL) built from delay cells DC1 to DCn with a control input, producing f1, 2f1, ..., nf1.]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: gated-clock circuit in which a latch holds the enable signal and combines it with the clock to produce the gated clock for the target flip-flops (DFF with D, C, and Q), with activated and deactivated intervals marked; and clock distribution with clock gating, in which a PLL (clock generator) drives Clk to modules A, B, C, and D through gates controlled by Enable_A, Enable_B, and Enable_C.]
- The use of gated clocks is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
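A small behavioral sketch may help show how suppressing the clock removes switching activity. It assumes a latch-based gate in which the enable is captured while the clock is low and then ANDed with the clock; the function names and the per-cycle enable list are hypothetical, and each cycle is modeled as just a low phase followed by a high phase.

```python
def gated_clock(enable_per_cycle):
    """Return the gated-clock waveform, two samples (low, high phase) per cycle."""
    waveform = []
    for enable in enable_per_cycle:
        # Low phase: the latch is transparent and captures the enable value.
        latched = 1 if enable else 0
        waveform.append(0)        # clock is low, so the gated clock is low
        # High phase: the latch holds; gated clock = clock AND latched enable.
        waveform.append(latched)
    return waveform


def rising_edges(waveform):
    """Count 0 -> 1 transitions, i.e. clock pulses that reach the flip-flops."""
    previous = 0
    edges = 0
    for level in waveform:
        if previous == 0 and level == 1:
            edges += 1
        previous = level
    return edges
```

With the enable pattern `[1, 1, 0, 0, 1]`, only three of the five clock pulses reach the target flip-flops, so under this simplified model the clock-related switching in that module drops in the same proportion.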
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
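The assignment rule described in the bullets above can be sketched as follows. This is a deliberately simplified model: each module is treated as an independent path that must fit in one cycle, the slowdown at the lower voltage is a single assumed factor, equal switched capacitance per module is assumed, and level converters are ignored. All names and numbers are hypothetical.

```python
def assign_voltages(module_delays, cycle_time, v_high, v_low, delay_factor_low):
    """Pick v_low for every module that still meets timing when slowed down.

    module_delays maps a module name to its delay at v_high;
    delay_factor_low is the assumed slowdown when running at v_low.
    """
    choice = {}
    for name, delay_high in module_delays.items():
        if delay_high * delay_factor_low <= cycle_time:
            choice[name] = v_low      # noncritical: enough slack to slow down
        else:
            choice[name] = v_high     # critical: keep the highest voltage
    return choice


def relative_switching_energy(choice, v_high):
    """Average (V/v_high)^2 over modules, assuming equal switched capacitance."""
    return sum((v / v_high) ** 2 for v in choice.values()) / len(choice)
```

For instance, with delays of 9, 4, and 5 time units for hypothetical modules `alu`, `ctrl`, and `io`, a cycle time of 10, supplies of 2.0 V and 1.0 V, and an assumed slowdown of 1.8, only `alu` stays at the high voltage and the relative switching energy falls to 0.5.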
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
[Figure: power state machine with three states, RUN (P = 400 mW), IDLE (P = 50 mW), and SLEEP (P = 0.16 mW). RUN enters IDLE on "wait for interrupt" and SLEEP on "wait for wake-up event"; the transitions are labeled ~10 µs, ~90 µs, and 160 ms.]
[Figure: power manager consisting of an observer and a controller. The observer collects workload information from observations of the system; the controller issues commands to the system.]
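The power state machine lends itself to a simple break-even calculation. In the sketch below the three power levels are the ones shown in the figure; the assignment of the ~10 µs, ~90 µs, and 160 ms labels to particular transitions, and the choice to charge transition time at RUN power, are assumptions made only for illustration.

```python
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}

# (time to enter, time to return to RUN) in seconds; this mapping is assumed.
TRANSITION_S = {"IDLE": (10e-6, 10e-6), "SLEEP": (90e-6, 160e-3)}


def energy_mj(state, idle_period_s):
    """Energy in mJ over an idle period when parking in `state`.

    Returns None if the round-trip transition does not fit in the period.
    """
    if state == "RUN":
        return POWER_MW["RUN"] * idle_period_s
    enter, leave = TRANSITION_S[state]
    overhead = enter + leave
    if overhead > idle_period_s:
        return None
    # Assumption: transitions are charged at RUN power.
    return POWER_MW["RUN"] * overhead + POWER_MW[state] * (idle_period_s - overhead)


def best_state(idle_period_s):
    """Pick the state with the lowest energy for a known idle period."""
    options = {}
    for state in POWER_MW:
        energy = energy_mj(state, idle_period_s)
        if energy is not None:
            options[state] = energy
    return min(options, key=options.get)
```

Under these assumptions a 1 ms idle period is best spent in IDLE, because the 160 ms wake-up from SLEEP does not even fit, while a 10 s idle period favors SLEEP. A real power manager does not know the idle period in advance and has to predict it from the observed workload.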
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown: clock, memory, control, I/O, datapath. Dynamic instruction statistics: arithmetic op, control flow, data move, compare op, logical op, others.]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First Generation 1971-78
  - Behind the power curve
• Second Generation 1979-85
  - Becoming "real" computers
• Third Generation 1985-89
  - Challenging the "establishment"
• Fourth Generation 1990-
  - Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation 1971-78
Getting enough bits and transistors
Transistor counts < 50,000
Performance < 0.5 MIPS
Architecture: 8-16 bits
- Narrow datapaths (= slow performance)
- Awkward architectures
- Assembly language + some BASIC
Processors
- Intel 4004, 8008, 8080, 8086
- Zilog Z-80
- Motorola 6800, MOS Technology 6502
Intel 4004
First general-purpose single-chip microprocessor
Shipped in 1971
8-bit architecture, 4-bit implementation
2,300 transistors
Performance < 0.1 MIPS
8008: 8-bit implementation in 1972
- 3,500 transistors
- First microprocessor-based computer (Micral)
  Targeted at laboratory instrumentation
  Mostly sold in Europe
Intel 8080
Intel's first 16-bit architecture
- Delivered in 1974
- 4,800 transistors
- Performance < 0.2 MIPS
Used in Altair 8800 system
- Kit form (advertised in Popular Electronics) in 1975
  $297, or $395 with case
  256 bytes of memory, expandable to 64K
  Keyboard and floppy
  100-line bus becomes S-100, the first microcomputer bus
- Gates & Allen write BASIC
- Wozniak builds one
Homebrew Computer Club
Intel 8086
Introduced in 1978
- Performance < 0.5 MIPS
New 16-bit architecture
- "Assembly language" compatible with 8080
- 29,000 transistors
- Includes memory protection, support for FP coprocessor
In 1981, IBM introduces the PC
- Based on 8088, the 8-bit bus version of 8086
Second Generation 1979-85
Becoming "real" computers
- First 32-bit architecture (68000)
- First virtual memory support
- Workstations, Macs, and PCs based on microprocessors
Transistors > 50,000
Performance <= 1 MIPS
Processors
- Motorola 68000, 68020
- Intel 80286, 80386
Motorola 68000
Major architectural step in microprocessors
- First 32-bit architecture (initial 16-bit implementation)
- First flat 32-bit address; support for paging
- General-purpose register architecture, loosely based on PDP-11
First implementation in 1979
- 68,000 transistors
- < 1 MIPS
Used in
- Apple Mac
- Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the "establishment"
- Microprocessors surpass minicomputers in performance, rival mainframes
- Implementation technology of choice: all new architectures are microprocessors
- RISC architecture techniques take hold
Transistors < 500K
Performance > 5 MIPS
Processors
- MIPS R2000, R3000
- Sun SPARC
- HP PA-RISC
MIPS R2000
Several firsts
- First RISC microprocessor
- First microprocessor to provide integrated support for instruction & data cache
- First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
- 125,000 transistors
- 5-8 MIPS
Fourth Generation 1990-
Architectural and performance leadership
- First 64-bit architecture
- First multiple-issue machine
- First multilevel caches
Transistors > 1M
Clock rates > 100 MHz
Performance > 50 MIPS
Processors
- Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
Generation 4.5: same basic approach, but faster clock rates & wider issue
- Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
Increase performance at 1.6x per year
- True from 1985 to present
Combination of technology and architectural enhancements
- Technology provides faster transistors and more of them
- Faster transistors lead to high clock rates
- More transistors
Architectural ideas turn transistors into performance
- Responsible for about half the yearly performance growth
Two key architectural directions
- Sophisticated memory hierarchies
- Exploiting instruction-level parallelism
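To get a feel for what a steady yearly improvement compounds to, the following sketch does the arithmetic. The function names are hypothetical, and reading "about half the yearly growth" as an equal multiplicative split between technology and architecture is an assumption, one of several possible interpretations.

```python
def cumulative_speedup(rate_per_year, years):
    """Compound a constant yearly performance ratio over a number of years."""
    return rate_per_year ** years


def equal_split_rate(rate_per_year):
    """Yearly ratio per source if two sources contribute equally (assumed)."""
    return rate_per_year ** 0.5
```

At 1.6x per year, ten years compound to roughly a 110-fold improvement (`cumulative_speedup(1.6, 10)`), and under the equal-split assumption technology and architecture would each contribute about 1.26x per year.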
Memory Hierarchies
Caches hide latency of DRAM and increase BW
- CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
- On-chip: from 128 bytes (1984) to 100K+ bytes
- Multilevel caches: add another level of caching
  First multilevel cache: 1986
  Secondary cache sizes today: 128 KB to 4-16 MB
Trend 2: Advances in caching techniques
- Reduce or hide cache miss latencies
  early restart after cache miss (1992)
  nonblocking caches: continue during a cache miss (1994)
- Cache-aware combos: computers, compilers, code writers
  prefetching: instruction to bring data into cache early
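How a second cache level hides DRAM latency can be illustrated with the standard average memory access time (AMAT) calculation, which is brought in here as a supplementary tool and is not part of the slides. The latencies in the usage note (2, 5, and 50 cycles) are borrowed from the superscalar column of the characteristics table in this deck; the miss rates are purely hypothetical.

```python
def amat(l1_hit_cycles, l1_miss_rate, l2_hit_cycles, l2_miss_rate, memory_cycles):
    """Average memory access time, in cycles, for a two-level cache hierarchy."""
    l2_penalty = l2_hit_cycles + l2_miss_rate * memory_cycles
    return l1_hit_cycles + l1_miss_rate * l2_penalty


def amat_single_level(l1_hit_cycles, l1_miss_rate, memory_cycles):
    """Average memory access time when every L1 miss goes straight to memory."""
    return l1_hit_cycles + l1_miss_rate * memory_cycles
```

With an assumed 5% primary miss rate and 20% secondary miss rate, `amat(2, 0.05, 5, 0.20, 50)` is 2.75 cycles, compared with 4.5 cycles from `amat_single_level(2, 0.05, 50)` without the secondary cache.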
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
- Overlapping execution in a pipeline
- Issuing multiple instructions per clock
  superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
- determine parallelism dynamically
- execute instructions out of order
- speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded a 20-year path
MIPS R4000MIPS R4000 First 64-bit architecture Integrated caches
ndash On-chipndash Support for off-chip secondary
cache Integrated floating point Implemented in 1991
ndash Deep pipelinendash 14M transistorsndash Initially 100MHzndash gt 50 MIPS
Intel i860Intel i860
First multiple issue microprocessor ndash 2 instructionsclockndash Dual issue mode ndash Novel push pipelinendash Novel cache bypass
Implemented in 1991ndash 13M transistorsndash 50 mips
Used primarily as attached processor (eg graphics)
MIPS R10000MIPS R10000
First speculative processorndash Instruction scheduled and
executed out-of-orderndash Up to 4 instructions can
complete per clockndash Window of 32 instructions
(up to 32 in-flight)ndash Maintain precise state by
completing instructions in order
Implemented in 1996ndash 68M transistorsndash 200 MHz
Intel IA-64 and ItaniumIntel IA-64 and Itanium
EPIC architecturendash Use compiler centric approach
while avoiding disadvantagesndash Parallelism demarcated by the
compilerndash Many special instruction amp
features for exploiting ILP in the compiler
Itaniumndash First implementation (2001) ndash 25 M transistorsndash 800 MHzndash 130 Watts
Breakdown of tasks between Breakdown of tasks between compiler and runtime hardwarecompiler and runtime hardware
Todayrsquos Uniprocessor ILP Todayrsquos Uniprocessor ILP MenuMenu
Software Techniquesndash Static schedulingndash Static issue (ie VLIW)ndash Static branch predictionndash Aliaspointer analysisndash Static speculation
Hardware Techniquesndash Dynamic schedulingndash Dynamic issue (ie superscalar)ndash Dynamic branch predictionndash Dynamic disambiguationndash Dynamic speculation
Wide variety of approaches both hardware and compiler intensive
Lower hardware complexity
More longer range analysis
More machine dependence
Lower hardware complexity
More longer range analysis
More machine dependence
More stable performance
Higher complexity
Potential clock rate impact
More stable performance
Higher complexity
Potential clock rate impact
No clear cut winners at the present
Big Picture--ILP and Memory Big Picture--ILP and Memory SystemsSystems
Simplepipelining
Scheduledpipelines
Multipleissue
Dynamicschedulin
g
Speculation
My view
bullNo performance wall but steeper slopes ahead
bullEasier territory is behind us
bullIndustry-research gap vanished
bullEnergy efficiency may be key limit
ILPMountai
n
Multilevelcaches amp buffers
Critical word amp early restart
Compilerprefetchin
g
Multipathprefetchin
g
Simplecaches
CacheMountai
n
Per
form
ance
01
1
10
100
1965 1970 1975 1980 1985 1990 1995
Supercomputers
Minicomputers
Mainframes
Microprocessors
Microprocessors today where Microprocessors today where they are and what can dothey are and what can do
Tran
sist
ors
1000
10000
100000
1000000
10000000
100000000
1970 1975 1980 1985 1990 1995 2000 2005
Bit-level parallelism Instruction-level Thread-level ()
i4004
i8008i8080
i8086
i80286
i80386
R2000
Pentium
R10000
R3000
Microprocessors where they goMicroprocessors where they go
Intel more TransistorIntel more Transistor
Intel Faster DevicesIntel Faster Devices
Number of Transistors in Number of Transistors in Intelrsquos processorsIntelrsquos processors
Higher level parallelismHigher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The more prononuced are
a) simultaneons multithreaded (SMT) processor and
b) chip multiprocessors (CMT)
MultithreadingMultithreading
Microprocessor can execute multiple operations at a time4 or 6 operations per cycle
Hard to achieve this level of parallelism from single program
Can we run multiple programs (threads) on (single) processor without much effort
Simultaneous multithreading (SMT) or Hyperthreading is a solution
Parallel Thread Sequencing ModelParallel Thread Sequencing Model
Principles of SMTPrinciples of SMT
Multithreading in todayrsquos Multithreading in todayrsquos processorsprocessors
Today many high-end microprocessors are multithreaded (eg Intel Pentium 4)
Support for 2-4 threads but expect to get only 13X improvement in throughput
Chip MultiprocessorChip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model
CMP is a simple very powerful techiniques to obtain more performance in a power-effecient manner
The idea is to put several microprocessors on a single die This type of architecture is reffered also as Multiprocessor System-on-Chip (MPSoC)
The performance of small-scale CMP scales close to linear with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model - continue - continue
CMP is an atractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications we meet in network processors multimedia hubs signal processors etc
MPSoCs are usually implemented as heterogenous systems
CMT and SMT can coexist-a CMP die can integrate several SMT microprocessors
Generic circa 2010 Generic circa 2010 MicroprocessorMicroprocessor
4 ndash 8 general-purpose processing engines on chip used to execute independent programs
Explicitly parallel programs (when possible) Speculatively parallel threads
Special-purpose processing units (eg DSP functionality)
Elaborate memory hierarchyElaborate inter-chip communication facilities
Characteristics of superscalar simultaneous Characteristics of superscalar simultaneous multithreading and chip multiprocessor architecturesmultithreading and chip multiprocessor architectures
Characteristic SuperscalarSimultaneousmultithreading
Chipmultiprocessor
Number of CPUs 1 1 8CPU issue width 12 12 2 per CPUNumber of threads 1 8 1 per CPUArchitecture registers (for integer and floating point) 32 32 per thread 32 per CPUPhysical registers (for integer and floating point 32 + 256 256 + 256 32 + 32 per CPUInstruction window size 256 256 32 per CPUBranch predictor table size (entries) 32768 32768 8x4096Return stack size 64 entries 64 entries 8x8 entriesInstruction (I) and data (D) cache organization 1x8 banks 1x8 banks 1 bankI and D cache sizes 128 kbytes 128 kbytes 16 kbytes per CPUI and D cache associativities 4-way 4-way 4-wayI and 0 cache line sizes (bytes) 32 32 32I and P cache access times (cycles) 2 2 1Secondary cache organization (Mbytes) 1x8 banks 1x8 banks 1x8 banksSecondary cache size (bytes) 8 8 8Secondary cache associativity 4-way 4-way 4-waySecondary cache line size (bytes) 32 32 32Secondary cache access time (cycles) 5 5 7Secondary cache occupancy per access (cycles) 1 1 1Memory organization (no of banks) 4 4 4Memory access time (cycles) 50 50 50Memory occupancy per access (cycles) 13 13 13
The microprocessor tomorrowThe microprocessor tomorrow When we say ldquomicroprocessorrdquo tomorrow we generally mean the shaded area of the figure
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Challenges in EducationChallenges in Education
bullChanges in curriculaChanges in curriculabullFundamentalsFundamentalsbullA sort of the challenge we should acceptA sort of the challenge we should accept
ChalengeChalengess in Education in Education
It has often said that Where you stand depends on where you sit
In this context starting from our positions an experiences this is our view concerning the theme How shall we satisfy the long-term educational needs of engineers
How to organize a training of new How to organize a training of new engineers engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then
Is the whole system of engineering education ndash not just the undergraduate curriculum ndash organized to support todayrsquos graduate for the next 40 years
We think not on both counts
Our view amp our experienceOur view amp our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academy and industry) which admittedly has changed more rapidly than same other fields
Changes in curriculaChanges in curricula
It is almost a clicheacute to talk about change ndash so mach so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentalsWhat are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we need now - some examples
• Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
• Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
• Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
• Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
VDD, Power and Current Trend
[Figure: supply voltage [V], power per chip [W] and VDD current [A] versus year, 1998-2014.]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with the European Electronic Component Association (EECA), the Electronic Industries Association of Japan (EIAJ), the Korea Semiconductor Industry Association (KSIA) and the Taiwan Semiconductor Industry Association (TSIA). (Taken from Sakurai's ISSCC 2001 presentation.)
Power Delivery Problem (not just California)
[Figure label: Your car starter]
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits are
$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} = P_1 + P_2 + P_3 \; (+ P_4)$
where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
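To make the three-term power model concrete, the following minimal Python sketch evaluates each component separately. The function name and the operating point (activity factor, load capacitance, currents) are hypothetical values chosen only for illustration; they are not taken from the slides.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leak):
    """Return (P1, P2, P3, P_avg) in watts for the three-term CMOS power model."""
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # capacitive switching power
    p2 = i_sc * v_dd                        # short-circuit power
    p3 = i_leak * v_dd                      # leakage power
    return p1, p2, p3, p1 + p2 + p3


# Hypothetical operating point, for illustration only.
p1, p2, p3, p_avg = average_power(
    p_t=0.1, c_load=1e-9, v_dd=2.0, f_clk=100e6, i_sc=1e-3, i_leak=1e-4
)
print(f"P1={p1:.4f} W  P2={p2:.4f} W  P3={p3:.4f} W  Pavg={p_avg:.4f} W")
```

With these assumed numbers the switching term P1 dominates, which matches the "dynamic, dominant" label attached to P1.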
Research Efforts in Low-Power Design
$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
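The quadratic effect of the supply voltage can be checked with a short calculation. This is a sketch that assumes the clock frequency is held constant and that only the switching term matters; the function name and the 5.0 V to 3.3 V example are illustrative assumptions.

```python
def scaled_switching_power(p_old, v_old, v_new, c_ratio=1.0, activity_ratio=1.0):
    """Scale switching power for a new supply voltage, load and activity.

    Switching power is proportional to activity * C_L * Vdd^2 at fixed frequency.
    """
    return p_old * (v_new / v_old) ** 2 * c_ratio * activity_ratio


# Lowering Vdd from 5.0 V to 3.3 V, starting from a normalized power of 1.0.
p_33 = scaled_switching_power(1.0, v_old=5.0, v_new=3.3)
# Same voltage change, with the load capacitance also halved.
p_33_half_c = scaled_switching_power(1.0, v_old=5.0, v_new=3.3, c_ratio=0.5)
print(f"{p_33:.4f} {p_33_half_c:.4f}")
```

Under these assumptions a 34% voltage reduction removes more than half of the switching power, while halving the capacitance only contributes a linear factor.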
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] versus supply voltage [V].]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is
decreasing the activity of some parts within the VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product ptCL is called the average switched capacitance per cycle. This capacitance can be reduced at the system, architectural, RTL, circuit or technology level.
Low Power and Low Energy System Design
(figure annotation: higher impact, more options)
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology.
Multiple Frequency on the Chip as a Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL distributes fCLK to four local clock blocks (PLL 1/DLL 1 to PLL 4/DLL 4), which produce the local frequencies f11, f12, f21, f31, f32 and f41.]
Energy Minimization Using Multiple Frequency
[Figure: two clock generation schemes. PLL based: phase detector, current pump (up/down), loop filter, VCO and divider by N, with CLKREF and CLKFB inputs, a regulated voltage, and clock distribution & frequency multiplier feeding the logic of the digital system. DLL based: phase detector, current pump (up/down), loop filter and a voltage-controlled delay line (VCDL, delay TVCDL) built from delay cells DC1, DC2, ..., DCn under a control input, producing f1, 2f1, ..., nf1 for the digital system.]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating and clock distribution. Clock gating: an enable signal, captured by a DFF or latch, is combined with the clock to produce the gated clock that drives the target flip-flops. Clock distribution: a PLL (clock generator) distributes Clk to modules A, B, C and D, which are activated or deactivated through Enable_A, Enable_B and Enable_C.]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
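The gating idea can be illustrated with a small behavioural simulation. This is only a sketch of one common latch-based arrangement (enable sampled while the clock is low, then ANDed with the clock); the function names and the sample waveforms are assumptions, not the circuit from the slide.

```python
def gated_clock(clock, enable):
    """Latch-based clock gate over sampled 0/1 waveforms.

    The enable is passed through a latch that is transparent while the
    clock is low and holds its value while the clock is high; the latch
    output is then ANDed with the clock.
    """
    latched = 0
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en          # latch is transparent in the low phase
        out.append(clk & latched)
    return out


def rising_edges(signal):
    """Count 0 -> 1 transitions, i.e. clock edges seen by the flip-flops."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)


clock = [0, 1] * 8                 # eight clock cycles
enable = [1] * 6 + [0] * 10        # module is needed for three cycles only
gated = gated_clock(clock, enable)
print(rising_edges(clock), rising_edges(gated))
```

In this example the gated module sees three of the eight clock edges, so its flip-flops and local clock wiring switch correspondingly less often.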
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip, as a less aggressive approach, is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
[Figure: power manager and power state machine. Power manager: an OBSERVER collects workload information (observations) from the SYSTEM and a CONTROLLER issues commands to it. Power state machine: RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt) and SLEEP (P = 0.16 mW, wait for wake-up event); the transition arcs are labelled ~10 µs, ~90 µs and 160 ms.]
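A rough feel for what the power states buy can be obtained by summing power times residence time. The sketch below uses the state powers from the power state machine (400 mW, 50 mW, 0.16 mW); the schedules are hypothetical and the energy and time cost of the state transitions is deliberately ignored.

```python
# State powers in watts, as given for the power state machine.
STATE_POWER_W = {"RUN": 0.400, "IDLE": 0.050, "SLEEP": 0.00016}


def energy_joules(schedule):
    """Sum power * time over a list of (state, seconds) pairs.

    Simplifying assumption: transitions between states are free.
    """
    return sum(STATE_POWER_W[state] * seconds for state, seconds in schedule)


# Hypothetical workload: 1 s of work followed by 9 s with nothing to do.
e_idle = energy_joules([("RUN", 1.0), ("IDLE", 9.0)])
e_sleep = energy_joules([("RUN", 1.0), ("SLEEP", 9.0)])
print(f"idle: {e_idle:.5f} J  sleep: {e_sleep:.5f} J")
```

A real power manager also has to weigh the wake-up latency, so deep states only pay off when the predicted idle period is long compared with the transition time.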
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown: clock, memory, control, I/O and datapath. Dynamic instruction statistics: arithmetic op, compare op, logical op, data move, control flow and others.]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline - Microprocessors' Generations
• First Generation 1971-78: behind the power curve
• Second Generation 1979-85: becoming "real" computers
• Third Generation 1985-89: challenging the "establishment"
• Fourth Generation 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3,500 transistors
  - First microprocessor-based computer (Micral)
    targeted at laboratory instrumentation; mostly sold in Europe
Intel 8080
• 8-bit architecture with 16-bit addressing
  - Delivered in 1974
  - 4,800 transistors
  - Performance < 0.2 MIPS
• Used in the Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case; 256 bytes of memory, expandable to 64K; keyboard and floppy; 100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one (Homebrew Computer Club)
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981, IBM introduces the PC
  - Based on the 8088, an 8-bit bus version of the 8086
Second Generation 1979-85
• Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture, initial 16-bit implementation
  - First flat 32-bit address; support for paging
  - General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5 - same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985 to the present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors: architectural ideas turn transistors into performance (responsible for about half the yearly performance growth)
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction-level parallelism
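To see what a sustained growth rate of 1.6x per year implies, the compounding can be worked out directly. This is a small arithmetic sketch; the 15-year horizon and the function names are illustrative assumptions.

```python
import math


def cumulative_speedup(rate_per_year, years):
    """Total speedup after compounding a yearly growth factor."""
    return rate_per_year ** years


def doubling_time_years(rate_per_year):
    """Years needed for performance to double at the given yearly factor."""
    return math.log(2) / math.log(rate_per_year)


speedup_15y = cumulative_speedup(1.6, 15)     # e.g. 1985 to 2000
doubling = doubling_time_years(1.6)
print(f"speedup over 15 years: {speedup_15y:.0f}x, doubling every {doubling:.2f} years")
```

At this rate performance doubles roughly every year and a half, and fifteen years of compounding gives a speedup of more than three orders of magnitude.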
Memory Hierarchies
• Caches hide the latency of DRAM and increase bandwidth
  - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches add another level of caching
    First multilevel cache: 1986. Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
    early restart after cache miss (1992); nonblocking caches continue during a cache miss (1994)
  - Cache-aware combos: computers, compilers, code writers; prefetching instructions bring data into the cache early
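The benefit of a multilevel cache can be sketched with the usual average memory access time (AMAT) calculation. The latencies below (2, 5 and 50 cycles) follow the superscalar column of the comparison table elsewhere in this deck; the miss rates are hypothetical values chosen only for illustration.

```python
def amat(l1_time, l1_miss_rate, l2_time, l2_miss_rate, mem_time):
    """Average memory access time in cycles for a two-level cache."""
    return l1_time + l1_miss_rate * (l2_time + l2_miss_rate * mem_time)


# Assumed miss rates: 5% in the primary cache, 20% of those also miss in L2.
with_l2 = amat(l1_time=2, l1_miss_rate=0.05, l2_time=5, l2_miss_rate=0.20, mem_time=50)
# Same primary cache with no secondary cache: every L1 miss goes to memory.
without_l2 = amat(l1_time=2, l1_miss_rate=0.05, l2_time=0, l2_miss_rate=1.0, mem_time=50)
print(f"with L2: {with_l2:.2f} cycles, without L2: {without_l2:.2f} cycles")
```

Under these assumptions the secondary cache cuts the average access time from 4.5 to 2.75 cycles, which is the latency-hiding effect the slide refers to.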
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock
    superscalar: uses dynamic issue decision (HW driven); VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instruction/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  - determine parallelism dynamically
  - execute instructions out of order
  - speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded a 20-year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  - On-chip
  - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as an attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out of order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in flight)
  - Maintains precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
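The transistor counts quoted for the individual processors can be turned into an average doubling period. The sketch below takes the Intel 4004 (2,300 transistors, 1971) and the MIPS R10000 as endpoints, reading the R10000 figure as 6.8M transistors in 1996; the function name is an assumption for illustration.

```python
import math


def doubling_period_years(count_start, year_start, count_end, year_end):
    """Average time for the transistor count to double between two chips."""
    doublings = math.log2(count_end / count_start)
    return (year_end - year_start) / doublings


period = doubling_period_years(2300, 1971, 6.8e6, 1996)
print(f"about {period:.2f} years ({period * 12:.0f} months) per doubling")
```

Between these two endpoints the count doubled on average a little less often than every two years, which is in line with the doubling periods usually quoted for Moore's law.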
Intel IA-64 and Itanium
• EPIC architecture
  - Uses a compiler-centric approach while avoiding its disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software Techniques
  - Static scheduling
  - Static issue (i.e. VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
Hardware Techniques
  - Dynamic scheduling
  - Dynamic issue (i.e. superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
Wide variety of approaches, both hardware and compiler intensive
Software techniques: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware techniques: more stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present
Big Picture - ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970-2005, for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000, with successive eras of bit-level, instruction-level and thread-level (?) parallelism.]
Microprocessors: where they go
Intel: More Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
• A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
• It is hard to achieve this level of parallelism from a single program
• Can we run multiple programs (threads) on a (single) processor without much effort?
• Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
• Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
• Support for 2-4 threads, but expect only about a 1.3X improvement in throughput
Chip Multiprocessor
• Several processor cores on one die
• Shared L2 caches
• Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
• CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner
• The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
• The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model - continued
• CMP is an attractive option to use when moving to a new process technology such as SoC
• Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
• MPSoCs are usually implemented as heterogeneous systems
• CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
• 4 - 8 general-purpose processing engines on chip, used to execute
  - independent programs
  - explicitly parallel programs (when possible)
  - speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Challenges in EducationChallenges in Education
bullChanges in curriculaChanges in curriculabullFundamentalsFundamentalsbullA sort of the challenge we should acceptA sort of the challenge we should accept
ChalengeChalengess in Education in Education
It has often said that Where you stand depends on where you sit
In this context starting from our positions an experiences this is our view concerning the theme How shall we satisfy the long-term educational needs of engineers
How to organize a training of new How to organize a training of new engineers engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then
Is the whole system of engineering education ndash not just the undergraduate curriculum ndash organized to support todayrsquos graduate for the next 40 years
We think not on both counts
Our view amp our experienceOur view amp our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academy and industry) which admittedly has changed more rapidly than same other fields
Changes in curriculaChanges in curricula
It is almost a clicheacute to talk about change ndash so mach so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentalsWhat are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we What kinds of fundamentals we need now ndash some examplesneed now ndash some examples
Information technology (IT) will be embedded in virtually engineered product and process in the future ndash ie the design space for all engineers will include ITDiscrete mathematics not continuous math is the underpinning of ITIt is a new fundamental
Biological materials and process are a bit behind IT in their impact on engineering but they a closing fastThus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of FundamentalsKinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fieldsMore knowledge of the full spectrum will be the fundamental
Engineering is global and is performed in a holistic business contextThe engineer must design under constraints that include global cultural and business contexts and so must understand themThey two are new fundamentals
How to add these new fundamentalsHow to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of What will the character and essence of electrical and computer engineering electrical and computer engineering
education look like in the future education look like in the future
It is difficult to predict the future with any accuracy but it is safe to say that
Web-based teachingdistance learningelectronic books and
interactive learning environments
will play increasingly significant roles in shaping what we teach how we teach and how students learn
A sort of challenge we should acceptA sort of challenge we should accept
During one visit at our faculty Prof Krishna Shenai from University of Illinois of Chicago director of Micro Systems Research Center says to us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Power Delivery Problem (not just Power Delivery Problem (not just California)California)
Your carstarter
Power Consumption New Power Consumption New Dimension in DesignDimension in Design
Sources of Power Sources of Power ConsumptionConsumption
The three major sources of power consumption in digital CMOS circuits are
21 2 3avg t L dd clk sc dd leakage ddP p C V f I V I V P P P
where
P1 ndash capacitive switching power (dynamic - dominant)
P2 ndash short circuit power (dynamic)
P3 ndash leakage current power (static)
P4 ndash static power dissipation (minor)
+ P4
Research Efforts in Low-Power DesignResearch Efforts in Low-Power Design
Psw = pt CL V2
dd fCLK
Reduce Switching ActivitybullConditional clockbullConditional prechargebullSwitching-off inactive blocksbullConditional execution
Run it slowerbullUse parallelismbullLess pipeline stagesbullUse double-edge flip-flop
Technology scalingbullThe highest winbullThresholds should scalebullLeakage starts to bytebullDynamic voltage scaling
Reduce the active loadbullMinimize the circuitsbullUse more efficient designbullCharge recycling bullMore efficient layout
Reducing the Power Reducing the Power DissipationDissipation
The power dissipation can be minimized by reducing
supply voltageload capacitanceswitching activity
ndash Reducing the supply voltage brings a quadratic improvement
ndash Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Amount of Reducing the Power DissipationDissipation
Gate Delay and Power Dissipation Gate Delay and Power Dissipation in Term of Supply Voltagein Term of Supply Voltage
06 30 50
Supply voltage [ V ]
Ga
te d
elay
[n
s]
(no
rma
lize
d)
Po
we
r d
issi
pat
ion
[ W
](n
orm
ali
zed
)
1
1 02 5
1
bull Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
bull The main interest of many researches is now oriented towards lowering the energy dissipation of these systems while still maintaining the high-throughput in real time processing
Needs for Low-Power
Low-Power Design Low-Power Design Techniques Techniques
The basic idea is
Decreasing activity of the some parts within VLSI IC
The term power manager refer to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low active conditions for various parts of the circuit and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-active components
a) Reduction in fCLK is an option acceptable when
some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way for
power reduction since the power is proportional to the square of Vdd The problem with reducing
Vdd is that it leads to an increase in circuit delay
c) The product ptCL is called the average switched
capacitance per cycle and the main directions for reducing this capacitance are done at system- architectural- RTL- circuit- or technology level
General Approaches to Reduce PowerGeneral Approaches to Reduce Power
Low Power and Low Energy System DesignLow Power and Low Energy System Design
higher impactmore options
AlgorithmLevel
ArchitectureLevel
CircuitLevel
Process DeviceLevel
SystemLevel Design partitioning Power Down
Complexity Concurrency LocalityRegularity Data representation
Voltage scaling ParallelismInstruction set Signal correlations
Transistor sizing Logic optimizationActivity Driven Power Down Low-swing logic Adiabatic switching
Threshold Reduction Multi-threshold
The design of low power circuits can be tackled at different levels from system to technology
Multiple Frequency on the Chip as Multiple Frequency on the Chip as Technique to Reduce PowerTechnique to Reduce Power
Less aggressive approach is which attracts more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
f11 f12
PLL 1 DLL 1
PLL
fCLK
f21 f21
PLL 2 DLL 2
f31 f32
PLL 3 DLL 3
f41 f41
PLL 4 DLL 4
f11 f12
PLL 1 DLL 1
PLL
fCLK
f21 f21
PLL 2 DLL 2
f31 f32
PLL 3 DLL 3
f41 f41
PLL 4 DLL 4
Energy Minimization Using Energy Minimization Using Multiple FrequencyMultiple Frequency
phasedetector
curentpump
divider by N
up
down
CLKREF
CLKFB
VCO
loopfilter
digital system
regulated voltage
clock distribution ampfrequency multiplier
logic
phasedetector
curentpump
up
down
CLKREF
CLKFBloopfilter
digital system
f1 2f1 nf1
control
VCDL
TVCDL
DC1 DC2 DCn
in out
PLL based
DLL based
Clock GatingClock Gating amp Clock Distribution as amp Clock Distribution as Techniques to Reduce PowerTechniques to Reduce Power
DFF
D
C
Q
enable
clock
gated-clock
target flip-flops
latch
clock
enable
gated-clock
activated deactivated
D C
BA
Enable_BEnable_A
Enable_C
PLL(Clk generator)
Clk
Clock distributionClock gating
- The use of gated clock is the most common approach to reduce energy Unused modules are turned off by suppressing the clock to the module
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
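A minimal sketch of the assignment rule described above, under stated assumptions: the module names, delays, time budget, the two voltage levels, and the fixed slowdown factor at the lower voltage are all hypothetical, and every module is assumed to switch the same capacitance so that energy scales with the square of its supply voltage.

```python
def assign_voltages(modules, v_high, v_low, slowdown_at_v_low):
    """Give each module the low supply if it still meets its time budget.

    modules maps a name to (delay at v_high, time budget), in the same unit.
    slowdown_at_v_low is an assumed delay multiplier when running at v_low.
    """
    plan = {}
    for name, (delay_at_v_high, time_budget) in modules.items():
        fits = delay_at_v_high * slowdown_at_v_low <= time_budget
        plan[name] = v_low if fits else v_high
    return plan


def relative_energy(plan, v_high):
    """Energy relative to running everything at v_high (equal capacitance assumed)."""
    return sum((v / v_high) ** 2 for v in plan.values()) / len(plan)


# Hypothetical design: the ALU sits on the critical path, the others have slack.
modules = {"alu": (10.0, 10.0), "decoder": (4.0, 10.0), "io": (6.0, 10.0)}
plan = assign_voltages(modules, v_high=3.3, v_low=2.0, slowdown_at_v_low=1.5)
energy = relative_energy(plan, v_high=3.3)

print(plan)
print(round(energy, 3))
```

Only the critical-path module keeps the high supply here; the two modules with slack absorb the slowdown and account for the energy saving.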
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure, two panels. Power state machine: RUN (P = 400 mW), IDLE (P = 50 mW), and SLEEP (P = 0.16 mW); IDLE is entered on "wait for interrupt" and SLEEP on "wait for wake-up event"; transition times are about 10 µs, about 90 µs, and 160 ms. Power manager: an OBSERVER gathers workload information (observations) from the SYSTEM, and a CONTROLLER sends commands back to it.]
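To see why the power state machine matters, the average power of a workload can be estimated from the time spent in each state. The sketch below uses the three power levels from the figure (reading the sleep figure as 0.16 mW); the schedule is hypothetical, and transition times and transition energy are ignored as a simplification.

```python
# Power per state in mW, taken from the power state machine figure.
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(schedule):
    """Time-weighted average power for a list of (state, duration) pairs.

    Simplification: state transitions are treated as instantaneous and free.
    """
    total_time = sum(duration for _, duration in schedule)
    energy = sum(POWER_MW[state] * duration for state, duration in schedule)
    return energy / total_time


# Hypothetical workload: busy 10% of the time, idle 30%, asleep 60%.
schedule = [("RUN", 1.0), ("IDLE", 3.0), ("SLEEP", 6.0)]
managed = average_power_mw(schedule)
always_on = average_power_mw([("RUN", 10.0)])

print(round(managed, 3), always_on)
```

A real power manager also has to weigh the wake-up latency of SLEEP against the expected length of the idle period, which this sketch leaves out.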
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure, two pie charts. Power breakdown: clock, memory, control, I/O, datapath. Dynamic instruction statistics: compare op, logical op, data move, control flow, arithmetic op, others.]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline - Microprocessors' Generations
• First Generation, 1971-78: behind the power curve
• Second Generation, 1979-85: becoming "real" computers
• Third Generation, 1985-89: challenging the "establishment"
• Fourth Generation, 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation, 1971-78
Getting enough bits and transistors
Transistor counts < 50,000
Performance < 0.5 MIPS
Architecture: 8-16 bits
- Narrow datapaths (= slow performance)
- Awkward architectures
- Assembly language + some BASIC
Processors
- Intel 4004, 8008, 8080, 8086
- Zilog Z-80
- Motorola 6800, MOS Technology 6502
Intel 4004
First general-purpose, single-chip microprocessor
Shipped in 1971
8-bit architecture, 4-bit implementation
2,300 transistors
Performance < 0.1 MIPS
8008: 8-bit implementation in 1972
- 3,500 transistors
- First microprocessor-based computer (Micral)
Targeted at laboratory instrumentation
Mostly sold in Europe
Intel 8080
8-bit architecture with a 16-bit address bus
- Delivered in 1974
- 4,800 transistors
- Performance < 0.2 MIPS
Used in the Altair 8800 system
- Kit form (advertised in Popular Electronics) in 1975
$297, or $395 with case
256 bytes of memory, expandable to 64K
Keyboard and floppy
100-line bus becomes S-100, the first microcomputer bus
- Gates & Allen write BASIC
- Wozniak builds one
Homebrew Computer Club
Intel 8086
Introduced in 1978
- Performance < 0.5 MIPS
New 16-bit architecture
- "Assembly language" compatible with 8080
- 29,000 transistors
- Includes memory protection, support for FP coprocessor
In 1981 IBM introduces the PC
- Based on the 8088, the 8-bit bus version of the 8086
Second Generation, 1979-85
Becoming "real" computers
- First 32-bit architecture (68000)
- First virtual memory support
- Workstations, Macs, and PCs based on microprocessors
Transistors > 50,000
Performance <= 1 MIPS
Processors
- Motorola 68000, 68020
- Intel 80286, 80386
Motorola 68000
Major architectural step in microprocessors
- First 32-bit architecture
Initial 16-bit implementation
- First flat 32-bit address
Support for paging
- General-purpose register architecture
Loosely based on PDP-11
First implementation in 1979
- 68,000 transistors
- < 1 MIPS
Used in
- Apple Mac
- Sun, Silicon Graphics & Apollo workstations
Third Generation, 1985-89
Challenging the "establishment"
- Microprocessors surpass minicomputers in performance, rival mainframes
- Implementation technology of choice: all new architectures are microprocessors
- RISC architecture techniques take hold
Transistors < 500K
Performance > 5 MIPS
Processors
- MIPS R2000, R3000
- Sun SPARC
- HP PA-RISC
MIPS R2000
Several firsts
- First RISC microprocessor
- First microprocessor to provide integrated support for instruction & data cache
- First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
- 125,000 transistors
- 5-8 MIPS
Fourth Generation, 1990-
Architectural and performance leadership
- First 64-bit architecture
- First multiple-issue machine
- First multilevel caches
Transistors > 1M
Clock rates > 100 MHz
Performance > 50 MIPS
Processors
- Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
Generation 4.5: same basic approach, but faster clock rates & wider issue
- Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
Increase performance at 1.6x per year
- True from 1985-present
Combination of technology and architectural enhancements
- Technology provides faster transistors and more of them
- Faster transistors lead to high clock rates
- More transistors
Architectural ideas turn transistors into performance
- Responsible for about half the yearly performance growth
Two key architectural directions
- Sophisticated memory hierarchies
- Exploiting instruction-level parallelism
Memory Hierarchies
Caches hide latency of DRAM and increase BW
- CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
- On-chip: from 128 bytes (1984) to 100K+ bytes
- Multilevel caches add another level of caching
First multilevel cache: 1986
Secondary cache sizes today: 128 KB to 4-16 MB
Trend 2: Advances in caching techniques
- Reduce or hide cache miss latencies
Early restart after cache miss (1992)
Nonblocking caches continue during a cache miss (1994)
- Cache-aware combos: computers, compilers, code writers
Prefetching: instruction to bring data into cache early
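How much a second cache level hides DRAM latency can be estimated with the usual average-access-time formula. In the sketch below the cycle counts (2, 5, and 50) are borrowed from the architecture comparison table later in this deck, and the 5-cycle secondary access is treated as an added penalty on a first-level miss; the two miss rates are purely assumed values chosen for illustration.

```python
def average_access_time(l1_cycles, l1_miss_rate, l2_cycles, l2_miss_rate, memory_cycles):
    """Average memory access time in cycles for a two-level cache hierarchy."""
    l2_penalty = l2_cycles + l2_miss_rate * memory_cycles
    return l1_cycles + l1_miss_rate * l2_penalty


L1_CYCLES, L2_CYCLES, MEMORY_CYCLES = 2, 5, 50
L1_MISS_RATE, L2_MISS_RATE = 0.05, 0.20  # assumed, not taken from the slides

with_l2 = average_access_time(L1_CYCLES, L1_MISS_RATE, L2_CYCLES, L2_MISS_RATE, MEMORY_CYCLES)
# Without a secondary cache every first-level miss goes straight to memory.
without_l2 = L1_CYCLES + L1_MISS_RATE * MEMORY_CYCLES

print(round(with_l2, 2), round(without_l2, 2))
```

With these assumed miss rates the secondary cache brings the average access from 4.5 cycles down to 2.75 cycles.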
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
- Overlapping execution in a pipeline
- Issuing multiple instructions per clock
Superscalar: uses dynamic issue decision (HW driven)
VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
- determine parallelism dynamically
- execute instructions out-of-order
- speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded a 20-year path
MIPS R4000
First 64-bit architecture
Integrated caches
- On-chip
- Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
- Deep pipeline
- 1.4M transistors
- Initially 100 MHz
- > 50 MIPS
Intel i860
First multiple-issue microprocessor
- 2 instructions/clock
- Dual issue mode
- Novel push pipeline
- Novel cache bypass
Implemented in 1991
- 1.3M transistors
- 50 MIPS
Used primarily as attached processor (e.g., graphics)
MIPS R10000
First speculative processor
- Instructions scheduled and executed out-of-order
- Up to 4 instructions can complete per clock
- Window of 32 instructions (up to 32 in-flight)
- Maintain precise state by completing instructions in order
Implemented in 1996
- 6.8M transistors
- 200 MHz
Intel IA-64 and Itanium
EPIC architecture
- Use compiler-centric approach while avoiding disadvantages
- Parallelism demarcated by the compiler
- Many special instructions & features for exploiting ILP in the compiler
Itanium
- First implementation (2001)
- 25M transistors
- 800 MHz
- 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software Techniques
- Static scheduling
- Static issue (i.e., VLIW)
- Static branch prediction
- Alias/pointer analysis
- Static speculation
Hardware Techniques
- Dynamic scheduling
- Dynamic issue (i.e., superscalar)
- Dynamic branch prediction
- Dynamic disambiguation
- Dynamic speculation
Wide variety of approaches, both hardware and compiler intensive
Software techniques: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware techniques: more stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two mountains. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers, and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970-2005, with the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000 marked; eras labeled bit-level parallelism, instruction-level, and thread-level (?).]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a (single) processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
They support 2-4 threads, but expect only a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa-2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8 x 4096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. These, too, are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that Web-based teaching, distance learning, electronic books, and interactive learning environments will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Power Consumption: New Dimension in Design
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits, plus a minor static term, are
\[ P_{avg} = \underbrace{p_t C_L V_{dd}^2 f_{clk}}_{P_1} + \underbrace{I_{sc} V_{dd}}_{P_2} + \underbrace{I_{leakage} V_{dd}}_{P_3} + P_4 \]
where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
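As a quick numerical illustration of the power terms just listed, the sketch below evaluates each component separately. All operating values (activity factor, load capacitance, supply voltage, clock frequency, and the two currents) are hypothetical and chosen only to show why the switching term usually dominates.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leakage, p_static=0.0):
    """Return the power components in watts as a dictionary.

    P1 = p_t * C_L * Vdd^2 * f_clk  (capacitive switching)
    P2 = I_sc * Vdd                 (short circuit)
    P3 = I_leakage * Vdd            (leakage)
    P4 = p_static                   (minor static dissipation)
    """
    parts = {
        "P1": p_t * c_load * v_dd ** 2 * f_clk,
        "P2": i_sc * v_dd,
        "P3": i_leakage * v_dd,
        "P4": p_static,
    }
    parts["total"] = sum(parts.values())
    return parts


# Hypothetical operating point: 1 nF switched load, 2.5 V, 100 MHz.
power = average_power(p_t=0.2, c_load=1e-9, v_dd=2.5, f_clk=100e6,
                      i_sc=2e-3, i_leakage=1e-4)

for name, watts in power.items():
    print(name, round(watts, 5))
```

For this made-up operating point the switching term accounts for about 96% of the total.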
Research Efforts in Low-Power Design
\[ P_{sw} = p_t C_L V_{dd}^2 f_{CLK} \]
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
- Reducing the supply voltage brings a quadratic improvement
- Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reduction in Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] plotted against supply voltage [V], with the voltage axis marked at 0.6, 3.0, and 5.0.]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is
decreasing the activity of some parts within a VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product ptCL is called the average switched capacitance per cycle, and it can be reduced at the system, architectural, RTL, circuit, or technology level.
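The trade-off in item b) can be made concrete with a small calculation. Switching power follows the square of Vdd, as stated above. For the delay, the sketch assumes the commonly used alpha-power model, delay proportional to Vdd / (Vdd - Vt)^alpha; that model, the threshold voltage of 0.6 V, and alpha = 2 are assumptions for this example and are not taken from the slides.

```python
def relative_power(v_dd, v_ref):
    """Switching power at v_dd relative to v_ref (power scales with Vdd squared)."""
    return (v_dd / v_ref) ** 2


def relative_delay(v_dd, v_ref, v_t=0.6, alpha=2.0):
    """Gate delay at v_dd relative to v_ref under the assumed alpha-power model."""
    def delay(v):
        return v / (v - v_t) ** alpha
    return delay(v_dd) / delay(v_ref)


# Hypothetical move from a 5.0 V supply to a 3.0 V supply.
power_ratio = relative_power(3.0, 5.0)
delay_ratio = relative_delay(3.0, 5.0)

print(round(power_ratio, 2), round(delay_ratio, 2))
```

Under these assumptions the lower supply cuts switching power to roughly a third while about doubling the gate delay, which is why voltage reduction is usually paired with parallelism or pipelining to recover throughput.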
Low Power and Low Energy System Design
[Figure: design levels and their power-reduction options; higher levels offer higher impact and more options.]
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology.
Multiple Frequency on the Chip as Multiple Frequency on the Chip as Technique to Reduce PowerTechnique to Reduce Power
Less aggressive approach is which attracts more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
f11 f12
PLL 1 DLL 1
PLL
fCLK
f21 f21
PLL 2 DLL 2
f31 f32
PLL 3 DLL 3
f41 f41
PLL 4 DLL 4
f11 f12
PLL 1 DLL 1
PLL
fCLK
f21 f21
PLL 2 DLL 2
f31 f32
PLL 3 DLL 3
f41 f41
PLL 4 DLL 4
Energy Minimization Using Energy Minimization Using Multiple FrequencyMultiple Frequency
phasedetector
curentpump
divider by N
up
down
CLKREF
CLKFB
VCO
loopfilter
digital system
regulated voltage
clock distribution ampfrequency multiplier
logic
phasedetector
curentpump
up
down
CLKREF
CLKFBloopfilter
digital system
f1 2f1 nf1
control
VCDL
TVCDL
DC1 DC2 DCn
in out
PLL based
DLL based
Clock GatingClock Gating amp Clock Distribution as amp Clock Distribution as Techniques to Reduce PowerTechniques to Reduce Power
DFF
D
C
Q
enable
clock
gated-clock
target flip-flops
latch
clock
enable
gated-clock
activated deactivated
D C
BA
Enable_BEnable_A
Enable_C
PLL(Clk generator)
Clk
Clock distributionClock gating
- The use of gated clock is the most common approach to reduce energy Unused modules are turned off by suppressing the clock to the module
Energy Minimazation Using Energy Minimazation Using Multiple Supply VoltageMultiple Supply Voltage
bull Multiple supply voltage on the chip as less aggressive approach is attracting attention
bull This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
bull This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management System Level Dynamic Power Management as another Techniques to Reduce Poweras another Techniques to Reduce Power
Dynamic power management is design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
RUN
SLEEPIDLE~90s
~10s 160ms
Wait for interrupt Wait for wake-up event
P=400mW
~90s~10s
P=50mW P=016mW
OBSERVER CONTROLLER
Workloadinformation
Power Manager
SYSTEM
Observations Commands
Power Manager Power State Machine
Power Breakdown in High-Performance Power Breakdown in High-Performance CPU and Dynamic Instruction StatisticsCPU and Dynamic Instruction Statistics
12
16
25
43
Clock
Memory
Control IODatapath
1513
51
23
43
Compare op
Logical op
Others
Data Move
Control Flow
Arithmetic op
Power breakdown Dynamic instruction statistics
Architecture Trade-offs ndash Architecture Trade-offs ndash Reference DatapathReference Datapath
Parallel DatapathParallel Datapath
The More Parallel the BetterThe More Parallel the Better
Pipeline DatapathPipeline Datapath
Architecture Summary for a SimpleArchitecture Summary for a Simple
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Microprocessorsrsquo Microprocessorsrsquo GenerationsGenerations
bullFirst generation 1971-78
bullSecond Generation 1979-85
bullThird Generation 1985-89
bullFourth Generation 1990-
ndashBehind the power curve
ndashBecoming ldquorealrdquo computers
ndashChallenging the ldquoestablishmentrdquo
ndashArchitectural and performance leadership
The microprocessor todayThe microprocessor today When we say ldquomicroprocessorrdquo today we generally mean the shaded area of the figure
The First Generation 1971-78The First Generation 1971-78
Getting enough bits and transistors Transistor counts lt 50000 Performance lt 05 MIPS Architecture 8-16 bits
ndash Narrow datapaths (= slow performance)ndash Awkward architecturesndash Assembly language + some BASIC
Processorsndash Intel 4004 8008 8080 8086ndash Zilog Z-80ndash Motorola 6800 6502
Intel 4004Intel 4004 First general-purpose
single-chip microprocessor Shipped in 1971 8-bit architecture 4-bit
implementation 2300 transistors Performance lt 01 MIPS 8008 8-bit implementation
in 1972ndash 3500 transistorsndash First microprocessor-based
computer (Micral) Targeted at laboratory
instrumentation Mostly sold in Europe
Intel 8080Intel 8080 Intelrsquos first 16-bit architecture
ndash Delivered in 1974ndash 4800 transistorsndash Performance lt 02 MIPS
Used in Altair 8800 system ndash Kit form (advertised in Popular
Electronics) in 1975 $297 or $395 with case 256 bytes of memory
expandable to 64K Keyboard and floppy 100-line bus becomes S-100
first microcomputer busndash Gates amp Allen write BASICndash Wozniak builds one
Homebrew Computer Club
Intel 8086Intel 8086
Introduced in 1978ndash Performance lt 05 MIPS
New 16-bit architecturendash ldquoAssembly languagerdquo
compatible with 8080ndash 29000 transistorsndash Includes memory protection
support for FP coprocessor In 1981 IBM introduces
PC ndash Based on 8088--8-bit bus
version of 8086
Second Generation 1979-85Second Generation 1979-85 Becoming ldquorealrdquo computers
ndash First 32-bit architecture (68000)ndash First virtual memory supportndash Workstations Macs and PCs based on microprocessors
Transistors gt50000 Performance lt= 1 MIPS Processors
ndash Motorola 68000 68020ndash Intel 80286 80386
Motorola 68000Motorola 68000 Major architectural step in
microprocessorsndash First 32-bit architecture
initial 16-bit implementation
ndash First flat 32-bit address Support for paging
ndash General-purpose register architecture
Loosely based on PDP-11
First implementation in 1979ndash 68000 transistorsndash lt 1 MIPS
Used inndash Apple Macndash Sun Silicon Graphics amp Apollo
workstations
Third Generation 1985-89Third Generation 1985-89 Challenging the ldquoestablishmentrdquo
ndash Microprocessors surpass minicomputers in performance rival mainframes
ndash Implementation technology of choice all new architectures are microprocessors
ndash RISC architecture techniques take hold Transistors lt 500K Performance gt 5 MIPS Processors
ndash MIPS R2000 R3000ndash Sun SPARCndash HP PA-RISC
MIPS R2000MIPS R2000 Several firsts
ndash First RISC microprocessor
ndash First microprocessor to provide integrated support for instruction amp data cache
ndash First pipelined microprocessor (sustains 1 instructionclock)
Implemented in 1985ndash 125000 transistorsndash 5-8 MIPS
Fourth Generation 1990-Fourth Generation 1990- Architectural and performance leadership
ndash First 64-bit architecturendash First multiple-issue machinendash First multilevel caches
Transistors gt1M Clock ratesgt 100MHz Performance gt 50 MIPS Processors
ndash Intel i860 Pentium MIPS R4000 MIPS R1000 DEC Alpha Sun UltraSPARC HP PA-RISC PowerPC
Generation 45 ndash same basic approach but faster clock rates amp wider issuendash Alpha 21264 Pentium III amp 4 Intel Itanium
Key Architectural TrendsKey Architectural Trends Increase performance at 16x per year
ndash True from 1985-present Combination of technology and architectural
enhancementsndash Technology provides faster transistors and more of themndash Faster transistors leads to high clock ratesndash More transistors
Architectural ideas turn transistors into performancendash Responsible for about half the yearly performance growth
Two key architectural directionsndash Sophisticated memory hierarchiesndash Exploiting instruction level parallelism
Memory HierarchiesMemory Hierarchies Caches hide latency of DRAM and increase BW
ndash CPU-DRAM access gap has grown by a factor of 30-50 Trend 1 Increasingly large caches
ndash On-chip from 128 bytes (1984) to 100K+ bytesndash Multilevel caches add another level of caching
First multilevel cache1986 Secondary cache sizes today 128KB to 4-16 MB
Trend 2 Advances in caching techniquesndash Reduce or hide cache miss latencies
early restart after cache miss (1992) nonblocking caches continue during a cache miss (1994)
ndash Cache aware combos computers compilers code writers prefetching instruction to bring data into cache early
Exploiting ILPExploiting ILP ILP is the implicit parallelism among instructions Exploited by
ndash Overlapping execution in a pipeline ndash Issuing multiple instruction per clock
superscalar uses dynamic issue decision (HW driven) VLIW uses static issue decision (SW driven)
1985 simple microprocessor pipeline (1 instrclock) 1990 first static multiple issue microprocessors 1995 sophisticated dynamic schemes
ndash determine parallelism dynamicallyndash execute instructions out-of-orderndash speculative execution depending on branch prediction
ldquoOff-the-shelfrdquo ILP techniques yielded 20 year path
MIPS R4000MIPS R4000 First 64-bit architecture Integrated caches
ndash On-chipndash Support for off-chip secondary
cache Integrated floating point Implemented in 1991
ndash Deep pipelinendash 14M transistorsndash Initially 100MHzndash gt 50 MIPS
Intel i860Intel i860
First multiple issue microprocessor ndash 2 instructionsclockndash Dual issue mode ndash Novel push pipelinendash Novel cache bypass
Implemented in 1991ndash 13M transistorsndash 50 mips
Used primarily as attached processor (eg graphics)
MIPS R10000MIPS R10000
First speculative processorndash Instruction scheduled and
executed out-of-orderndash Up to 4 instructions can
complete per clockndash Window of 32 instructions
(up to 32 in-flight)ndash Maintain precise state by
completing instructions in order
Implemented in 1996ndash 68M transistorsndash 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Use compiler-centric approach while avoiding disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software Techniques
  - Static scheduling
  - Static issue (i.e., VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
  Characteristics: lower hardware complexity; more longer-range analysis; more machine dependence
• Hardware Techniques
  - Dynamic scheduling
  - Dynamic issue (i.e., superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
  Characteristics: more stable performance; higher complexity; potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture: ILP and Memory Systems
Figure: the ILP mountain (simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation) and the cache mountain (simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching)
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
Figure: performance (log scale, 0.1 to 100) versus year (1965-1995) for supercomputers, mainframes, minicomputers, and microprocessors
Microprocessors today: where they are and what they can do
Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year (1970-2005), with eras labeled bit-level parallelism, instruction-level, and thread-level (?); plotted processors: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4)
They support 2-4 threads, but expect only about a 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build multichip modules with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique to obtain more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa-2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute
  - Independent programs
  - Explicitly parallel programs (when possible)
  - Speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education (not just the undergraduate curriculum) organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
But, as we said earlier, engineering is changing
What kinds of fundamentals we need now: some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. They, too, are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Sources of Power Consumption
The three major sources of power consumption in digital CMOS circuits, together with a minor static term, are
$$P_{avg} = p_t C_L V_{dd}^2 f_{CLK} + I_{sc} V_{dd} + I_{leakage} V_{dd} + P_4 = P_1 + P_2 + P_3 + P_4$$
where
P1 - capacitive switching power (dynamic, dominant)
P2 - short-circuit power (dynamic)
P3 - leakage current power (static)
P4 - static power dissipation (minor)
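To make the relative size of the terms concrete, the following sketch evaluates each component for one operating point. Every numeric value (activity factor, load capacitance, supply voltage, clock frequency, currents) is a hypothetical assumption picked only for illustration, and `average_power` is a hypothetical helper name.

```python
def average_power(p_t, c_load, v_dd, f_clk, i_sc, i_leak, p_static=0.0):
    """Return (P1, P2, P3, P4, total) in watts for the CMOS power terms."""
    p1 = p_t * c_load * v_dd ** 2 * f_clk   # P1: capacitive switching power
    p2 = i_sc * v_dd                        # P2: short-circuit power
    p3 = i_leak * v_dd                      # P3: leakage current power
    p4 = p_static                           # P4: other static dissipation (minor)
    return p1, p2, p3, p4, p1 + p2 + p3 + p4

# Hypothetical operating point (all values assumed for illustration).
p1, p2, p3, p4, total = average_power(
    p_t=0.2,        # switching activity factor
    c_load=1e-9,    # total switched load capacitance, farads
    v_dd=2.0,       # supply voltage, volts
    f_clk=100e6,    # clock frequency, hertz
    i_sc=5e-3,      # average short-circuit current, amperes
    i_leak=1e-3,    # leakage current, amperes
)

for name, value in (("P1", p1), ("P2", p2), ("P3", p3), ("P4", p4)):
    print(f"{name}: {value * 1e3:.1f} mW ({100 * value / total:.0f}% of total)")
```

With these assumed numbers the switching term dominates, in line with the label "dominant" on P1; a different technology point could change the balance.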
Research Efforts in Low-Power Design
$$P_{sw} = p_t C_L V_{dd}^2 f_{CLK}$$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flops
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing
• supply voltage
• load capacitance
• switching activity
  - Reducing the supply voltage brings a quadratic improvement
  - Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
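The quadratic effect of the supply voltage can be checked with a few lines of arithmetic. The sketch assumes switching activity and clock frequency stay fixed, so that switching power scales with load capacitance times the square of the supply voltage; the 5.0 V and 3.3 V levels and the helper name are illustrative assumptions.

```python
def switching_power_ratio(v_new, v_old, c_new=1.0, c_old=1.0):
    """Ratio of new to old switching power at fixed activity and frequency."""
    return (c_new / c_old) * (v_new / v_old) ** 2

# Assumed example: lower the supply from 5.0 V to 3.3 V, same load.
ratio_voltage = switching_power_ratio(3.3, 5.0)

# Same voltage step combined with an assumed 20% smaller load capacitance.
ratio_both = switching_power_ratio(3.3, 5.0, c_new=0.8, c_old=1.0)

print(f"voltage only:       {100 * (1 - ratio_voltage):.1f}% less switching power")
print(f"voltage + capacity: {100 * (1 - ratio_both):.1f}% less switching power")
```

The sketch leaves out the delay penalty of a lower supply voltage, so it shows the saving only, not the speed cost.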
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
Figure: normalized gate delay [ns] and normalized power dissipation [W] plotted against supply voltage [V]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is
Decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product pt CL is called the average switched capacitance per cycle, and it can be reduced at the system, architectural, RTL, circuit, or technology level
General Approaches to Reduce Power
Low Power and Low Energy System Design
(toward the system level: higher impact, more options)
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
The design of low power circuits can be tackled at different levels, from system to technology
Multiple Frequency on the Chip as a Technique to Reduce Power
Multiple frequencies on the chip is a less aggressive approach that is attracting more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
Figure: a central PLL distributes fCLK to four local PLL/DLL pairs (PLL 1/DLL 1 through PLL 4/DLL 4), which produce the local clock frequencies (labeled f11, f12, f21, f31, f32, f41)
Energy Minimization Using Multiple Frequency
Figure: two clock-generation schemes for a digital system
PLL based, with blocks labeled: phase detector (CLKREF, CLKFB, up, down), current pump, loop filter, VCO, divider by N, regulated voltage, clock distribution & frequency multiplier, logic
DLL based, with blocks labeled: phase detector (CLKREF, CLKFB, up, down), current pump, loop filter, control, VCDL (delay TVCDL) built from delay cells DC1, DC2, ..., DCn between in and out, producing f1, 2f1, ..., nf1
Clock Gating & Clock Distribution as Techniques to Reduce Power
Figure (clock gating): DFF-based and latch-based gating circuits in which enable and clock produce the gated clock that drives the target flip-flops (activated or deactivated)
Figure (clock distribution): a PLL clock generator supplies Clk to modules A, B, C, and D, with gating signals Enable_A, Enable_B, and Enable_C
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
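The effect of clock gating can be mimicked in a small behavioral model. This is a sketch under simplifying assumptions, not a circuit description: the class `GatedRegister`, the input traces, and the idea of counting delivered clock edges as a stand-in for clock energy are all hypothetical.

```python
class GatedRegister:
    """Behavioral model of a register whose clock is suppressed when enable is low."""

    def __init__(self):
        self.q = 0
        self.clock_events = 0   # clock edges that actually reach the flip-flops

    def tick(self, d, enable):
        # Gated clock = clock AND enable: no edge is delivered when enable is 0.
        if enable:
            self.clock_events += 1
            self.q = d
        return self.q


def simulate(enables, data):
    """Apply one clock cycle per (data, enable) pair and return the register."""
    reg = GatedRegister()
    for d, en in zip(data, enables):
        reg.tick(d, en)
    return reg


# Assumed traces: the module is idle in five of eight cycles.
enables = [1, 1, 0, 0, 0, 1, 0, 0]
data = [3, 7, 9, 9, 9, 4, 8, 8]

reg = simulate(enables, data)
print(f"clock edges delivered: {reg.clock_events} of {len(enables)}")
print(f"final register value:  {reg.q}")
```

In this toy trace only 3 of 8 clock edges reach the register, and the stored value is held unchanged while the module is idle.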
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
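The assignment idea can be sketched as a simple selection per module: keep the highest voltage where timing is tight and drop to the lowest level that still meets the deadline elsewhere. Everything numeric here is an assumption made for illustration (the three supply levels, their delay multipliers, the module delays, and the 10 ns period), and each module is treated as an independent timing path, which a real design would not guarantee.

```python
# Hypothetical supply levels, ordered high to low: (volts, assumed delay multiplier).
LEVELS = [(3.3, 1.0), (2.5, 1.4), (1.8, 2.2)]


def assign_voltages(modules, clock_period):
    """Pick, per module, the lowest supply level that still meets the clock period.

    modules maps a module name to its delay (ns) at the highest voltage.
    """
    assignment = {}
    for name, delay in modules.items():
        chosen = LEVELS[0][0]            # default: highest voltage
        for volts, multiplier in LEVELS:
            if delay * multiplier <= clock_period:
                chosen = volts           # levels go high to low, keep the last fit
        assignment[name] = chosen
    return assignment


def relative_energy(volts, v_max=LEVELS[0][0]):
    """Switching energy relative to running at the highest voltage (V^2 scaling)."""
    return (volts / v_max) ** 2


# Assumed module delays in ns at 3.3 V, with a 10 ns clock period.
modules = {"alu": 9.5, "decoder": 6.0, "io": 3.0}
assignment = assign_voltages(modules, clock_period=10.0)

for name, volts in assignment.items():
    print(f"{name}: {volts} V, relative energy {relative_energy(volts):.2f}")
```

Under these assumptions the critical module stays at the highest level while the two modules with slack move to lower supplies. The sketch ignores level converters between voltage domains.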
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
Figure (power state machine): states RUN (P = 400 mW), IDLE (P = 50 mW), and SLEEP (P = 0.16 mW); transitions labeled "wait for interrupt" and "wait for wake-up event", with transition times of ~10 µs, ~90 µs, and 160 ms
Figure (power manager): an observer and a controller; the power manager takes observations and workload information from the system and issues commands to it
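A power manager's basic decision, which low-power state is worth entering for a given idle period, can be sketched from the state powers above (400 mW, 50 mW, 0.16 mW). Several things are assumptions here: the ~10 and ~90 transition figures are read as microseconds, their assignment to specific arcs (RUN to IDLE and back ~10 µs each, RUN to SLEEP ~90 µs, SLEEP to RUN 160 ms) is a guess, transitions are assumed to draw RUN power, and the helper names are hypothetical.

```python
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}

# Assumed round-trip transition time in seconds (enter the state and return to RUN).
ROUND_TRIP_S = {
    "RUN": 0.0,
    "IDLE": 10e-6 + 10e-6,       # assumed: ~10 us in, ~10 us out
    "SLEEP": 90e-6 + 160e-3,     # assumed: ~90 us in, 160 ms out
}


def idle_energy_mj(state, idle_s):
    """Energy in mJ over an idle period spent in `state`, or None if it does not fit.

    Assumes transitions are charged at RUN power (mW * s = mJ).
    """
    overhead = ROUND_TRIP_S[state]
    if overhead > idle_s:
        return None              # cannot enter and leave the state in time
    return POWER_MW["RUN"] * overhead + POWER_MW[state] * (idle_s - overhead)


def best_state(idle_s):
    """State with the lowest energy for an idle period of idle_s seconds."""
    energies = {s: idle_energy_mj(s, idle_s) for s in POWER_MW}
    feasible = {s: e for s, e in energies.items() if e is not None}
    return min(feasible, key=feasible.get)


for idle in (1e-5, 0.1, 1.0, 10.0):
    print(f"idle for {idle:g} s -> wait in {best_state(idle)}")
```

The long wake-up time makes SLEEP pay off only for idle periods well beyond a second under these assumptions, which is why a power manager needs workload information before committing to the deepest state.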
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
Figure (power breakdown): pie chart with segments Clock, Memory, Control, I/O, and Datapath; values shown: 12, 16, 25, 43
Figure (dynamic instruction statistics): pie chart with segments Data Move, Control Flow, Arithmetic op, Compare op, Logical op, and Others; values shown: 15, 13, 5, 1, 23, 43
Architecture Trade-offs: Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline: Microprocessors' Generations
• First generation, 1971-78: behind the power curve
• Second generation, 1979-85: becoming "real" computers
• Third generation, 1985-89: challenging the "establishment"
• Fourth generation, 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation: 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose, single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3500 transistors
  - First microprocessor-based computer (Micral)
    - Targeted at laboratory instrumentation
    - Mostly sold in Europe
Intel 8080
• Intel's first 16-bit architecture
  - Delivered in 1974
  - 4800 transistors
  - Performance < 0.2 MIPS
• Used in Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    - $297, or $395 with case
    - 256 bytes of memory, expandable to 64K
    - Keyboard and floppy
    - 100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
    - Homebrew Computer Club
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981, IBM introduces PC
  - Based on 8088: 8-bit bus version of 8086
Second Generation: 1979-85
• Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture
    - Initial 16-bit implementation
  - First flat 32-bit address
    - Support for paging
  - General-purpose register architecture
    - Loosely based on PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation: 1985-89
• Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation: 1990-
• Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985 to present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors: architectural ideas turn transistors into performance
    - Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction-level parallelism
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
ndash CPU-DRAM access gap has grown by a factor of 30-50 Trend 1 Increasingly large caches
ndash On-chip from 128 bytes (1984) to 100K+ bytesndash Multilevel caches add another level of caching
First multilevel cache1986 Secondary cache sizes today 128KB to 4-16 MB
Trend 2 Advances in caching techniquesndash Reduce or hide cache miss latencies
early restart after cache miss (1992) nonblocking caches continue during a cache miss (1994)
ndash Cache aware combos computers compilers code writers prefetching instruction to bring data into cache early
Exploiting ILPExploiting ILP ILP is the implicit parallelism among instructions Exploited by
ndash Overlapping execution in a pipeline ndash Issuing multiple instruction per clock
superscalar uses dynamic issue decision (HW driven) VLIW uses static issue decision (SW driven)
1985 simple microprocessor pipeline (1 instrclock) 1990 first static multiple issue microprocessors 1995 sophisticated dynamic schemes
ndash determine parallelism dynamicallyndash execute instructions out-of-orderndash speculative execution depending on branch prediction
ldquoOff-the-shelfrdquo ILP techniques yielded 20 year path
MIPS R4000MIPS R4000 First 64-bit architecture Integrated caches
ndash On-chipndash Support for off-chip secondary
cache Integrated floating point Implemented in 1991
ndash Deep pipelinendash 14M transistorsndash Initially 100MHzndash gt 50 MIPS
Intel i860Intel i860
First multiple issue microprocessor ndash 2 instructionsclockndash Dual issue mode ndash Novel push pipelinendash Novel cache bypass
Implemented in 1991ndash 13M transistorsndash 50 mips
Used primarily as attached processor (eg graphics)
MIPS R10000MIPS R10000
First speculative processorndash Instruction scheduled and
executed out-of-orderndash Up to 4 instructions can
complete per clockndash Window of 32 instructions
(up to 32 in-flight)ndash Maintain precise state by
completing instructions in order
Implemented in 1996ndash 68M transistorsndash 200 MHz
Intel IA-64 and ItaniumIntel IA-64 and Itanium
EPIC architecturendash Use compiler centric approach
while avoiding disadvantagesndash Parallelism demarcated by the
compilerndash Many special instruction amp
features for exploiting ILP in the compiler
Itaniumndash First implementation (2001) ndash 25 M transistorsndash 800 MHzndash 130 Watts
Breakdown of tasks between Breakdown of tasks between compiler and runtime hardwarecompiler and runtime hardware
Todayrsquos Uniprocessor ILP Todayrsquos Uniprocessor ILP MenuMenu
Software Techniquesndash Static schedulingndash Static issue (ie VLIW)ndash Static branch predictionndash Aliaspointer analysisndash Static speculation
Hardware Techniquesndash Dynamic schedulingndash Dynamic issue (ie superscalar)ndash Dynamic branch predictionndash Dynamic disambiguationndash Dynamic speculation
Wide variety of approaches both hardware and compiler intensive
Lower hardware complexity
More longer range analysis
More machine dependence
Lower hardware complexity
More longer range analysis
More machine dependence
More stable performance
Higher complexity
Potential clock rate impact
More stable performance
Higher complexity
Potential clock rate impact
No clear cut winners at the present
Big Picture--ILP and Memory Big Picture--ILP and Memory SystemsSystems
Simplepipelining
Scheduledpipelines
Multipleissue
Dynamicschedulin
g
Speculation
My view
bullNo performance wall but steeper slopes ahead
bullEasier territory is behind us
bullIndustry-research gap vanished
bullEnergy efficiency may be key limit
ILPMountai
n
Multilevelcaches amp buffers
Critical word amp early restart
Compilerprefetchin
g
Multipathprefetchin
g
Simplecaches
CacheMountai
n
Per
form
ance
01
1
10
100
1965 1970 1975 1980 1985 1990 1995
Supercomputers
Minicomputers
Mainframes
Microprocessors
Microprocessors today where Microprocessors today where they are and what can dothey are and what can do
Tran
sist
ors
1000
10000
100000
1000000
10000000
100000000
1970 1975 1980 1985 1990 1995 2000 2005
Bit-level parallelism Instruction-level Thread-level ()
i4004
i8008i8080
i8086
i80286
i80386
R2000
Pentium
R10000
R3000
Microprocessors where they goMicroprocessors where they go
Intel more TransistorIntel more Transistor
Intel Faster DevicesIntel Faster Devices
Number of Transistors in Number of Transistors in Intelrsquos processorsIntelrsquos processors
Higher level parallelismHigher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The more prononuced are
a) simultaneons multithreaded (SMT) processor and
b) chip multiprocessors (CMT)
MultithreadingMultithreading
Microprocessor can execute multiple operations at a time4 or 6 operations per cycle
Hard to achieve this level of parallelism from single program
Can we run multiple programs (threads) on (single) processor without much effort
Simultaneous multithreading (SMT) or Hyperthreading is a solution
Parallel Thread Sequencing ModelParallel Thread Sequencing Model
Principles of SMTPrinciples of SMT
Multithreading in todayrsquos Multithreading in todayrsquos processorsprocessors
Today many high-end microprocessors are multithreaded (eg Intel Pentium 4)
Support for 2-4 threads but expect to get only 13X improvement in throughput
Chip MultiprocessorChip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model
CMP is a simple very powerful techiniques to obtain more performance in a power-effecient manner
The idea is to put several microprocessors on a single die This type of architecture is reffered also as Multiprocessor System-on-Chip (MPSoC)
The performance of small-scale CMP scales close to linear with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model - continue - continue
CMP is an atractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications we meet in network processors multimedia hubs signal processors etc
MPSoCs are usually implemented as heterogenous systems
CMT and SMT can coexist-a CMP die can integrate several SMT microprocessors
Generic circa 2010 Generic circa 2010 MicroprocessorMicroprocessor
4 ndash 8 general-purpose processing engines on chip used to execute independent programs
Explicitly parallel programs (when possible) Speculatively parallel threads
Special-purpose processing units (eg DSP functionality)
Elaborate memory hierarchyElaborate inter-chip communication facilities
Characteristics of superscalar simultaneous Characteristics of superscalar simultaneous multithreading and chip multiprocessor architecturesmultithreading and chip multiprocessor architectures
Characteristic SuperscalarSimultaneousmultithreading
Chipmultiprocessor
Number of CPUs 1 1 8CPU issue width 12 12 2 per CPUNumber of threads 1 8 1 per CPUArchitecture registers (for integer and floating point) 32 32 per thread 32 per CPUPhysical registers (for integer and floating point 32 + 256 256 + 256 32 + 32 per CPUInstruction window size 256 256 32 per CPUBranch predictor table size (entries) 32768 32768 8x4096Return stack size 64 entries 64 entries 8x8 entriesInstruction (I) and data (D) cache organization 1x8 banks 1x8 banks 1 bankI and D cache sizes 128 kbytes 128 kbytes 16 kbytes per CPUI and D cache associativities 4-way 4-way 4-wayI and 0 cache line sizes (bytes) 32 32 32I and P cache access times (cycles) 2 2 1Secondary cache organization (Mbytes) 1x8 banks 1x8 banks 1x8 banksSecondary cache size (bytes) 8 8 8Secondary cache associativity 4-way 4-way 4-waySecondary cache line size (bytes) 32 32 32Secondary cache access time (cycles) 5 5 7Secondary cache occupancy per access (cycles) 1 1 1Memory organization (no of banks) 4 4 4Memory access time (cycles) 50 50 50Memory occupancy per access (cycles) 13 13 13
The microprocessor tomorrowThe microprocessor tomorrow When we say ldquomicroprocessorrdquo tomorrow we generally mean the shaded area of the figure
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Challenges in EducationChallenges in Education
bullChanges in curriculaChanges in curriculabullFundamentalsFundamentalsbullA sort of the challenge we should acceptA sort of the challenge we should accept
ChalengeChalengess in Education in Education
It has often said that Where you stand depends on where you sit
In this context starting from our positions an experiences this is our view concerning the theme How shall we satisfy the long-term educational needs of engineers
How to organize a training of new How to organize a training of new engineers engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then
Is the whole system of engineering education ndash not just the undergraduate curriculum ndash organized to support todayrsquos graduate for the next 40 years
We think not on both counts
Our view amp our experienceOur view amp our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academy and industry) which admittedly has changed more rapidly than same other fields
Changes in curriculaChanges in curricula
It is almost a clicheacute to talk about change ndash so mach so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Research Efforts in Low-Power Design
$P_{sw} = p_t \, C_L \, V_{dd}^2 \, f_{CLK}$
Reduce switching activity
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flop
Technology scaling
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce the active load
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Reducing the Power Dissipation
The power dissipation can be minimized by reducing:
• supply voltage
• load capacitance
• switching activity
– Reducing the supply voltage brings a quadratic improvement
– Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] plotted against supply voltage [V]]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is:
Decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components

a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product pt·CL is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit or technology level.
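The three levers a) to c) can be compared numerically with the switching-power relation Psw = pt · CL · Vdd² · fCLK. The following is a minimal sketch only: the function name and the operating point (activity factor 0.2, 1 nF switched capacitance, 3 V, 100 MHz) are hypothetical values chosen for illustration, not figures from the slides.

```python
def switching_power(p_t, c_load, v_dd, f_clk):
    """Dynamic switching power P_sw = p_t * C_L * V_dd^2 * f_CLK, in watts."""
    return p_t * c_load * v_dd ** 2 * f_clk


# Hypothetical operating point, used only to compare the levers.
baseline = switching_power(p_t=0.2, c_load=1e-9, v_dd=3.0, f_clk=100e6)
half_freq = switching_power(p_t=0.2, c_load=1e-9, v_dd=3.0, f_clk=50e6)
half_vdd = switching_power(p_t=0.2, c_load=1e-9, v_dd=1.5, f_clk=100e6)
half_cap = switching_power(p_t=0.1, c_load=1e-9, v_dd=3.0, f_clk=100e6)

print(f"baseline        : {baseline:.3f} W")
print(f"f_CLK halved    : {half_freq / baseline:.2f} x baseline")
print(f"V_dd halved     : {half_vdd / baseline:.2f} x baseline")
print(f"p_t * C_L halved: {half_cap / baseline:.2f} x baseline")
```

Halving the clock frequency or the switched capacitance gives a linear saving, while halving the supply voltage gives a quadratic one. The sketch does not model the increase in gate delay that accompanies a lower Vdd.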
General Approaches to Reduce Power
Low Power and Low Energy System Design
The design of low power circuits can be tackled at different levels, from system to technology:
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
Figure annotation: higher impact, more options
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL generates fCLK, which feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4) that produce the local clock frequencies labeled f11, f12, f21, f31, f32 and f41]
Energy Minimization Using Multiple Frequency
[Figure, PLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO and divider by N; a regulated voltage supplies the digital system, with clock distribution & frequency multiplier feeding the logic]
[Figure, DLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump and loop filter producing the control signal of a voltage-controlled delay line (VCDL) built from delay cells DC1, DC2, ..., DCn between in and out; the digital system receives f1, 2f1, ..., nf1]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure, clock gating: an enable signal is captured by a latch and combined with the clock to form the gated clock that drives the target D flip-flops (D, C, Q)]
[Figure, clock distribution: a PLL (clock generator) distributes Clk to blocks A, B, C and D; Enable_A, Enable_B and Enable_C mark blocks as activated or deactivated]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
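The behaviour of a latch-based clock gate can be illustrated with a small cycle-level model. This is a sketch under stated assumptions, not the circuit from the slide: it assumes the enable is latched while the clock is low and that the gated clock is the AND of the clock and the latched enable; the function names and the stimulus are hypothetical.

```python
def gated_clock_trace(clock, enable):
    """Model a latch-based clock gate, one sample per half clock period.

    The latch is transparent while the clock is low, so the enable can only
    change the gated clock between pulses (no glitches on the gated clock).
    """
    latched = 0
    gated = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en          # latch follows enable during the low phase
        gated.append(clk & latched)
    return gated


def rising_edges(signal):
    """Count 0 -> 1 transitions, i.e. the edges that clock the flip-flops."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)


clock = [0, 1] * 8                                  # eight clock cycles
enable = [1, 1, 1, 1] + [0] * 8 + [1, 1, 1, 1]      # module idle for four cycles
gated = gated_clock_trace(clock, enable)

print("clock edges seen by an ungated module:", rising_edges(clock))
print("clock edges seen by the gated module :", rising_edges(gated))
```

In this stimulus the gated module receives half of the clock edges, so its flip-flops and clock wiring switch half as often while the module is idle.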
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip is a less aggressive approach that is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
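The saving from a dual-supply scheme can be estimated from the fact that switching energy scales with C · Vdd². The sketch below is illustrative only: the module names, capacitances and the two voltage levels are hypothetical, and the overhead of level converters between voltage domains is ignored.

```python
def switching_energy(capacitance, v_dd):
    """Energy per switching event, proportional to C * V_dd^2 (joules)."""
    return capacitance * v_dd ** 2


# (name, switched capacitance in farads, on critical path?) -- hypothetical values
modules = [
    ("alu", 4.0e-12, True),
    ("decoder", 2.0e-12, False),
    ("io_ctrl", 2.0e-12, False),
]
V_HIGH, V_LOW = 3.0, 2.0

# Single supply: every module runs at the high voltage.
single = sum(switching_energy(c, V_HIGH) for _, c, _ in modules)
# Dual supply: only critical-path modules keep the high voltage.
dual = sum(
    switching_energy(c, V_HIGH if critical else V_LOW)
    for _, c, critical in modules
)

print(f"energy saving with two supplies: {100 * (1 - dual / single):.1f} %")
```

Only the noncritical modules are moved to the lower supply, so the timing of the critical path is unchanged in this model.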
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
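[Figure, power state machine: states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt) and SLEEP (P = 0.16 mW, wait for wake-up event); the transitions are annotated with times of about 10 µs, about 90 µs and 160 ms]
[Figure, power manager: an observer collects workload information (observations) from the system, and a controller issues commands to it]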
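The benefit of moving between power states can be estimated by weighting each state's power by the time spent in it. The sketch below uses the three power levels shown in the power state machine figure (assigning 400 mW to RUN, 50 mW to IDLE and 0.16 mW to SLEEP is a reading of that figure); the workload traces are hypothetical and transition time and energy are ignored.

```python
# Power per state in milliwatts, as read from the power state machine figure.
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def average_power_mw(trace):
    """Average power of a trace given as a list of (state, seconds) pairs."""
    total_time = sum(seconds for _, seconds in trace)
    energy_mj = sum(POWER_MW[state] * seconds for state, seconds in trace)
    return energy_mj / total_time   # mW * s = mJ, so mJ / s = mW


# Hypothetical 10-second workloads.
always_on = [("RUN", 10.0)]
managed = [("RUN", 2.0), ("IDLE", 3.0), ("SLEEP", 5.0)]

print(f"always running: {average_power_mw(always_on):.2f} mW")
print(f"power managed : {average_power_mw(managed):.2f} mW")
```

A real power manager would also have to account for the wake-up latency of each state when deciding whether a given idle period is long enough to justify a transition.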
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown by Clock, Memory, Control, I/O and Datapath. Dynamic instruction statistics by Compare op, Logical op, Data Move, Control Flow, Arithmetic op and Others.]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline – Microprocessors’ Generations
• First Generation 1971-78
  – Behind the power curve
• Second Generation 1979-85
  – Becoming “real” computers
• Third Generation 1985-89
  – Challenging the “establishment”
• Fourth Generation 1990-
  – Architectural and performance leadership
The microprocessor today
When we say “microprocessor” today, we generally mean the shaded area of the figure.
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  – Narrow datapaths (= slow performance)
  – Awkward architectures
  – Assembly language + some BASIC
• Processors
  – Intel 4004, 8008, 8080, 8086
  – Zilog Z-80
  – Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  – 3,500 transistors
  – First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• Intel’s first 16-bit architecture
  – Delivered in 1974
  – 4,800 transistors
  – Performance < 0.2 MIPS
• Used in Altair 8800 system
  – Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, first microcomputer bus
  – Gates & Allen write BASIC
  – Wozniak builds one
    Homebrew Computer Club
Intel 8086
• Introduced in 1978
  – Performance < 0.5 MIPS
• New 16-bit architecture
  – “Assembly language” compatible with 8080
  – 29,000 transistors
  – Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces PC
  – Based on 8088, an 8-bit bus version of 8086
Second Generation 1979-85
• Becoming “real” computers
  – First 32-bit architecture (68000)
  – First virtual memory support
  – Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  – Motorola 68000, 68020
  – Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  – First 32-bit architecture
    Initial 16-bit implementation
  – First flat 32-bit address
    Support for paging
  – General-purpose register architecture
    Loosely based on PDP-11
• First implementation in 1979
  – 68,000 transistors
  – < 1 MIPS
• Used in
  – Apple Mac
  – Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the “establishment”
  – Microprocessors surpass minicomputers in performance, rival mainframes
  – Implementation technology of choice: all new architectures are microprocessors
  – RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  – MIPS R2000, R3000
  – Sun SPARC
  – HP PA-RISC
MIPS R2000
• Several firsts
  – First RISC microprocessor
  – First microprocessor to provide integrated support for instruction & data cache
  – First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  – 125,000 transistors
  – 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
  – First 64-bit architecture
  – First multiple-issue machine
  – First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  – Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5 – same basic approach but faster clock rates & wider issue
  – Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  – True from 1985-present
• Combination of technology and architectural enhancements
  – Technology provides faster transistors and more of them
  – Faster transistors lead to high clock rates
  – More transistors
    • Architectural ideas turn transistors into performance
      – Responsible for about half the yearly performance growth
• Two key architectural directions
  – Sophisticated memory hierarchies
  – Exploiting instruction level parallelism
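To get a feel for what a sustained yearly growth factor means, the compounding can be worked out directly. This is a small arithmetic sketch that reads the slide's growth figure as 1.6x per year; the variable names are illustrative.

```python
import math

ANNUAL_GROWTH = 1.6   # performance multiplier per year, as read from the slide

# Time needed for performance to double at this rate.
doubling_time_years = math.log(2) / math.log(ANNUAL_GROWTH)

# Cumulative improvement after ten years of compounding.
growth_over_decade = ANNUAL_GROWTH ** 10

print(f"doubling time      : {doubling_time_years:.2f} years")
print(f"growth over 10 yrs : {growth_over_decade:.0f}x")
```

At this rate performance doubles roughly every year and a half, and a decade of compounding yields an improvement of about two orders of magnitude.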
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
  – CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  – On-chip: from 128 bytes (1984) to 100K+ bytes
  – Multilevel caches add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  – Reduce or hide cache miss latencies
    Early restart after cache miss (1992)
    Nonblocking caches: continue during a cache miss (1994)
  – Cache-aware combos: computers, compilers, code writers
    Prefetching: instruction to bring data into cache early
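The effect of adding a cache level can be illustrated with the standard average memory access time (AMAT) relation, hit time + miss rate × miss penalty, applied level by level. The latencies and miss rates below are hypothetical values chosen for illustration; they are not measurements from the slides.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles: hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical parameters (cycles and miss rates), for illustration only.
L1_HIT, L1_MISS_RATE = 2, 0.05
L2_HIT, L2_MISS_RATE = 5, 0.20
DRAM_LATENCY = 50

# One cache level: every L1 miss pays the full DRAM latency.
single_level = amat(L1_HIT, L1_MISS_RATE, DRAM_LATENCY)

# Two cache levels: an L1 miss pays the average access time of the L2 level.
l2_time = amat(L2_HIT, L2_MISS_RATE, DRAM_LATENCY)
two_level = amat(L1_HIT, L1_MISS_RATE, l2_time)

print(f"AMAT with L1 only  : {single_level:.2f} cycles")
print(f"AMAT with L1 and L2: {two_level:.2f} cycles")
```

With these assumed numbers the secondary cache reduces the average access time from 4.5 to 2.75 cycles, which is the sense in which multilevel caches hide DRAM latency.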
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  – Overlapping execution in a pipeline
  – Issuing multiple instructions per clock
    Superscalar: uses dynamic issue decision (HW driven)
    VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple issue microprocessors
• 1995: sophisticated dynamic schemes
  – determine parallelism dynamically
  – execute instructions out-of-order
  – speculative execution depending on branch prediction
• “Off-the-shelf” ILP techniques yielded 20 year path
MIPS R4000
• First 64-bit architecture
• Integrated caches
  – On-chip
  – Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  – Deep pipeline
  – 1.4M transistors
  – Initially 100 MHz
  – > 50 MIPS
Intel i860
• First multiple issue microprocessor
  – 2 instructions/clock
  – Dual issue mode
  – Novel push pipeline
  – Novel cache bypass
• Implemented in 1991
  – 1.3M transistors
  – 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
  – Instructions scheduled and executed out-of-order
  – Up to 4 instructions can complete per clock
  – Window of 32 instructions (up to 32 in-flight)
  – Maintain precise state by completing instructions in order
• Implemented in 1996
  – 6.8M transistors
  – 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  – Use compiler-centric approach while avoiding disadvantages
  – Parallelism demarcated by the compiler
  – Many special instructions & features for exploiting ILP in the compiler
• Itanium
  – First implementation (2001)
  – 25M transistors
  – 800 MHz
  – 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today’s Uniprocessor ILP Menu
Software Techniques
  – Static scheduling
  – Static issue (i.e. VLIW)
  – Static branch prediction
  – Alias/pointer analysis
  – Static speculation
  Lower hardware complexity; more longer-range analysis; more machine dependence
Hardware Techniques
  – Dynamic scheduling
  – Dynamic issue (i.e. superscalar)
  – Dynamic branch prediction
  – Dynamic disambiguation
  – Dynamic speculation
  More stable performance; higher complexity; potential clock rate impact
Wide variety of approaches, both hardware and compiler intensive
No clear-cut winners at present
Big Picture--ILP and Memory Systems
[Figure: two “mountains” of successive techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (logarithmic scale, 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (logarithmic scale, 1,000 to 100,000,000) versus year (1970 to 2005) for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000; successive eras are labeled bit-level parallelism, instruction-level and thread-level (?)]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel’s processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The more pronounced are:
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a (single) processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today’s processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4).
They support 2-4 threads, but expect to get only a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
• 4 – 8 general-purpose processing engines on chip, used to execute independent programs
• Explicitly parallel programs (when possible)
• Speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrowThe microprocessor tomorrow When we say ldquomicroprocessorrdquo tomorrow we generally mean the shaded area of the figure
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Challenges in EducationChallenges in Education
bullChanges in curriculaChanges in curriculabullFundamentalsFundamentalsbullA sort of the challenge we should acceptA sort of the challenge we should accept
ChalengeChalengess in Education in Education
It has often said that Where you stand depends on where you sit
In this context starting from our positions an experiences this is our view concerning the theme How shall we satisfy the long-term educational needs of engineers
How to organize a training of new How to organize a training of new engineers engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then
Is the whole system of engineering education ndash not just the undergraduate curriculum ndash organized to support todayrsquos graduate for the next 40 years
We think not on both counts
Our view amp our experienceOur view amp our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academy and industry) which admittedly has changed more rapidly than same other fields
Changes in curriculaChanges in curricula
It is almost a clicheacute to talk about change ndash so mach so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentalsWhat are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we What kinds of fundamentals we need now ndash some examplesneed now ndash some examples
Information technology (IT) will be embedded in virtually engineered product and process in the future ndash ie the design space for all engineers will include ITDiscrete mathematics not continuous math is the underpinning of ITIt is a new fundamental
Biological materials and process are a bit behind IT in their impact on engineering but they a closing fastThus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of FundamentalsKinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fieldsMore knowledge of the full spectrum will be the fundamental
Engineering is global and is performed in a holistic business contextThe engineer must design under constraints that include global cultural and business contexts and so must understand themThey two are new fundamentals
How to add these new fundamentalsHow to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of What will the character and essence of electrical and computer engineering electrical and computer engineering
education look like in the future education look like in the future
It is difficult to predict the future with any accuracy but it is safe to say that
Web-based teachingdistance learningelectronic books and
interactive learning environments
will play increasingly significant roles in shaping what we teach how we teach and how students learn
A sort of challenge we should acceptA sort of challenge we should accept
During one visit at our faculty Prof Krishna Shenai from University of Illinois of Chicago director of Micro Systems Research Center says to us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Reducing the Power Reducing the Power DissipationDissipation
The power dissipation can be minimized by reducing
supply voltageload capacitanceswitching activity
ndash Reducing the supply voltage brings a quadratic improvement
ndash Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed
Amount of Reducing the Power Amount of Reducing the Power DissipationDissipation
Gate Delay and Power Dissipation Gate Delay and Power Dissipation in Term of Supply Voltagein Term of Supply Voltage
06 30 50
Supply voltage [ V ]
Ga
te d
elay
[n
s]
(no
rma
lize
d)
Po
we
r d
issi
pat
ion
[ W
](n
orm
ali
zed
)
1
1 02 5
1
bull Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
bull The main interest of many researches is now oriented towards lowering the energy dissipation of these systems while still maintaining the high-throughput in real time processing
Needs for Low-Power
Low-Power Design Low-Power Design Techniques Techniques
The basic idea is
Decreasing activity of the some parts within VLSI IC
The term power manager refer to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low active conditions for various parts of the circuit and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-active components
a) Reduction in fCLK is an option acceptable when
some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way for
power reduction since the power is proportional to the square of Vdd The problem with reducing
Vdd is that it leads to an increase in circuit delay
c) The product ptCL is called the average switched
capacitance per cycle and the main directions for reducing this capacitance are done at system- architectural- RTL- circuit- or technology level
General Approaches to Reduce PowerGeneral Approaches to Reduce Power
Low Power and Low Energy System DesignLow Power and Low Energy System Design
higher impactmore options
AlgorithmLevel
ArchitectureLevel
CircuitLevel
Process DeviceLevel
SystemLevel Design partitioning Power Down
Complexity Concurrency LocalityRegularity Data representation
Voltage scaling ParallelismInstruction set Signal correlations
Transistor sizing Logic optimizationActivity Driven Power Down Low-swing logic Adiabatic switching
Threshold Reduction Multi-threshold
The design of low power circuits can be tackled at different levels from system to technology
Multiple Frequency on the Chip as Multiple Frequency on the Chip as Technique to Reduce PowerTechnique to Reduce Power
Less aggressive approach is which attracts more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
f11 f12
PLL 1 DLL 1
PLL
fCLK
f21 f21
PLL 2 DLL 2
f31 f32
PLL 3 DLL 3
f41 f41
PLL 4 DLL 4
f11 f12
PLL 1 DLL 1
PLL
fCLK
f21 f21
PLL 2 DLL 2
f31 f32
PLL 3 DLL 3
f41 f41
PLL 4 DLL 4
Energy Minimization Using Energy Minimization Using Multiple FrequencyMultiple Frequency
phasedetector
curentpump
divider by N
up
down
CLKREF
CLKFB
VCO
loopfilter
digital system
regulated voltage
clock distribution ampfrequency multiplier
logic
phasedetector
curentpump
up
down
CLKREF
CLKFBloopfilter
digital system
f1 2f1 nf1
control
VCDL
TVCDL
DC1 DC2 DCn
in out
PLL based
DLL based
Clock GatingClock Gating amp Clock Distribution as amp Clock Distribution as Techniques to Reduce PowerTechniques to Reduce Power
DFF
D
C
Q
enable
clock
gated-clock
target flip-flops
latch
clock
enable
gated-clock
activated deactivated
D C
BA
Enable_BEnable_A
Enable_C
PLL(Clk generator)
Clk
Clock distributionClock gating
- The use of gated clock is the most common approach to reduce energy Unused modules are turned off by suppressing the clock to the module
Energy Minimazation Using Energy Minimazation Using Multiple Supply VoltageMultiple Supply Voltage
bull Multiple supply voltage on the chip as less aggressive approach is attracting attention
bull This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
bull This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management System Level Dynamic Power Management as another Techniques to Reduce Poweras another Techniques to Reduce Power
Dynamic power management is design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
RUN
SLEEPIDLE~90s
~10s 160ms
Wait for interrupt Wait for wake-up event
P=400mW
~90s~10s
P=50mW P=016mW
OBSERVER CONTROLLER
Workloadinformation
Power Manager
SYSTEM
Observations Commands
Power Manager Power State Machine
Power Breakdown in High-Performance Power Breakdown in High-Performance CPU and Dynamic Instruction StatisticsCPU and Dynamic Instruction Statistics
12
16
25
43
Clock
Memory
Control IODatapath
1513
51
23
43
Compare op
Logical op
Others
Data Move
Control Flow
Arithmetic op
Power breakdown Dynamic instruction statistics
Architecture Trade-offs ndash Architecture Trade-offs ndash Reference DatapathReference Datapath
Parallel DatapathParallel Datapath
The More Parallel the BetterThe More Parallel the Better
Pipeline DatapathPipeline Datapath
Architecture Summary for a SimpleArchitecture Summary for a Simple
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline – Microprocessors’ Generations
• First Generation, 1971-78: behind the power curve
• Second Generation, 1979-85: becoming “real” computers
• Third Generation, 1985-89: challenging the “establishment”
• Fourth Generation, 1990-: architectural and performance leadership
The microprocessor today
When we say “microprocessor” today, we generally mean the shaded area of the figure.
The First Generation, 1971-78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  – Narrow datapaths (= slow performance)
  – Awkward architectures
  – Assembly language + some BASIC
• Processors
  – Intel 4004, 8008, 8080, 8086
  – Zilog Z-80
  – Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  – 3,500 transistors
  – First microprocessor-based computer (Micral)
      – Targeted at laboratory instrumentation
      – Mostly sold in Europe
Intel 8080
• 8-bit architecture with a 16-bit address bus
  – Delivered in 1974
  – 4,800 transistors
  – Performance < 0.2 MIPS
• Used in the Altair 8800 system
  – Kit form (advertised in Popular Electronics) in 1975
      – $297, or $395 with case
      – 256 bytes of memory, expandable to 64K
      – Keyboard and floppy
      – 100-line bus becomes S-100, the first microcomputer bus
  – Gates & Allen write BASIC
  – Wozniak builds one
• Homebrew Computer Club
Intel 8086
• Introduced in 1978
  – Performance < 0.5 MIPS
• New 16-bit architecture
  – “Assembly language” compatible with 8080
  – 29,000 transistors
  – Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
  – Based on the 8088, an 8-bit bus version of the 8086
Second Generation, 1979-85
Becoming “real” computers
  – First 32-bit architecture (68000)
  – First virtual memory support
  – Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  – Motorola 68000, 68020
  – Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  – First 32-bit architecture (initial 16-bit implementation)
  – First flat 32-bit address
      – Support for paging
  – General-purpose register architecture
      – Loosely based on PDP-11
• First implementation in 1979
  – 68,000 transistors
  – < 1 MIPS
• Used in
  – Apple Mac
  – Sun, Silicon Graphics & Apollo workstations
Third Generation, 1985-89
Challenging the “establishment”
  – Microprocessors surpass minicomputers in performance, rival mainframes
  – Implementation technology of choice: all new architectures are microprocessors
  – RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  – MIPS R2000, R3000
  – Sun SPARC
  – HP PA-RISC
MIPS R2000
• Several firsts
  – First RISC microprocessor
  – First microprocessor to provide integrated support for instruction & data cache
  – First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  – 125,000 transistors
  – 5-8 MIPS
Fourth Generation, 1990-
Architectural and performance leadership
  – First 64-bit architecture
  – First multiple-issue machine
  – First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  – Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  – Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  – True from 1985 to the present
• Combination of technology and architectural enhancements
  – Technology provides faster transistors and more of them
  – Faster transistors lead to higher clock rates
  – More transistors: architectural ideas turn transistors into performance
      – Responsible for about half the yearly performance growth
• Two key architectural directions
  – Sophisticated memory hierarchies
  – Exploiting instruction-level parallelism
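To get a feeling for what a steady 1.6x per year means, the compound growth can be worked out directly. The sketch below is plain arithmetic; reading "about half the yearly growth" as an equal multiplicative share for technology and architecture is an assumption made here for illustration.

```python
import math


def cumulative_speedup(rate_per_year, years):
    """Total performance gain after compounding the yearly rate."""
    return rate_per_year ** years


def doubling_time_years(rate_per_year):
    """Years needed for performance to double at the given yearly rate."""
    return math.log(2) / math.log(rate_per_year)


def equal_share(rate_per_year):
    """Yearly factor of each of two contributors, assuming equal multiplicative shares."""
    return math.sqrt(rate_per_year)


print(round(cumulative_speedup(1.6, 15), 1))  # gain over 15 years
print(round(doubling_time_years(1.6), 2))     # years per doubling
print(round(equal_share(1.6), 3))             # per-contributor yearly factor
```

Fifteen years at this rate compound to a gain of more than a thousand, which is why a slowdown in either contributor is felt quickly.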
Memory Hierarchies
• Caches hide latency of DRAM and increase bandwidth
  – CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  – On-chip: from 128 bytes (1984) to 100K+ bytes
  – Multilevel caches add another level of caching
      – First multilevel cache: 1986
      – Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  – Reduce or hide cache miss latencies
      – Early restart after cache miss (1992)
      – Nonblocking caches: continue during a cache miss (1994)
  – Cache-aware combos: computers, compilers, code writers
      – Prefetching: instruction to bring data into cache early
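How much a second cache level helps can be estimated with the usual average memory access time recurrence (hit time plus miss rate times the access time of the next level). This is a minimal sketch: the recurrence is the textbook model rather than something stated on the slides, and the miss rates and cycle counts below are hypothetical.

```python
def average_access_time(levels, memory_cycles):
    """Average memory access time in cycles.

    levels is a list of (hit_time_cycles, miss_rate) pairs, ordered from the
    cache closest to the CPU outward; memory_cycles is the DRAM access time.
    """
    time = memory_cycles
    for hit_time, miss_rate in reversed(levels):
        time = hit_time + miss_rate * time
    return time


# Hypothetical numbers: 2-cycle primary cache, 5-cycle secondary cache, 50-cycle memory.
single_level = average_access_time([(2, 0.05)], 50)
two_level = average_access_time([(2, 0.05), (5, 0.20)], 50)
print(single_level, two_level)
```

With these made-up miss rates the secondary cache cuts the average access time from 4.5 to 2.75 cycles, which illustrates why multilevel caches became standard as the CPU-DRAM gap grew.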
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  – Overlapping execution in a pipeline
  – Issuing multiple instructions per clock
      – Superscalar: uses dynamic issue decision (HW driven)
      – VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  – Determine parallelism dynamically
  – Execute instructions out-of-order
  – Speculative execution depending on branch prediction
• “Off-the-shelf” ILP techniques yielded 20 year path
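The parallelism that a multiple-issue machine can extract is bounded by the data dependences among instructions. The sketch below schedules a short, made-up instruction sequence under simplifying assumptions that are not from the slides: single-cycle latency, only true (read-after-write) dependences, perfect register renaming, and no branches.

```python
def schedule(instructions, issue_width=None):
    """Return the issue cycle of each instruction.

    instructions is a list of (dest_register, [source_registers]) in program
    order. issue_width limits instructions per cycle (None means unlimited).
    """
    ready = {}   # register -> first cycle in which its value can be read
    issued = {}  # cycle -> number of instructions already issued in it
    cycles = []
    for dest, sources in instructions:
        cycle = max((ready.get(src, 0) for src in sources), default=0)
        if issue_width is not None:
            while issued.get(cycle, 0) >= issue_width:
                cycle += 1
        issued[cycle] = issued.get(cycle, 0) + 1
        cycles.append(cycle)
        ready[dest] = cycle + 1  # result available one cycle later
    return cycles


def ilp(instructions, issue_width=None):
    """Instructions per cycle achieved by the schedule."""
    cycles = schedule(instructions, issue_width)
    return len(instructions) / (max(cycles) + 1)


# Hypothetical program: r0 is an input register.
program = [
    ("r1", ["r0"]), ("r2", ["r0"]), ("r3", ["r1", "r2"]),
    ("r4", ["r0"]), ("r5", ["r3", "r4"]), ("r6", ["r5"]),
]
print(ilp(program, 1), ilp(program, 2), ilp(program))
```

In this example a two-issue machine already reaches the dependence-limited 1.5 instructions per cycle, so a wider machine gains nothing; the chain r1 -> r3 -> r5 -> r6 is the limit.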
MIPS R4000
• First 64-bit architecture
• Integrated caches
  – On-chip
  – Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  – Deep pipeline
  – 1.4M transistors
  – Initially 100 MHz
  – > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  – 2 instructions/clock
  – Dual issue mode
  – Novel push pipeline
  – Novel cache bypass
• Implemented in 1991
  – 1.3M transistors
  – 50 MIPS
• Used primarily as attached processor (e.g., graphics)
MIPS R10000
• First speculative processor
  – Instructions scheduled and executed out-of-order
  – Up to 4 instructions can complete per clock
  – Window of 32 instructions (up to 32 in-flight)
  – Maintains precise state by completing instructions in order
• Implemented in 1996
  – 6.8M transistors
  – 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  – Use compiler-centric approach while avoiding disadvantages
  – Parallelism demarcated by the compiler
  – Many special instructions & features for exploiting ILP in the compiler
• Itanium
  – First implementation (2001)
  – 25M transistors
  – 800 MHz
  – 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today’s Uniprocessor ILP Menu
• Software techniques
  – Static scheduling
  – Static issue (i.e., VLIW)
  – Static branch prediction
  – Alias/pointer analysis
  – Static speculation
  Lower hardware complexity; more longer-range analysis; more machine dependence
• Hardware techniques
  – Dynamic scheduling
  – Dynamic issue (i.e., superscalar)
  – Dynamic branch prediction
  – Dynamic disambiguation
  – Dynamic speculation
  More stable performance; higher complexity; potential clock rate impact
• Wide variety of approaches, both hardware- and compiler-intensive
• No clear-cut winners at present
Big Picture--ILP and Memory Systems
[Figure: two “mountains” of techniques. ILP mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year (1965-1995) for supercomputers, mainframes, minicomputers, and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year (1970-2005) for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000, with the eras marked as bit-level parallelism, instruction-level parallelism, and thread-level parallelism.]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel’s processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP).
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyper-Threading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today’s processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
They support 2-4 threads, but expect to get only a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique to obtain more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
4–8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures

Characteristic                                            Superscalar   Simultaneous multithreading   Chip multiprocessor
Number of CPUs                                            1             1                             8
CPU issue width                                           12            12                            2 per CPU
Number of threads                                         1             8                             1 per CPU
Architecture registers (for integer and floating point)   32            32 per thread                 32 per CPU
Physical registers (for integer and floating point)       32 + 256      256 + 256                     32 + 32 per CPU
Instruction window size                                   256           256                           32 per CPU
Branch predictor table size (entries)                     32,768        32,768                        8 x 4,096
Return stack size                                         64 entries    64 entries                    8 x 8 entries
Instruction (I) and data (D) cache organization           1 x 8 banks   1 x 8 banks                   1 bank
I and D cache sizes                                       128 Kbytes    128 Kbytes                    16 Kbytes per CPU
I and D cache associativities                             4-way         4-way                         4-way
I and D cache line sizes (bytes)                          32            32                            32
I and D cache access times (cycles)                       2             2                             1
Secondary cache organization                              1 x 8 banks   1 x 8 banks                   1 x 8 banks
Secondary cache size (Mbytes)                             8             8                             8
Secondary cache associativity                             4-way         4-way                         4-way
Secondary cache line size (bytes)                         32            32                            32
Secondary cache access time (cycles)                      5             5                             7
Secondary cache occupancy per access (cycles)             1             1                             1
Memory organization (no. of banks)                        4             4                             4
Memory access time (cycles)                               50            50                            50
Memory occupancy per access (cycles)                      13            13                            13
The microprocessor tomorrow
When we say “microprocessor” tomorrow, we generally mean the shaded area of the figure.
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that “Where you stand depends on where you sit.”
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today’s graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now – some examples
• Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
• Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
• Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
• Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. These, too, are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching,
• distance learning,
• electronic books, and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Amount of Reducing the Power Dissipation
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] plotted against supply voltage [V], with ticks at 0.6, 3.0, and 5.0 V.]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed.
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing.
Low-Power Design Techniques
The basic idea is:
Decreasing the activity of some parts within the VLSI IC.
The term power management refers to such techniques in general.
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components.
General Approaches to Reduce Power
a) Reduction in fCLK is an acceptable option when some components may be idle or low-active during operation.
b) Reduction in Vdd is the most effective way to reduce power, since the power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay.
c) The product pt·CL is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit, or technology level.
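The three levers above correspond to the factors of the commonly used expression for dynamic CMOS power, P = pt · CL · Vdd² · fCLK. The sketch below assumes that this is the relation the slides refer to; the numeric values are hypothetical and the delay penalty of a lower Vdd is not modeled.

```python
def dynamic_power(p_t, c_l, v_dd, f_clk):
    """Dynamic power in watts: switching probability x load capacitance [F]
    x supply voltage squared [V^2] x clock frequency [Hz]."""
    return p_t * c_l * v_dd ** 2 * f_clk


# Hypothetical baseline: p_t = 0.5, C_L = 1 nF, Vdd = 5.0 V, f_CLK = 100 MHz.
baseline = dynamic_power(0.5, 1e-9, 5.0, 100e6)

# a) halving the clock frequency, b) lowering Vdd to 3.0 V,
# c) halving the average switched capacitance per cycle (p_t * C_L).
half_clock = dynamic_power(0.5, 1e-9, 5.0, 50e6)
low_vdd = dynamic_power(0.5, 1e-9, 3.0, 100e6)
half_cap = dynamic_power(0.25, 1e-9, 5.0, 100e6)

for label, power in (("f_CLK/2", half_clock), ("Vdd=3.0V", low_vdd), ("p_t*C_L/2", half_cap)):
    print(label, round(power / baseline, 2))
```

Lowering Vdd from 5.0 V to 3.0 V alone brings the power down to 36 percent of the baseline, which is the quadratic effect that makes voltage reduction the most effective lever.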
Low Power and Low Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology. The higher the level, the higher the impact and the more options there are.
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
This is a less aggressive approach that attracts more attention.
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four local clock generators (PLL 1/DLL 1 to PLL 4/DLL 4) with output frequencies labeled f11, f12, f21, f31, f32, and f41.]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation: phase detector (inputs CLKREF and CLKFB, outputs up and down), current pump, loop filter, VCO, and divider by N, with regulated voltage, clock distribution & frequency multiplier, and logic of the digital system.]
[Figure: DLL-based clock generation: phase detector (inputs CLKREF and CLKFB, outputs up and down), current pump, loop filter, and control of a VCDL built from delay cells DC1, DC2, ..., DCn (in, out, delay TVCDL), producing f1, 2f1, ..., nf1 for the digital system.]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating: a latch combines enable and clock into a gated clock that drives the target flip-flops (DFF with inputs D and C and output Q), which are thereby activated or deactivated. Clock distribution: a PLL (clock generator) distributes Clk to modules A, B, C, and D, gated by Enable_A, Enable_B, and Enable_C.]
– The use of a gated clock is the most common approach to reduce energy. Unused modules are turned off by suppressing the clock to the module.
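The effect of a gated clock can be illustrated with a small behavioral simulation. This is a sketch under assumptions that are not spelled out on the slide: the gating cell is taken to be a latch that is transparent while the clock is low, followed by an AND gate, and signals are modeled as lists of 0/1 samples.

```python
def gated_clock(clock, enable):
    """Return the gated clock for sampled clock and enable waveforms.

    The enable is captured by a latch that is transparent while the clock is
    low, so changes of enable during the high phase cannot cause glitches.
    """
    latched = 0
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en  # latch follows enable only while the clock is low
        out.append(clk & latched)
    return out


def rising_edges(signal):
    """Count 0 -> 1 transitions, i.e. the edges that clock the flip-flops."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)


clock = [0, 1] * 8             # eight clock cycles
enable = [1] * 8 + [0] * 8     # module needed only during the first four cycles
gated = gated_clock(clock, enable)
print(rising_edges(clock), rising_edges(gated))
```

In this example the target flip-flops see only four of the eight clock edges, so the clock-related switching in the idle module is suppressed for the remaining cycles.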
Energy Minimazation Using Energy Minimazation Using Multiple Supply VoltageMultiple Supply Voltage
bull Multiple supply voltage on the chip as less aggressive approach is attracting attention
bull This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
bull This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management System Level Dynamic Power Management as another Techniques to Reduce Poweras another Techniques to Reduce Power
Dynamic power management is design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
RUN
SLEEPIDLE~90s
~10s 160ms
Wait for interrupt Wait for wake-up event
P=400mW
~90s~10s
P=50mW P=016mW
OBSERVER CONTROLLER
Workloadinformation
Power Manager
SYSTEM
Observations Commands
Power Manager Power State Machine
Power Breakdown in High-Performance Power Breakdown in High-Performance CPU and Dynamic Instruction StatisticsCPU and Dynamic Instruction Statistics
12
16
25
43
Clock
Memory
Control IODatapath
1513
51
23
43
Compare op
Logical op
Others
Data Move
Control Flow
Arithmetic op
Power breakdown Dynamic instruction statistics
Architecture Trade-offs ndash Architecture Trade-offs ndash Reference DatapathReference Datapath
Parallel DatapathParallel Datapath
The More Parallel the BetterThe More Parallel the Better
Pipeline DatapathPipeline Datapath
Architecture Summary for a SimpleArchitecture Summary for a Simple
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Microprocessorsrsquo Microprocessorsrsquo GenerationsGenerations
bullFirst generation 1971-78
bullSecond Generation 1979-85
bullThird Generation 1985-89
bullFourth Generation 1990-
ndashBehind the power curve
ndashBecoming ldquorealrdquo computers
ndashChallenging the ldquoestablishmentrdquo
ndashArchitectural and performance leadership
The microprocessor todayThe microprocessor today When we say ldquomicroprocessorrdquo today we generally mean the shaded area of the figure
The First Generation 1971-78The First Generation 1971-78
Getting enough bits and transistors Transistor counts lt 50000 Performance lt 05 MIPS Architecture 8-16 bits
ndash Narrow datapaths (= slow performance)ndash Awkward architecturesndash Assembly language + some BASIC
Processorsndash Intel 4004 8008 8080 8086ndash Zilog Z-80ndash Motorola 6800 6502
Intel 4004Intel 4004 First general-purpose
single-chip microprocessor Shipped in 1971 8-bit architecture 4-bit
implementation 2300 transistors Performance lt 01 MIPS 8008 8-bit implementation
in 1972ndash 3500 transistorsndash First microprocessor-based
computer (Micral) Targeted at laboratory
instrumentation Mostly sold in Europe
Intel 8080Intel 8080 Intelrsquos first 16-bit architecture
ndash Delivered in 1974ndash 4800 transistorsndash Performance lt 02 MIPS
Used in Altair 8800 system ndash Kit form (advertised in Popular
Electronics) in 1975 $297 or $395 with case 256 bytes of memory
expandable to 64K Keyboard and floppy 100-line bus becomes S-100
first microcomputer busndash Gates amp Allen write BASICndash Wozniak builds one
Homebrew Computer Club
Intel 8086Intel 8086
Introduced in 1978ndash Performance lt 05 MIPS
New 16-bit architecturendash ldquoAssembly languagerdquo
compatible with 8080ndash 29000 transistorsndash Includes memory protection
support for FP coprocessor In 1981 IBM introduces
PC ndash Based on 8088--8-bit bus
version of 8086
Second Generation 1979-85Second Generation 1979-85 Becoming ldquorealrdquo computers
ndash First 32-bit architecture (68000)ndash First virtual memory supportndash Workstations Macs and PCs based on microprocessors
Transistors gt50000 Performance lt= 1 MIPS Processors
ndash Motorola 68000 68020ndash Intel 80286 80386
Motorola 68000Motorola 68000 Major architectural step in
microprocessorsndash First 32-bit architecture
initial 16-bit implementation
ndash First flat 32-bit address Support for paging
ndash General-purpose register architecture
Loosely based on PDP-11
First implementation in 1979ndash 68000 transistorsndash lt 1 MIPS
Used inndash Apple Macndash Sun Silicon Graphics amp Apollo
workstations
Third Generation 1985-89Third Generation 1985-89 Challenging the ldquoestablishmentrdquo
ndash Microprocessors surpass minicomputers in performance rival mainframes
ndash Implementation technology of choice all new architectures are microprocessors
ndash RISC architecture techniques take hold Transistors lt 500K Performance gt 5 MIPS Processors
ndash MIPS R2000 R3000ndash Sun SPARCndash HP PA-RISC
MIPS R2000MIPS R2000 Several firsts
ndash First RISC microprocessor
ndash First microprocessor to provide integrated support for instruction amp data cache
ndash First pipelined microprocessor (sustains 1 instructionclock)
Implemented in 1985ndash 125000 transistorsndash 5-8 MIPS
Fourth Generation 1990-Fourth Generation 1990- Architectural and performance leadership
ndash First 64-bit architecturendash First multiple-issue machinendash First multilevel caches
Transistors gt1M Clock ratesgt 100MHz Performance gt 50 MIPS Processors
ndash Intel i860 Pentium MIPS R4000 MIPS R1000 DEC Alpha Sun UltraSPARC HP PA-RISC PowerPC
Generation 45 ndash same basic approach but faster clock rates amp wider issuendash Alpha 21264 Pentium III amp 4 Intel Itanium
Key Architectural TrendsKey Architectural Trends Increase performance at 16x per year
ndash True from 1985-present Combination of technology and architectural
enhancementsndash Technology provides faster transistors and more of themndash Faster transistors leads to high clock ratesndash More transistors
Architectural ideas turn transistors into performancendash Responsible for about half the yearly performance growth
Two key architectural directionsndash Sophisticated memory hierarchiesndash Exploiting instruction level parallelism
Memory HierarchiesMemory Hierarchies Caches hide latency of DRAM and increase BW
ndash CPU-DRAM access gap has grown by a factor of 30-50 Trend 1 Increasingly large caches
ndash On-chip from 128 bytes (1984) to 100K+ bytesndash Multilevel caches add another level of caching
First multilevel cache1986 Secondary cache sizes today 128KB to 4-16 MB
Trend 2 Advances in caching techniquesndash Reduce or hide cache miss latencies
early restart after cache miss (1992) nonblocking caches continue during a cache miss (1994)
ndash Cache aware combos computers compilers code writers prefetching instruction to bring data into cache early
Exploiting ILPExploiting ILP ILP is the implicit parallelism among instructions Exploited by
ndash Overlapping execution in a pipeline ndash Issuing multiple instruction per clock
superscalar uses dynamic issue decision (HW driven) VLIW uses static issue decision (SW driven)
1985 simple microprocessor pipeline (1 instrclock) 1990 first static multiple issue microprocessors 1995 sophisticated dynamic schemes
ndash determine parallelism dynamicallyndash execute instructions out-of-orderndash speculative execution depending on branch prediction
ldquoOff-the-shelfrdquo ILP techniques yielded 20 year path
MIPS R4000MIPS R4000 First 64-bit architecture Integrated caches
ndash On-chipndash Support for off-chip secondary
cache Integrated floating point Implemented in 1991
ndash Deep pipelinendash 14M transistorsndash Initially 100MHzndash gt 50 MIPS
Intel i860Intel i860
First multiple issue microprocessor ndash 2 instructionsclockndash Dual issue mode ndash Novel push pipelinendash Novel cache bypass
Implemented in 1991ndash 13M transistorsndash 50 mips
Used primarily as attached processor (eg graphics)
MIPS R10000MIPS R10000
First speculative processorndash Instruction scheduled and
executed out-of-orderndash Up to 4 instructions can
complete per clockndash Window of 32 instructions
(up to 32 in-flight)ndash Maintain precise state by
completing instructions in order
Implemented in 1996ndash 68M transistorsndash 200 MHz
Intel IA-64 and ItaniumIntel IA-64 and Itanium
EPIC architecturendash Use compiler centric approach
while avoiding disadvantagesndash Parallelism demarcated by the
compilerndash Many special instruction amp
features for exploiting ILP in the compiler
Itaniumndash First implementation (2001) ndash 25 M transistorsndash 800 MHzndash 130 Watts
Breakdown of tasks between Breakdown of tasks between compiler and runtime hardwarecompiler and runtime hardware
Todayrsquos Uniprocessor ILP Todayrsquos Uniprocessor ILP MenuMenu
Software Techniquesndash Static schedulingndash Static issue (ie VLIW)ndash Static branch predictionndash Aliaspointer analysisndash Static speculation
Hardware Techniquesndash Dynamic schedulingndash Dynamic issue (ie superscalar)ndash Dynamic branch predictionndash Dynamic disambiguationndash Dynamic speculation
Wide variety of approaches both hardware and compiler intensive
Lower hardware complexity
More longer range analysis
More machine dependence
Lower hardware complexity
More longer range analysis
More machine dependence
More stable performance
Higher complexity
Potential clock rate impact
More stable performance
Higher complexity
Potential clock rate impact
No clear cut winners at the present
Big Picture--ILP and Memory Big Picture--ILP and Memory SystemsSystems
Simplepipelining
Scheduledpipelines
Multipleissue
Dynamicschedulin
g
Speculation
My view
bullNo performance wall but steeper slopes ahead
bullEasier territory is behind us
bullIndustry-research gap vanished
bullEnergy efficiency may be key limit
ILPMountai
n
Multilevelcaches amp buffers
Critical word amp early restart
Compilerprefetchin
g
Multipathprefetchin
g
Simplecaches
CacheMountai
n
Per
form
ance
01
1
10
100
1965 1970 1975 1980 1985 1990 1995
Supercomputers
Minicomputers
Mainframes
Microprocessors
Microprocessors today where Microprocessors today where they are and what can dothey are and what can do
Tran
sist
ors
1000
10000
100000
1000000
10000000
100000000
1970 1975 1980 1985 1990 1995 2000 2005
Bit-level parallelism Instruction-level Thread-level ()
i4004
i8008i8080
i8086
i80286
i80386
R2000
Pentium
R10000
R3000
Microprocessors where they goMicroprocessors where they go
Intel more TransistorIntel more Transistor
Intel Faster DevicesIntel Faster Devices
Number of Transistors in Number of Transistors in Intelrsquos processorsIntelrsquos processors
Higher level parallelismHigher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The more prononuced are
a) simultaneons multithreaded (SMT) processor and
b) chip multiprocessors (CMT)
MultithreadingMultithreading
Microprocessor can execute multiple operations at a time4 or 6 operations per cycle
Hard to achieve this level of parallelism from single program
Can we run multiple programs (threads) on (single) processor without much effort
Simultaneous multithreading (SMT) or Hyperthreading is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4)
They support 2-4 threads, but expect only about a 1.3x improvement in throughput
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa-2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute:
- independent programs
- explicitly parallel programs (when possible)
- speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view on the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education - not just the undergraduate curriculum - organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experience is primarily in information technology (both in academia and in industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
But, as we said earlier, engineering is changing
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. They, too, are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
Web-based teaching, distance learning, electronic books and interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Gate Delay and Power Dissipation in Terms of Supply Voltage
[Figure: normalized gate delay [ns] and normalized power dissipation [W] plotted against supply voltage [V]]
Needs for Low-Power
• Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
• The main interest of many researchers is now oriented towards lowering the energy dissipation of these systems while still maintaining high throughput in real-time processing
Low-Power Design Techniques
The basic idea is:
Decreasing the activity of some parts within a VLSI IC
The term power management refers to such techniques in general
Applying power management to a design typically involves two steps:
a) identifying idle or low-activity conditions for various parts of the circuit, and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-activity components
General Approaches to Reduce Power
a) Reduction in fCLK is an option that is acceptable when some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way to reduce power, since power is proportional to the square of Vdd. The problem with reducing Vdd is that it leads to an increase in circuit delay
c) The product pt·CL is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit or technology level
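The three options above all act on the standard CMOS dynamic power relation P = pt · CL · Vdd² · fCLK. The following is a minimal Python sketch of that relation; the function name and all numeric values are hypothetical and chosen only to show how each option scales the result.

```python
def dynamic_power(pt, c_load, vdd, f_clk):
    """Average dynamic power of a CMOS circuit: P = pt * CL * Vdd^2 * fCLK.

    pt     -- switching activity (probability of a transition per cycle)
    c_load -- load capacitance in farads
    vdd    -- supply voltage in volts
    f_clk  -- clock frequency in hertz
    """
    return pt * c_load * vdd ** 2 * f_clk


# Hypothetical baseline: 1 nF of load, 3.3 V supply, 100 MHz clock, 20% activity.
base = dynamic_power(pt=0.2, c_load=1e-9, vdd=3.3, f_clk=100e6)

half_clock = dynamic_power(0.2, 1e-9, 3.3, 50e6)    # option a): halve fCLK
half_vdd = dynamic_power(0.2, 1e-9, 1.65, 100e6)    # option b): halve Vdd
half_cap = dynamic_power(0.1, 1e-9, 3.3, 100e6)     # option c): halve pt*CL

for label, value in [("baseline", base), ("fCLK / 2", half_clock),
                     ("Vdd / 2", half_vdd), ("pt*CL / 2", half_cap)]:
    print(f"{label:10s} {value:.4f} W  ({value / base:.2f} of baseline)")
```

Halving the clock or the switched capacitance halves the power, while halving the supply voltage cuts it to a quarter. The sketch does not model the increase in circuit delay that comes with a lower Vdd.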
Low Power and Low Energy System Design
The design of low-power circuits can be tackled at different levels, from system to technology
[Figure annotations: higher impact, more options]
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
Multiple Frequency on the Chip as Technique to Reduce Power
A less aggressive approach, which is attracting more attention
This technique is commonly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL driven by fCLK feeds four local clock generators (PLL 1/DLL 1 to PLL 4/DLL 4), which produce the local clock frequencies f11, f12, f21, f31, f32 and f41]
Energy Minimization Using Multiple Frequency
[Figure, PLL based: blocks labeled phase detector, current pump, loop filter, VCO, divider by N, clock distribution & frequency multiplier logic, and digital system; signals CLKREF, CLKFB, up, down and regulated voltage]
[Figure, DLL based: blocks labeled phase detector, current pump, loop filter, control, VCDL with delay cells DC1, DC2, ..., DCn (in, out; delay TVCDL), and digital system; signals CLKREF, CLKFB, up, down; outputs f1, 2f1, ..., nf1]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure, clock gating: enable and clock signals produce a gated clock for the target flip-flops (DFF with D, C and Q), with a latch on the enable path; the gated-clock waveform alternates between activated and deactivated intervals]
[Figure, clock distribution: a PLL (Clk generator) distributes Clk to modules A, B, C and D, with gating signals Enable_A, Enable_B and Enable_C]
The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
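The gating idea can be sketched as a small behavioral simulation. The Python sketch below assumes a latch-based gate: the enable is captured by a latch that is transparent while the clock is low, and the gated clock is the AND of the clock and the latched enable. The function names and the sample waveforms are hypothetical, and this is a behavioral illustration rather than a circuit-level model.

```python
def gated_clock_trace(clock, enable):
    """Latch-based clock gating.

    The enable is captured by a latch that is transparent only while the clock
    is low, so changes of enable during the high phase cannot glitch the output.
    gated_clock = clock AND latched_enable.
    """
    latched = 0
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:            # latch follows enable only in the low phase
            latched = en
        out.append(clk & latched)
    return out


def rising_edges(trace):
    """Count 0 -> 1 transitions, i.e. the edges that clock the target flip-flops."""
    return sum(1 for prev, cur in zip(trace, trace[1:]) if prev == 0 and cur == 1)


# Six clock cycles sampled twice per cycle; enable is a hypothetical activity pattern.
clock = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
enable = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0]

gated = gated_clock_trace(clock, enable)
print("gated clock:", gated)
print("clock edges seen by module:", rising_edges(gated), "of", rising_edges(clock))
```

Each suppressed edge is an edge on which the module's flip-flops and its local clock wiring do not switch, which is where the energy saving comes from.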
Energy Minimization Using Multiple Supply Voltage
• Multiple supply voltages on the chip, as a less aggressive approach, are attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
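A minimal Python sketch of the assignment idea described above: modules with enough timing slack move to the lower supply, while critical-path modules stay at the high one. The two voltage levels, the delay penalty at the low supply, the module names and their delays are all hypothetical, and level-converter overhead is ignored.

```python
V_HIGH, V_LOW = 3.3, 2.5    # hypothetical supply levels in volts
DELAY_PENALTY = 1.5         # assumed slowdown of a module when run at V_LOW


def assign_voltages(modules):
    """modules maps name -> (delay_at_v_high_ns, allowed_delay_ns)."""
    assignment = {}
    for name, (delay, budget) in modules.items():
        # Use the low supply only if the slower module still meets its budget.
        fits_at_low = delay * DELAY_PENALTY <= budget
        assignment[name] = V_LOW if fits_at_low else V_HIGH
    return assignment


def relative_energy(assignment):
    """Switching energy relative to running everything at V_HIGH (E ~ Vdd^2),
    assuming every module switches the same capacitance."""
    total = sum(v ** 2 for v in assignment.values())
    return total / (len(assignment) * V_HIGH ** 2)


modules = {
    "multiplier": (9.0, 10.0),   # on the critical path: almost no slack
    "control": (4.0, 10.0),
    "io_buffer": (3.0, 10.0),
}

plan = assign_voltages(modules)
print(plan)
print(f"energy relative to single-supply design: {relative_energy(plan):.2f}")
```

Only the multiplier stays at the high supply in this example; the other two modules tolerate the slowdown and contribute the energy saving.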
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
[Figure, power manager: an observer and a controller; the observer receives observations and workload information from the system, and the controller issues commands to it]
[Figure, power state machine: states RUN (P = 400 mW), IDLE (P = 50 mW) and SLEEP (P = 0.16 mW); IDLE is entered on "wait for interrupt" and SLEEP on "wait for wake-up event"; transitions take about 10 µs between RUN and IDLE, about 90 µs into SLEEP, and 160 ms from SLEEP back to RUN]
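The power state machine can be turned into a simple decision rule for the power manager: for an expected idle period, pick the state that costs the least energy. The Python sketch below uses the power values from the figure. The microsecond unit on the ~10 and ~90 transition times is an assumption (the unit prefix is not legible in the source text), and charging transitions at RUN power is a simplifying assumption, as are the function names.

```python
# Power in milliwatts for each state, as labeled in the power state machine.
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}

# Transition latencies in seconds. ASSUMPTION: the ~10 and ~90 values are
# microseconds; only the 160 ms wake-up carries an explicit unit in the source.
LATENCY_S = {
    ("RUN", "IDLE"): 10e-6,
    ("IDLE", "RUN"): 10e-6,
    ("RUN", "SLEEP"): 90e-6,
    ("SLEEP", "RUN"): 160e-3,
}


def energy_mj(state, idle_seconds):
    """Energy in mJ over an idle period if the system drops from RUN into
    `state` and later returns to RUN. Transitions are charged at RUN power."""
    t_trans = LATENCY_S[("RUN", state)] + LATENCY_S[(state, "RUN")]
    t_low = max(idle_seconds - t_trans, 0.0)
    return POWER_MW["RUN"] * t_trans + POWER_MW[state] * t_low   # mW * s = mJ


def best_state(idle_seconds):
    """Low-power state with the smallest energy cost for this idle period."""
    return min(("IDLE", "SLEEP"), key=lambda s: energy_mj(s, idle_seconds))


def break_even_seconds():
    """Idle length above which SLEEP costs less energy than IDLE in this model."""
    t_i = LATENCY_S[("RUN", "IDLE")] + LATENCY_S[("IDLE", "RUN")]
    t_s = LATENCY_S[("RUN", "SLEEP")] + LATENCY_S[("SLEEP", "RUN")]
    p_run, p_idle, p_sleep = POWER_MW["RUN"], POWER_MW["IDLE"], POWER_MW["SLEEP"]
    return (t_s * (p_run - p_sleep) - t_i * (p_run - p_idle)) / (p_idle - p_sleep)


for idle in (0.1, 1.0, 2.0, 10.0):
    print(f"idle {idle:5.1f} s -> {best_state(idle)}")
print(f"break-even idle time: {break_even_seconds():.2f} s")
```

Under these assumptions the long wake-up time dominates: SLEEP only pays off for idle periods longer than roughly a second, which is why the observer's workload prediction matters.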
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: power breakdown pie chart with slices for clock, memory, control, I/O and datapath]
[Figure: dynamic instruction statistics pie chart with slices for compare, logical, data move, control flow, arithmetic and other operations]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Microprocessors' Generations
• First Generation, 1971-78: behind the power curve
• Second Generation, 1979-85: becoming "real" computers
• Third Generation, 1985-89: challenging the "establishment"
• Fourth Generation, 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation, 1971-78
Getting enough bits and transistors
Transistor counts < 50,000
Performance < 0.5 MIPS
Architecture: 8-16 bits
- Narrow datapaths (= slow performance)
- Awkward architectures
- Assembly language + some BASIC
Processors
- Intel 4004, 8008, 8080, 8086
- Zilog Z-80
- Motorola 6800, MOS Technology 6502
Intel 4004
First general-purpose single-chip microprocessor
Shipped in 1971
8-bit architecture, 4-bit implementation
2,300 transistors
Performance < 0.1 MIPS
8008: 8-bit implementation in 1972
- 3,500 transistors
- First microprocessor-based computer (Micral)
  Targeted at laboratory instrumentation
  Mostly sold in Europe
Intel 8080
8-bit architecture with 16-bit addressing
- Delivered in 1974
- 4,800 transistors
- Performance < 0.2 MIPS
Used in Altair 8800 system
- Kit form (advertised in Popular Electronics) in 1975
  $297, or $395 with case
  256 bytes of memory, expandable to 64K
  Keyboard and floppy
  100-line bus becomes S-100, the first microcomputer bus
- Gates & Allen write BASIC
- Wozniak builds one
  Homebrew Computer Club
Intel 8086
Introduced in 1978
- Performance < 0.5 MIPS
New 16-bit architecture
- "Assembly language" compatible with 8080
- 29,000 transistors
- Includes memory protection, support for FP coprocessor
In 1981, IBM introduces the PC
- Based on the 8088, an 8-bit bus version of the 8086
Second Generation, 1979-85
Becoming "real" computers
- First 32-bit architecture (68000)
- First virtual memory support
- Workstations, Macs and PCs based on microprocessors
Transistors > 50,000
Performance <= 1 MIPS
Processors
- Motorola 68000, 68020
- Intel 80286, 80386
Motorola 68000
Major architectural step in microprocessors
- First 32-bit architecture (initial 16-bit implementation)
- First flat 32-bit address
  Support for paging
- General-purpose register architecture
  Loosely based on PDP-11
First implementation in 1979
- 68,000 transistors
- < 1 MIPS
Used in
- Apple Mac
- Sun, Silicon Graphics & Apollo workstations
Third Generation, 1985-89
Challenging the "establishment"
- Microprocessors surpass minicomputers in performance, rival mainframes
- Implementation technology of choice: all new architectures are microprocessors
- RISC architecture techniques take hold
Transistors < 500K
Performance > 5 MIPS
Processors
- MIPS R2000, R3000
- Sun SPARC
- HP PA-RISC
MIPS R2000
Several firsts
- First RISC microprocessor
- First microprocessor to provide integrated support for instruction & data cache
- First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
- 125,000 transistors
- 5-8 MIPS
Fourth Generation, 1990-
Architectural and performance leadership
- First 64-bit architecture
- First multiple-issue machine
- First multilevel caches
Transistors > 1M
Clock rates > 100 MHz
Performance > 50 MIPS
Processors
- Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
Generation 4.5: same basic approach but faster clock rates & wider issue
- Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
Increase performance at 1.6x per year
- True from 1985 to present
Combination of technology and architectural enhancements
- Technology provides faster transistors and more of them
- Faster transistors lead to high clock rates
- More transistors
Architectural ideas turn transistors into performance
- Responsible for about half the yearly performance growth
Two key architectural directions
- Sophisticated memory hierarchies
- Exploiting instruction-level parallelism
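To get a feel for what a sustained yearly growth factor means, the Python sketch below compounds it over several years and computes the implied doubling time. It assumes the growth figure on this slide is 1.6x per year; the function names are hypothetical, and the even multiplicative split between technology and architecture is only one possible reading of "about half".

```python
import math

ANNUAL_GROWTH = 1.6   # assumed yearly performance growth factor


def cumulative_speedup(years, rate=ANNUAL_GROWTH):
    """Total speedup after compounding the yearly factor for `years` years."""
    return rate ** years


def doubling_time_years(rate=ANNUAL_GROWTH):
    """Years needed for performance to double at the given yearly factor."""
    return math.log(2) / math.log(rate)


def even_split_factor(rate=ANNUAL_GROWTH):
    """Yearly factor each of two equal multiplicative contributors would supply
    (one reading of architecture providing about half of the growth)."""
    return math.sqrt(rate)


for years in (1, 5, 10, 15):
    print(f"{years:2d} years -> {cumulative_speedup(years):8.1f}x")
print(f"doubling time: {doubling_time_years():.2f} years")
print(f"per-contributor factor under an even split: {even_split_factor():.3f}x per year")
```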
Memory Hierarchies
Caches hide latency of DRAM and increase bandwidth
- CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
- On-chip: from 128 bytes (1984) to 100K+ bytes
- Multilevel caches add another level of caching
  First multilevel cache: 1986
  Secondary cache sizes today: 128 KB to 4-16 MB
Trend 2: Advances in caching techniques
- Reduce or hide cache miss latencies
  early restart after cache miss (1992)
  nonblocking caches: continue during a cache miss (1994)
- Cache-aware combos: computers, compilers, code writers
  prefetching: instruction to bring data into cache early
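The benefit of adding a cache level can be estimated with the textbook average memory access time (AMAT) formula, hit time plus miss rate times miss penalty, applied once per level. The Python sketch below is an illustration only: the cycle counts and miss rates are hypothetical values, not measurements from the slides.

```python
def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_time):
    """Average memory access time in cycles for a two-level cache hierarchy.

    AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * memory time)
    """
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_time)


# Hypothetical machine: 2-cycle primary cache, 5-cycle secondary cache,
# 50-cycle main memory, 5% primary miss rate, 20% local secondary miss rate.
single_level = amat(2, 0.05, 0, 1.0, 50)   # no L2: every L1 miss goes to memory
two_level = amat(2, 0.05, 5, 0.2, 50)

print(f"single-level AMAT: {single_level:.2f} cycles")
print(f"two-level AMAT:    {two_level:.2f} cycles")
```

The same function shows why the growing CPU-DRAM gap pushes toward multilevel caches: raising `mem_time` hurts the single-level case in proportion to the full primary miss rate, while the secondary cache filters most of those misses.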
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
- Overlapping execution in a pipeline
- Issuing multiple instructions per clock
  superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
- determine parallelism dynamically
- execute instructions out-of-order
- speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded a 20-year path
MIPS R4000
First 64-bit architecture
Integrated caches
- On-chip
- Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
- Deep pipeline
- 1.4M transistors
- Initially 100 MHz
- > 50 MIPS
Intel i860
First multiple-issue microprocessor
- 2 instructions/clock
- Dual issue mode
- Novel push pipeline
- Novel cache bypass
Implemented in 1991
- 1.3M transistors
- 50 MIPS
Used primarily as attached processor (e.g., graphics)
MIPS R10000
First speculative processor
- Instructions scheduled and executed out-of-order
- Up to 4 instructions can complete per clock
- Window of 32 instructions (up to 32 in flight)
- Maintains precise state by completing instructions in order
Implemented in 1996
- 6.8M transistors
- 200 MHz
Intel IA-64 and Itanium
EPIC architecture
- Uses a compiler-centric approach while avoiding its disadvantages
- Parallelism demarcated by the compiler
- Many special instructions & features for exploiting ILP in the compiler
Itanium
- First implementation (2001)
- 25M transistors
- 800 MHz
- 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software Techniques
- Static scheduling
- Static issue (i.e., VLIW)
- Static branch prediction
- Alias/pointer analysis
- Static speculation
Lower hardware complexity; more longer-range analysis; more machine dependence
Hardware Techniques
- Dynamic scheduling
- Dynamic issue (i.e., superscalar)
- Dynamic branch prediction
- Dynamic disambiguation
- Dynamic speculation
More stable performance; higher complexity; potential clock rate impact
Wide variety of approaches, both hardware and compiler intensive
No clear-cut winners at present
Big Picture--ILP and Memory Big Picture--ILP and Memory SystemsSystems
Simplepipelining
Scheduledpipelines
Multipleissue
Dynamicschedulin
g
Speculation
My view
bullNo performance wall but steeper slopes ahead
bullEasier territory is behind us
bullIndustry-research gap vanished
bullEnergy efficiency may be key limit
ILPMountai
n
Multilevelcaches amp buffers
Critical word amp early restart
Compilerprefetchin
g
Multipathprefetchin
g
Simplecaches
CacheMountai
n
Per
form
ance
01
1
10
100
1965 1970 1975 1980 1985 1990 1995
Supercomputers
Minicomputers
Mainframes
Microprocessors
Microprocessors today where Microprocessors today where they are and what can dothey are and what can do
Tran
sist
ors
1000
10000
100000
1000000
10000000
100000000
1970 1975 1980 1985 1990 1995 2000 2005
Bit-level parallelism Instruction-level Thread-level ()
i4004
i8008i8080
i8086
i80286
i80386
R2000
Pentium
R10000
R3000
Microprocessors where they goMicroprocessors where they go
Intel more TransistorIntel more Transistor
Intel Faster DevicesIntel Faster Devices
Number of Transistors in Number of Transistors in Intelrsquos processorsIntelrsquos processors
Higher level parallelismHigher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The more prononuced are
a) simultaneons multithreaded (SMT) processor and
b) chip multiprocessors (CMT)
MultithreadingMultithreading
Microprocessor can execute multiple operations at a time4 or 6 operations per cycle
Hard to achieve this level of parallelism from single program
Can we run multiple programs (threads) on (single) processor without much effort
Simultaneous multithreading (SMT) or Hyperthreading is a solution
Parallel Thread Sequencing ModelParallel Thread Sequencing Model
Principles of SMTPrinciples of SMT
Multithreading in todayrsquos Multithreading in todayrsquos processorsprocessors
Today many high-end microprocessors are multithreaded (eg Intel Pentium 4)
Support for 2-4 threads but expect to get only 13X improvement in throughput
Chip MultiprocessorChip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model
CMP is a simple very powerful techiniques to obtain more performance in a power-effecient manner
The idea is to put several microprocessors on a single die This type of architecture is reffered also as Multiprocessor System-on-Chip (MPSoC)
The performance of small-scale CMP scales close to linear with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model - continue - continue
CMP is an atractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications we meet in network processors multimedia hubs signal processors etc
MPSoCs are usually implemented as heterogenous systems
CMT and SMT can coexist-a CMP die can integrate several SMT microprocessors
Generic circa 2010 Generic circa 2010 MicroprocessorMicroprocessor
4 ndash 8 general-purpose processing engines on chip used to execute independent programs
Explicitly parallel programs (when possible) Speculatively parallel threads
Special-purpose processing units (eg DSP functionality)
Elaborate memory hierarchyElaborate inter-chip communication facilities
Characteristics of superscalar simultaneous Characteristics of superscalar simultaneous multithreading and chip multiprocessor architecturesmultithreading and chip multiprocessor architectures
Characteristic SuperscalarSimultaneousmultithreading
Chipmultiprocessor
Number of CPUs 1 1 8CPU issue width 12 12 2 per CPUNumber of threads 1 8 1 per CPUArchitecture registers (for integer and floating point) 32 32 per thread 32 per CPUPhysical registers (for integer and floating point 32 + 256 256 + 256 32 + 32 per CPUInstruction window size 256 256 32 per CPUBranch predictor table size (entries) 32768 32768 8x4096Return stack size 64 entries 64 entries 8x8 entriesInstruction (I) and data (D) cache organization 1x8 banks 1x8 banks 1 bankI and D cache sizes 128 kbytes 128 kbytes 16 kbytes per CPUI and D cache associativities 4-way 4-way 4-wayI and 0 cache line sizes (bytes) 32 32 32I and P cache access times (cycles) 2 2 1Secondary cache organization (Mbytes) 1x8 banks 1x8 banks 1x8 banksSecondary cache size (bytes) 8 8 8Secondary cache associativity 4-way 4-way 4-waySecondary cache line size (bytes) 32 32 32Secondary cache access time (cycles) 5 5 7Secondary cache occupancy per access (cycles) 1 1 1Memory organization (no of banks) 4 4 4Memory access time (cycles) 50 50 50Memory occupancy per access (cycles) 13 13 13
The microprocessor tomorrowThe microprocessor tomorrow When we say ldquomicroprocessorrdquo tomorrow we generally mean the shaded area of the figure
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Challenges in EducationChallenges in Education
bullChanges in curriculaChanges in curriculabullFundamentalsFundamentalsbullA sort of the challenge we should acceptA sort of the challenge we should accept
ChalengeChalengess in Education in Education
It has often said that Where you stand depends on where you sit
In this context starting from our positions an experiences this is our view concerning the theme How shall we satisfy the long-term educational needs of engineers
How to organize a training of new How to organize a training of new engineers engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then
Is the whole system of engineering education ndash not just the undergraduate curriculum ndash organized to support todayrsquos graduate for the next 40 years
We think not on both counts
Our view amp our experienceOur view amp our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academy and industry) which admittedly has changed more rapidly than same other fields
Changes in curriculaChanges in curricula
It is almost a clicheacute to talk about change ndash so mach so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentalsWhat are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we What kinds of fundamentals we need now ndash some examplesneed now ndash some examples
Information technology (IT) will be embedded in virtually engineered product and process in the future ndash ie the design space for all engineers will include ITDiscrete mathematics not continuous math is the underpinning of ITIt is a new fundamental
Biological materials and process are a bit behind IT in their impact on engineering but they a closing fastThus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of FundamentalsKinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fieldsMore knowledge of the full spectrum will be the fundamental
Engineering is global and is performed in a holistic business contextThe engineer must design under constraints that include global cultural and business contexts and so must understand themThey two are new fundamentals
How to add these new fundamentalsHow to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of What will the character and essence of electrical and computer engineering electrical and computer engineering
education look like in the future education look like in the future
It is difficult to predict the future with any accuracy but it is safe to say that
Web-based teachingdistance learningelectronic books and
interactive learning environments
will play increasingly significant roles in shaping what we teach how we teach and how students learn
A sort of challenge we should acceptA sort of challenge we should accept
During one visit at our faculty Prof Krishna Shenai from University of Illinois of Chicago director of Micro Systems Research Center says to us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
bull Efficient methodologies and technologies for the design of high-throughput and low-power digital systems are needed
bull The main interest of many researches is now oriented towards lowering the energy dissipation of these systems while still maintaining the high-throughput in real time processing
Needs for Low-Power
Low-Power Design Low-Power Design Techniques Techniques
The basic idea is
Decreasing activity of the some parts within VLSI IC
The term power manager refer to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low active conditions for various parts of the circuit and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-active components
a) Reduction in fCLK is an option acceptable when
some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way for
power reduction since the power is proportional to the square of Vdd The problem with reducing
Vdd is that it leads to an increase in circuit delay
c) The product ptCL is called the average switched
capacitance per cycle and the main directions for reducing this capacitance are done at system- architectural- RTL- circuit- or technology level
General Approaches to Reduce PowerGeneral Approaches to Reduce Power
Low Power and Low Energy System DesignLow Power and Low Energy System Design
higher impactmore options
AlgorithmLevel
ArchitectureLevel
CircuitLevel
Process DeviceLevel
SystemLevel Design partitioning Power Down
Complexity Concurrency LocalityRegularity Data representation
Voltage scaling ParallelismInstruction set Signal correlations
Transistor sizing Logic optimizationActivity Driven Power Down Low-swing logic Adiabatic switching
Threshold Reduction Multi-threshold
The design of low power circuits can be tackled at different levels from system to technology
Multiple Frequencies on the Chip as a Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention.
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL generates fCLK and feeds four local PLL/DLL pairs (PLL 1/DLL 1 to PLL 4/DLL 4), each producing its own local clock frequencies (f11, f12, f21, f31, f32, f41).]
Energy Minimization Using Multiple Frequency
[Figure, PLL based: phase detector (CLKREF and CLKFB inputs, up/down outputs), current pump, loop filter, VCO, and divider by N; a regulated voltage, clock distribution, and frequency multiplier feed the logic of the digital system.]
[Figure, DLL based: phase detector (CLKREF and CLKFB inputs, up/down outputs), current pump, and loop filter control a voltage-controlled delay line (VCDL, delay TVCDL) built from delay cells DC1, DC2, ..., DCn, producing f1, 2f1, ..., nf1 for the digital system.]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure, clock gating: a latch on the enable signal combined with the clock produces the gated clock that drives the target flip-flops (DFF with D, C, and Q).]
[Figure, clock distribution: a PLL (clock generator) drives Clk to modules A, B, C, and D; Enable_A, Enable_B, and Enable_C mark branches as activated or deactivated.]
- The use of gated clocks is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
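Latch-based clock gating can be sketched behaviorally as follows. This is a minimal illustration, not a circuit-accurate model: it assumes the latch is transparent while the clock is low and that switching energy is roughly proportional to the number of rising clock edges a module sees. The function names and the sample traces are hypothetical.

```python
def gated_clock(clock, enable):
    """Return the gated clock for sampled clock/enable traces (lists of 0/1).

    The latch is transparent while clock is low, so enable is captured then
    and held while clock is high; gated clock = clock AND latched enable.
    """
    latched = False
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = bool(en)
        out.append(clk & int(latched))
    return out


def rising_edges(signal):
    """Count 0 -> 1 transitions, assuming the signal starts low."""
    previous = [0] + signal[:-1]
    return sum(1 for prev, cur in zip(previous, signal) if prev == 0 and cur == 1)


clock = [0, 1, 0, 1, 0, 1, 0, 1]   # four clock cycles
enable = [1, 1, 0, 0, 0, 0, 1, 1]  # module is idle in the two middle cycles
gated = gated_clock(clock, enable)

print("free-running edges:", rising_edges(clock))
print("gated edges:       ", rising_edges(gated))
```

In this trace the gated module receives two of the four clock edges, so its flip-flops do not toggle during the idle cycles.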
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip is a less aggressive approach that is attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
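The voltage assignment idea in the bullets above can be sketched as a simple selection rule: give each module the lowest supply voltage whose delay still fits the timing budget. The sketch makes strong simplifying assumptions: every module is treated as its own path with the same cycle-time budget, switching energy scales with Vdd², and the module names, voltage levels, and delays are hypothetical.

```python
def assign_voltages(modules, budget_ns):
    """Pick, per module, the lowest Vdd whose delay meets the timing budget.

    modules maps a module name to {vdd_volts: delay_ns}.
    Raises ValueError if a module cannot meet the budget at any level.
    """
    assignment = {}
    for name, delays in modules.items():
        feasible = [vdd for vdd, delay in delays.items() if delay <= budget_ns]
        if not feasible:
            raise ValueError(f"{name} cannot meet {budget_ns} ns at any voltage")
        assignment[name] = min(feasible)
    return assignment


# Hypothetical delays (ns) at three supply levels.
modules = {
    "alu":  {3.3: 10.0, 2.5: 14.0, 1.8: 22.0},  # critical path
    "ctrl": {3.3: 4.0, 2.5: 6.0, 1.8: 9.0},     # noncritical path
}
chosen = assign_voltages(modules, budget_ns=10.0)

# Relative switching energy versus running everything at 3.3 V (E ~ Vdd^2).
relative_energy = {name: (vdd / 3.3) ** 2 for name, vdd in chosen.items()}
print(chosen)
print(relative_energy)
```

Here the critical module stays at the highest voltage while the noncritical one drops to the lowest level, which is the behavior the second bullet describes.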
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure, power state machine: states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt), and SLEEP (P = 0.16 mW, wait for wake-up event); transition times are labeled ~10 µs, ~90 µs, and 160 ms.]
[Figure, power manager: an OBSERVER collects workload information and observations from the SYSTEM, and a CONTROLLER issues commands to it.]
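The benefit of moving between power states can be estimated with a small energy calculation. This sketch assumes the state powers in the figure read 400 mW (RUN), 50 mW (IDLE), and 0.16 mW (SLEEP); it ignores transition time and transition energy, and the two schedules are hypothetical.

```python
# State powers in milliwatts, as read from the power state machine figure.
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}


def energy_mj(schedule):
    """Energy in millijoules for a list of (state, seconds) intervals.

    Transition costs between states are ignored in this sketch.
    """
    return sum(POWER_MW[state] * seconds for state, seconds in schedule)


# Hypothetical 10-second window with only 2 seconds of real work.
always_on = energy_mj([("RUN", 10.0)])
managed = energy_mj([("RUN", 2.0), ("IDLE", 3.0), ("SLEEP", 5.0)])

print(f"always RUN: {always_on:.1f} mJ")
print(f"managed:    {managed:.1f} mJ")
```

A real power manager also has to account for wake-up latency, which is why the controller needs workload observations before deciding to enter a deeper state.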
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure, power breakdown: pie chart with sectors Clock, Memory, Control, I/O, and Datapath.]
[Figure, dynamic instruction statistics: pie chart with sectors Arithmetic op, Data Move, Control Flow, Compare op, Logical op, and Others.]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Microprocessors' Generations
• First Generation 1971-78 – Behind the power curve
• Second Generation 1979-85 – Becoming "real" computers
• Third Generation 1985-89 – Challenging the "establishment"
• Fourth Generation 1990- – Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation 1971-78
Getting enough bits and transistors
Transistor counts < 50,000
Performance < 0.5 MIPS
Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
First general-purpose single-chip microprocessor
Shipped in 1971
8-bit architecture, 4-bit implementation
2300 transistors
Performance < 0.1 MIPS
8008: 8-bit implementation in 1972
– 3500 transistors
– First microprocessor-based computer (Micral)
  Targeted at laboratory instrumentation
  Mostly sold in Europe
Intel 8080
8-bit architecture with 16-bit addressing
– Delivered in 1974
– 4800 transistors
– Performance < 0.2 MIPS
Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
  $297, or $395 with case
  256 bytes of memory, expandable to 64K
  Keyboard and floppy
  100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
  Homebrew Computer Club
Intel 8086
Introduced in 1978
– Performance < 0.5 MIPS
New 16-bit architecture
– "Assembly language" compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
In 1981 IBM introduces the PC
– Based on the 8088, an 8-bit bus version of the 8086
Second Generation 1979-85
Becoming "real" computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs, and PCs based on microprocessors
Transistors > 50,000
Performance <= 1 MIPS
Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
Major architectural step in microprocessors
– First 32-bit architecture
  Initial 16-bit implementation
– First flat 32-bit address
  Support for paging
– General-purpose register architecture
  Loosely based on PDP-11
First implementation in 1979
– 68,000 transistors
– < 1 MIPS
Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the "establishment"
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
Transistors < 500K
Performance > 5 MIPS
Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation 1990-
Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
Transistors > 1M
Clock rates > 100 MHz
Performance > 50 MIPS
Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
Generation 4.5 – same basic approach but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
Increase performance at 1.6x per year
– True from 1985-present
Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to high clock rates
– More transistors: architectural ideas turn transistors into performance
  Responsible for about half the yearly performance growth
Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism
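To get a feel for what a sustained yearly improvement means, the compounding can be worked out directly. The sketch assumes the growth figure on this slide is 1.6x per year; the function name and the chosen time spans are only illustrative.

```python
def cumulative_gain(rate_per_year, years):
    """Total performance multiple after compounding a yearly growth rate."""
    return rate_per_year ** years


# Assumed rate of 1.6x per year, compounded over a few illustrative spans.
for years in (1, 5, 10):
    print(f"{years:2d} years -> {cumulative_gain(1.6, years):8.1f}x")
```

Compounding is why even a modest-looking yearly factor produces roughly two orders of magnitude over a decade.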
Memory Hierarchies
Caches hide latency of DRAM and increase bandwidth
– CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching
  First multilevel cache: 1986
  Secondary cache sizes today: 128 KB to 4-16 MB
Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
  Early restart after cache miss (1992)
  Nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
  Prefetching: instruction to bring data into cache early
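How much latency a cache hierarchy hides can be estimated with the usual average memory access time (AMAT) recurrence: each level costs its hit time plus its miss rate times the cost of the next level. The hit times, miss rates, and DRAM latency below are hypothetical values chosen for illustration, not measurements from the slides.

```python
def amat(levels, memory_cycles):
    """Average memory access time in cycles.

    levels is a list of (hit_time_cycles, miss_rate) pairs ordered from the
    cache closest to the CPU outward; memory_cycles is the DRAM latency.
    """
    total = memory_cycles
    for hit_time, miss_rate in reversed(levels):
        total = hit_time + miss_rate * total
    return total


dram_only = amat([], memory_cycles=50)
one_level = amat([(2, 0.05)], memory_cycles=50)
two_level = amat([(2, 0.05), (5, 0.20)], memory_cycles=50)

print(f"no cache: {dram_only:.2f} cycles")
print(f"L1 only:  {one_level:.2f} cycles")
print(f"L1 + L2:  {two_level:.2f} cycles")
```

Adding the second level helps because most first-level misses are then served in a few cycles instead of paying the full DRAM latency.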
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
  Superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
– Determine parallelism dynamically
– Execute instructions out-of-order
– Speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded 20 year path
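The effect of issuing multiple instructions per clock can be illustrated with a toy scheduler. This is a simplified sketch, not a model of any real processor: every instruction has unit latency, registers are assumed to be renamed (so only true data dependences matter), and each instruction greedily takes the earliest cycle that has a free issue slot. The instruction sequence and register names are hypothetical.

```python
def issue_cycles(instrs, width):
    """Assign an issue cycle to each (dest, sources) instruction.

    An instruction may issue once all of its source values are available and
    the chosen cycle still has fewer than `width` instructions in it.
    """
    ready_at = {}  # register -> first cycle in which its value is available
    used = {}      # cycle -> number of instructions already issued there
    cycles = []
    for dest, sources in instrs:
        cycle = max((ready_at.get(reg, 0) for reg in sources), default=0)
        while used.get(cycle, 0) >= width:
            cycle += 1
        used[cycle] = used.get(cycle, 0) + 1
        ready_at[dest] = cycle + 1  # unit latency
        cycles.append(cycle)
    return cycles


program = [
    ("r1", ["a", "b"]),
    ("r2", ["c", "d"]),
    ("r3", ["r1", "r2"]),  # depends on the first two instructions
    ("r4", ["e", "f"]),    # independent of r3
    ("r5", ["r3", "r4"]),
]

single = issue_cycles(program, width=1)
dual = issue_cycles(program, width=2)
print("single issue:", single, "->", max(single) + 1, "cycles")
print("dual issue:  ", dual, "->", max(dual) + 1, "cycles")
```

The dual-issue run finishes in fewer cycles only because the program contains independent instructions; a fully dependent chain would gain nothing from the extra issue slot.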
MIPS R4000
First 64-bit architecture
Integrated caches
– On-chip
– Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
Implemented in 1991
– 1.3M transistors
– 50 MIPS
Used primarily as attached processor (e.g., graphics)
MIPS R10000
First speculative processor
– Instructions scheduled and executed out-of-order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in-flight)
– Maintains precise state by completing instructions in order
Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
EPIC architecture
– Uses compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software Techniques
– Static scheduling
– Static issue (i.e., VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
Trade-offs: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e., superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
Trade-offs: more stable performance, higher complexity, potential clock rate impact
Wide variety of approaches, both hardware and compiler intensive
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure, ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation.]
[Figure, Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers, and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year (1970 to 2005), marking the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000 across the eras of bit-level, instruction-level, and thread-level (?) parallelism.]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
They support 2-4 threads, but expect only a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute
– Independent programs
– Explicitly parallel programs (when possible)
– Speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view on the question: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But as we said earlier, engineering is changing.
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that web-based teaching, distance learning, electronic books, and interactive learning environments will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Low-Power Design Low-Power Design Techniques Techniques
The basic idea is
Decreasing activity of the some parts within VLSI IC
The term power manager refer to such techniques in general
Applying power management to a design typically involves two steps
a) identifying idle or low active conditions for various parts of the circuit and
b) redesigning the circuits in order to eliminate or decrease switching activity in idle or low-active components
a) Reduction in fCLK is an option acceptable when
some components may be idle or low-active during operation
b) Reduction in Vdd is the most effective way for
power reduction since the power is proportional to the square of Vdd The problem with reducing
Vdd is that it leads to an increase in circuit delay
c) The product ptCL is called the average switched
capacitance per cycle and the main directions for reducing this capacitance are done at system- architectural- RTL- circuit- or technology level
General Approaches to Reduce PowerGeneral Approaches to Reduce Power
Low Power and Low Energy System DesignLow Power and Low Energy System Design
higher impactmore options
AlgorithmLevel
ArchitectureLevel
CircuitLevel
Process DeviceLevel
SystemLevel Design partitioning Power Down
Complexity Concurrency LocalityRegularity Data representation
Voltage scaling ParallelismInstruction set Signal correlations
Transistor sizing Logic optimizationActivity Driven Power Down Low-swing logic Adiabatic switching
Threshold Reduction Multi-threshold
The design of low power circuits can be tackled at different levels from system to technology
Multiple Frequency on the Chip as Multiple Frequency on the Chip as Technique to Reduce PowerTechnique to Reduce Power
Less aggressive approach is which attracts more attention
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
f11 f12
PLL 1 DLL 1
PLL
fCLK
f21 f21
PLL 2 DLL 2
f31 f32
PLL 3 DLL 3
f41 f41
PLL 4 DLL 4
f11 f12
PLL 1 DLL 1
PLL
fCLK
f21 f21
PLL 2 DLL 2
f31 f32
PLL 3 DLL 3
f41 f41
PLL 4 DLL 4
Energy Minimization Using Energy Minimization Using Multiple FrequencyMultiple Frequency
phasedetector
curentpump
divider by N
up
down
CLKREF
CLKFB
VCO
loopfilter
digital system
regulated voltage
clock distribution ampfrequency multiplier
logic
phasedetector
curentpump
up
down
CLKREF
CLKFBloopfilter
digital system
f1 2f1 nf1
control
VCDL
TVCDL
DC1 DC2 DCn
in out
PLL based
DLL based
Clock GatingClock Gating amp Clock Distribution as amp Clock Distribution as Techniques to Reduce PowerTechniques to Reduce Power
DFF
D
C
Q
enable
clock
gated-clock
target flip-flops
latch
clock
enable
gated-clock
activated deactivated
D C
BA
Enable_BEnable_A
Enable_C
PLL(Clk generator)
Clk
Clock distributionClock gating
- The use of gated clock is the most common approach to reduce energy Unused modules are turned off by suppressing the clock to the module
Energy Minimazation Using Energy Minimazation Using Multiple Supply VoltageMultiple Supply Voltage
bull Multiple supply voltage on the chip as less aggressive approach is attracting attention
bull This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
bull This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management System Level Dynamic Power Management as another Techniques to Reduce Poweras another Techniques to Reduce Power
Dynamic power management is design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
RUN
SLEEPIDLE~90s
~10s 160ms
Wait for interrupt Wait for wake-up event
P=400mW
~90s~10s
P=50mW P=016mW
OBSERVER CONTROLLER
Workloadinformation
Power Manager
SYSTEM
Observations Commands
Power Manager Power State Machine
Power Breakdown in High-Performance Power Breakdown in High-Performance CPU and Dynamic Instruction StatisticsCPU and Dynamic Instruction Statistics
12
16
25
43
Clock
Memory
Control IODatapath
1513
51
23
43
Compare op
Logical op
Others
Data Move
Control Flow
Arithmetic op
Power breakdown Dynamic instruction statistics
Architecture Trade-offs ndash Architecture Trade-offs ndash Reference DatapathReference Datapath
Parallel DatapathParallel Datapath
The More Parallel the BetterThe More Parallel the Better
Pipeline DatapathPipeline Datapath
Architecture Summary for a SimpleArchitecture Summary for a Simple
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Microprocessorsrsquo Microprocessorsrsquo GenerationsGenerations
bullFirst generation 1971-78
bullSecond Generation 1979-85
bullThird Generation 1985-89
bullFourth Generation 1990-
ndashBehind the power curve
ndashBecoming ldquorealrdquo computers
ndashChallenging the ldquoestablishmentrdquo
ndashArchitectural and performance leadership
The microprocessor todayThe microprocessor today When we say ldquomicroprocessorrdquo today we generally mean the shaded area of the figure
The First Generation 1971-78The First Generation 1971-78
Getting enough bits and transistors Transistor counts lt 50000 Performance lt 05 MIPS Architecture 8-16 bits
ndash Narrow datapaths (= slow performance)ndash Awkward architecturesndash Assembly language + some BASIC
Processorsndash Intel 4004 8008 8080 8086ndash Zilog Z-80ndash Motorola 6800 6502
Intel 4004Intel 4004 First general-purpose
single-chip microprocessor Shipped in 1971 8-bit architecture 4-bit
implementation 2300 transistors Performance lt 01 MIPS 8008 8-bit implementation
in 1972ndash 3500 transistorsndash First microprocessor-based
computer (Micral) Targeted at laboratory
instrumentation Mostly sold in Europe
Intel 8080Intel 8080 Intelrsquos first 16-bit architecture
ndash Delivered in 1974ndash 4800 transistorsndash Performance lt 02 MIPS
Used in Altair 8800 system ndash Kit form (advertised in Popular
Electronics) in 1975 $297 or $395 with case 256 bytes of memory
expandable to 64K Keyboard and floppy 100-line bus becomes S-100
first microcomputer busndash Gates amp Allen write BASICndash Wozniak builds one
Homebrew Computer Club
Intel 8086Intel 8086
Introduced in 1978ndash Performance lt 05 MIPS
New 16-bit architecturendash ldquoAssembly languagerdquo
compatible with 8080ndash 29000 transistorsndash Includes memory protection
support for FP coprocessor In 1981 IBM introduces
PC ndash Based on 8088--8-bit bus
version of 8086
Second Generation 1979-85Second Generation 1979-85 Becoming ldquorealrdquo computers
ndash First 32-bit architecture (68000)ndash First virtual memory supportndash Workstations Macs and PCs based on microprocessors
Transistors gt50000 Performance lt= 1 MIPS Processors
ndash Motorola 68000 68020ndash Intel 80286 80386
Motorola 68000Motorola 68000 Major architectural step in
microprocessorsndash First 32-bit architecture
initial 16-bit implementation
ndash First flat 32-bit address Support for paging
ndash General-purpose register architecture
Loosely based on PDP-11
First implementation in 1979ndash 68000 transistorsndash lt 1 MIPS
Used inndash Apple Macndash Sun Silicon Graphics amp Apollo
workstations
Third Generation 1985-89Third Generation 1985-89 Challenging the ldquoestablishmentrdquo
ndash Microprocessors surpass minicomputers in performance rival mainframes
ndash Implementation technology of choice all new architectures are microprocessors
ndash RISC architecture techniques take hold Transistors lt 500K Performance gt 5 MIPS Processors
ndash MIPS R2000 R3000ndash Sun SPARCndash HP PA-RISC
MIPS R2000MIPS R2000 Several firsts
ndash First RISC microprocessor
ndash First microprocessor to provide integrated support for instruction amp data cache
ndash First pipelined microprocessor (sustains 1 instructionclock)
Implemented in 1985ndash 125000 transistorsndash 5-8 MIPS
Fourth Generation 1990-Fourth Generation 1990- Architectural and performance leadership
ndash First 64-bit architecturendash First multiple-issue machinendash First multilevel caches
Transistors gt1M Clock ratesgt 100MHz Performance gt 50 MIPS Processors
ndash Intel i860 Pentium MIPS R4000 MIPS R1000 DEC Alpha Sun UltraSPARC HP PA-RISC PowerPC
Generation 45 ndash same basic approach but faster clock rates amp wider issuendash Alpha 21264 Pentium III amp 4 Intel Itanium
Key Architectural TrendsKey Architectural Trends Increase performance at 16x per year
ndash True from 1985-present Combination of technology and architectural
enhancementsndash Technology provides faster transistors and more of themndash Faster transistors leads to high clock ratesndash More transistors
Architectural ideas turn transistors into performancendash Responsible for about half the yearly performance growth
Two key architectural directionsndash Sophisticated memory hierarchiesndash Exploiting instruction level parallelism
Memory HierarchiesMemory Hierarchies Caches hide latency of DRAM and increase BW
ndash CPU-DRAM access gap has grown by a factor of 30-50 Trend 1 Increasingly large caches
ndash On-chip from 128 bytes (1984) to 100K+ bytesndash Multilevel caches add another level of caching
First multilevel cache1986 Secondary cache sizes today 128KB to 4-16 MB
Trend 2 Advances in caching techniquesndash Reduce or hide cache miss latencies
early restart after cache miss (1992) nonblocking caches continue during a cache miss (1994)
ndash Cache aware combos computers compilers code writers prefetching instruction to bring data into cache early
Exploiting ILPExploiting ILP ILP is the implicit parallelism among instructions Exploited by
ndash Overlapping execution in a pipeline ndash Issuing multiple instruction per clock
superscalar uses dynamic issue decision (HW driven) VLIW uses static issue decision (SW driven)
1985 simple microprocessor pipeline (1 instrclock) 1990 first static multiple issue microprocessors 1995 sophisticated dynamic schemes
ndash determine parallelism dynamicallyndash execute instructions out-of-orderndash speculative execution depending on branch prediction
ldquoOff-the-shelfrdquo ILP techniques yielded 20 year path
MIPS R4000MIPS R4000 First 64-bit architecture Integrated caches
ndash On-chipndash Support for off-chip secondary
cache Integrated floating point Implemented in 1991
ndash Deep pipelinendash 14M transistorsndash Initially 100MHzndash gt 50 MIPS
Intel i860Intel i860
First multiple issue microprocessor ndash 2 instructionsclockndash Dual issue mode ndash Novel push pipelinendash Novel cache bypass
Implemented in 1991ndash 13M transistorsndash 50 mips
Used primarily as attached processor (eg graphics)
MIPS R10000MIPS R10000
First speculative processorndash Instruction scheduled and
executed out-of-orderndash Up to 4 instructions can
complete per clockndash Window of 32 instructions
(up to 32 in-flight)ndash Maintain precise state by
completing instructions in order
Implemented in 1996ndash 68M transistorsndash 200 MHz
Intel IA-64 and ItaniumIntel IA-64 and Itanium
EPIC architecturendash Use compiler centric approach
while avoiding disadvantagesndash Parallelism demarcated by the
compilerndash Many special instruction amp
features for exploiting ILP in the compiler
Itaniumndash First implementation (2001) ndash 25 M transistorsndash 800 MHzndash 130 Watts
Breakdown of tasks between compiler and runtime hardware
[Figure: breakdown of tasks between compiler and runtime hardware]
Today's Uniprocessor ILP Menu
• Software techniques
  - Static scheduling
  - Static issue (i.e., VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
  Lower hardware complexity; more longer-range analysis; more machine dependence
• Hardware techniques
  - Dynamic scheduling
  - Dynamic issue (i.e., superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
  More stable performance; higher complexity; potential clock rate impact
• Wide variety of approaches, both hardware- and compiler-intensive
• No clear-cut winners at present
Big Picture - ILP and Memory Systems
[Figure: "ILP Mountain" with stages simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation; "Cache Mountain" with stages simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers, and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistor count (log scale, 1,000 to 100,000,000) versus year, 1970-2005, for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000; eras labeled bit-level parallelism, instruction-level, and thread-level (?)]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
• A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
• It is hard to achieve this level of parallelism from a single program
• Can we run multiple programs (threads) on a single processor without much effort?
• Simultaneous multithreading (SMT), or Hyper-Threading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
• Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4)
• Support for 2-4 threads, but expect only about a 1.3X improvement in throughput
Chip Multiprocessor
• Several processor cores on one die
• Shared L2 caches
• Chip communication to build multichip modules with many CMPs + memory
Chip multiprocessor (CMP) platform model
• CMP is a simple, very powerful technique for obtaining more performance in a power-efficient manner
• The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC)
• The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model - continued
• CMP is an attractive option when moving to a new process technology such as SoC
• Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
• MPSoCs are usually implemented as heterogeneous systems
• CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa-2010 Microprocessor
• 4-8 general-purpose processing engines on chip, used to execute
  - independent programs
  - explicitly parallel programs (when possible)
  - speculatively parallel threads
• Special-purpose processing units (e.g., DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
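To make the cache and memory access times in the table more concrete, the sketch below estimates an average memory access time for a two-level hierarchy. Only the cycle counts (2 or 1, 5 or 7, and 50) come from the table; the miss rates are hypothetical, and treating each access time as the extra penalty paid at that level is an assumption.

```python
def average_access_time(l1_cycles, l2_cycles, mem_cycles, l1_miss, l2_miss):
    """Average cycles per memory access for a two-level cache hierarchy.

    l1_miss and l2_miss are local miss rates (hypothetical inputs);
    every level's latency is added only when the level above misses.
    """
    return l1_cycles + l1_miss * (l2_cycles + l2_miss * mem_cycles)

# Cycle counts from the table; 5% and 20% miss rates are made-up examples.
superscalar = average_access_time(2, 5, 50, l1_miss=0.05, l2_miss=0.20)
cmp_core = average_access_time(1, 7, 50, l1_miss=0.05, l2_miss=0.20)
```

With these made-up miss rates the superscalar and SMT configurations average 2.75 cycles per access and a CMP core averages 1.85. In practice the CMP's smaller 16-Kbyte caches would likely miss more often, so the same miss rate for both is a simplification.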
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
• It has often been said that "Where you stand depends on where you sit"
• In this context, starting from our positions and experiences, this is our view on the theme "How shall we satisfy the long-term educational needs of engineers?"
How to organize the training of new engineers
• The engineers we are training today will still be practicing 40 years from now
• Are we preparing them for what they will be doing then?
• Is the whole system of engineering education - not just the undergraduate curriculum - organized to support today's graduates for the next 40 years?
• We think not, on both counts
Our view & our experience
• Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
• Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
• It is almost a cliché to talk about change - so much so that a passing reference to it becomes a substitute for serious thought about its implications
• But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
• The undergraduate curriculum should teach (only) fundamentals
• Everyone agrees with that
• But what are fundamentals?
• Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
• But, as we said earlier, engineering is changing
What kinds of fundamentals we need now - some examples
• Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental
• Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
• Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental
• Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. These too are new fundamentals
How to add these new fundamentals
• The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
• We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching,
• distance learning,
• electronic books, and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn
A sort of challenge we should accept
• During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time
• That is the sort of challenge we should accept for improving engineering education
General Approaches to Reduce Power
a) Reducing fCLK is an acceptable option when some components may be idle or only lightly active during operation
b) Reducing Vdd is the most effective way to reduce power, since power is proportional to the square of Vdd. The problem with reducing Vdd is that it increases circuit delay
c) The product pt · CL is called the average switched capacitance per cycle; it can be reduced at the system, architectural, RTL, circuit, or technology level
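The three options can be compared numerically. The sketch below assumes the standard CMOS dynamic power relation P = pt · CL · Vdd² · fCLK, which matches the quantities named in items a) to c); the function name and all operating-point values are hypothetical.

```python
def dynamic_power(p_t, c_l, v_dd, f_clk):
    """Average dynamic power in watts, assuming P = p_t * C_L * Vdd^2 * f_CLK.

    p_t   - switching activity (probability of a transition per cycle)
    c_l   - load capacitance in farads
    v_dd  - supply voltage in volts
    f_clk - clock frequency in hertz
    """
    return p_t * c_l * v_dd ** 2 * f_clk

# Hypothetical operating point: 20% activity, 1 nF, 3.3 V, 100 MHz.
baseline = dynamic_power(0.2, 1e-9, 3.3, 100e6)
half_clock = dynamic_power(0.2, 1e-9, 3.3, 50e6)      # option a)
half_supply = dynamic_power(0.2, 1e-9, 1.65, 100e6)   # option b)
half_activity = dynamic_power(0.1, 1e-9, 3.3, 100e6)  # option c)
```

Halving the clock or the switched capacitance halves the power, while halving Vdd cuts it to one quarter, which is why item b) calls supply reduction the most effective lever. The model ignores the delay penalty of a lower Vdd and any static (leakage) power.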
Low Power and Low Energy System Design
• System level: design partitioning, power down
• Algorithm level: complexity, concurrency, locality, regularity, data representation
• Architecture level: voltage scaling, parallelism, instruction set, signal correlations
• Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
• Process/device level: threshold reduction, multi-threshold
[Figure annotation: higher impact, more options]
The design of low-power circuits can be tackled at different levels, from system to technology
Multiple Frequency on the Chip as Technique to Reduce Power
• Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention
• This technique is routinely used in VLSI ICs in order to reduce the power dissipation while maintaining the operating ...
[Figure: a central PLL driven by fCLK feeds four local PLL/DLL pairs (PLL 1/DLL 1 through PLL 4/DLL 4), each producing its own local clock frequencies (labeled f11, f12 through f41)]
Energy Minimization Using Multiple Frequency
[Figure: PLL-based clock generation (labels: phase detector, current pump, loop filter, VCO, divider by N, up/down, CLKREF, CLKFB, regulated voltage, clock distribution & frequency multiplier logic, digital system)]
[Figure: DLL-based clock generation (labels: phase detector, current pump, loop filter, up/down, CLKREF, CLKFB, control, voltage-controlled delay line (VCDL) with delay cells DC1 to DCn and delay TVCDL, outputs f1, 2f1, ..., nf1, digital system)]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating (labels: enable, clock, latch, gated clock, target D flip-flops; activated and deactivated intervals)]
[Figure: clock distribution (labels: PLL clock generator, Clk, modules A, B, C, D, Enable_A, Enable_B, Enable_C)]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module
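A small behavioral model can show how suppressing the clock removes switching activity. The sketch below assumes a common latch-based gating cell (the enable is captured while the clock is low and then ANDed with the clock); this particular circuit, the function names, and the waveforms are illustrative assumptions rather than details from the slide.

```python
def gated_clock(clock, enable):
    """Return the gated clock for sampled clock/enable waveforms (lists of 0/1)."""
    latched = 0
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            # The latch is transparent only while the clock is low, so a
            # change of enable during the high phase cannot cause a glitch.
            latched = en
        out.append(clk & latched)
    return out

def rising_edges(wave):
    """Count 0 -> 1 transitions, i.e. the clock edges a module actually sees."""
    return sum(1 for prev, cur in zip(wave, wave[1:]) if prev == 0 and cur == 1)

clock = [0, 1] * 6                                # six clock periods
enable = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]     # module idle for two periods
gated = gated_clock(clock, enable)
```

In this example the free-running clock has six rising edges and the gated clock has four, so the idle module's flip-flops are not clocked during the two periods in which it is disabled.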
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip is a less aggressive approach that is attracting attention
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
• This scheme tends to result in smaller area overhead compared to parallel architectures
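The saving from a second, lower supply can be estimated with a few lines of arithmetic. The sketch below assumes that the energy drawn from the supply per charging event is C · Vdd²; the module capacitances, the two voltage levels, and the function names are hypothetical, and the cost of level shifters between voltage domains is ignored.

```python
def switching_energy(cap_f, vdd):
    """Energy in joules drawn from the supply to charge cap_f farads to vdd volts."""
    return cap_f * vdd ** 2

def total_energy(modules, v_high, v_low):
    """Sum switching energy; critical-path modules run at v_high, the rest at v_low.

    modules is a list of (capacitance_in_farads, on_critical_path) pairs.
    """
    total = 0.0
    for cap_f, on_critical_path in modules:
        vdd = v_high if on_critical_path else v_low
        total += switching_energy(cap_f, vdd)
    return total

# Hypothetical design: one critical module and two noncritical ones.
modules = [(2e-12, True), (3e-12, False), (5e-12, False)]
single_supply = total_energy(modules, v_high=3.3, v_low=3.3)
dual_supply = total_energy(modules, v_high=3.3, v_low=2.0)
```

For these made-up values the dual-supply design uses a little under half the switching energy of the single-supply one, while the critical module still runs at the full 3.3 V and keeps its timing.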
System Level Dynamic Power Management as Another Technique to Reduce Power
• Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
[Figure: power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW), and SLEEP (P = 0.16 mW); transition times labeled ~10 µs, ~90 µs, and 160 ms; transitions labeled "wait for interrupt" and "wait for wake-up event"]
[Figure: power manager, in which an observer collects workload information (observations) from the system and a controller issues commands to it]
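A power manager's basic decision can be sketched as choosing the lowest-power state whose wake-up delay the workload can tolerate. The power values below follow the figure (read as 400 mW, 50 mW, and 0.16 mW); which transition each delay belongs to is an assumption, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PowerState:
    name: str
    power_mw: float   # power drawn while in this state
    wakeup_s: float   # assumed delay to get back to RUN

STATES = [
    PowerState("RUN", 400.0, 0.0),
    PowerState("IDLE", 50.0, 10e-6),     # assumed: ~10 us back to RUN
    PowerState("SLEEP", 0.16, 160e-3),   # assumed: 160 ms back to RUN
]

def pick_state(max_wakeup_s):
    """Lowest-power state whose wake-up delay fits the latency budget."""
    candidates = [s for s in STATES if s.wakeup_s <= max_wakeup_s]
    return min(candidates, key=lambda s: s.power_mw)

def energy_mj(state, seconds):
    """Energy in millijoules spent resting in `state` (mW * s = mJ)."""
    return state.power_mw * seconds
```

With a 1 ms latency budget this policy selects IDLE, and with a 1 s budget it selects SLEEP. A real power manager would also account for the energy spent during the transitions themselves, which this sketch leaves out.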
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: power breakdown pie chart with segments for clock, memory, control, I/O, and datapath (values shown: 12, 16, 25, 43)]
[Figure: dynamic instruction statistics pie chart with segments for compare, logical, data move, control flow, arithmetic, and other operations (values shown: 15, 13, 5, 1, 23, 43)]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First generation, 1971-78
  - Behind the power curve
• Second generation, 1979-85
  - Becoming "real" computers
• Third generation, 1985-89
  - Challenging the "establishment"
• Fourth generation, 1990-
  - Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation: 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800; MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3,500 transistors
  - First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• 8-bit architecture with a 16-bit address bus
  - Delivered in 1974
  - 4,800 transistors
  - Performance < 0.2 MIPS
• Used in the Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
    Homebrew Computer Club
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981, IBM introduces the PC
  - Based on the 8088, an 8-bit-bus version of the 8086
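The transistor counts quoted on these slides can be turned into a doubling period, in the spirit of Moore's law. The sketch below uses the 4004 (2,300 transistors, 1971) and the 8086 (29,000 transistors, 1978) as given in the text; the function name is hypothetical and the calculation assumes smooth exponential growth between the two data points.

```python
import math

def doubling_time_months(count_start, year_start, count_end, year_end):
    """Months per doubling, assuming steady exponential growth between two chips."""
    doublings = math.log2(count_end / count_start)
    return 12 * (year_end - year_start) / doublings

# Figures from the slides: Intel 4004 (1971) and Intel 8086 (1978).
first_generation = doubling_time_months(2_300, 1971, 29_000, 1978)
```

For these two chips the result is roughly 23 months per doubling, in line with the "22 to 24 months" range mentioned earlier in the deck.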
Second Generation: 1979-85
• Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture; initial 16-bit implementation
  - First flat 32-bit address; support for paging
  - General-purpose register architecture, loosely based on the PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation: 1985-89
• Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data caches
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation: 1990-
• Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Performance increases at 1.6x per year
  - True from 1985 to the present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors: architectural ideas turn transistors into performance
    Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction-level parallelism
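The effect of a steady yearly improvement compounds quickly. The sketch below reads the slide's growth figure as 1.6x per year (an assumption about the garbled number) and shows the resulting cumulative gain and doubling time; the function names are hypothetical.

```python
import math

def performance_after(years, rate=1.6):
    """Relative performance after `years` of compounding at `rate` per year."""
    return rate ** years

def years_to_double(rate=1.6):
    """Time for performance to double at a constant yearly growth rate."""
    return math.log(2) / math.log(rate)

decade_gain = performance_after(10)
doubling = years_to_double()
```

At 1.6x per year, performance grows by a factor of about 110 over a decade and doubles roughly every year and a half.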
Memory Hierarchies
• Caches hide the latency of DRAM and increase bandwidth
  - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
    early restart after cache miss (1992)
    nonblocking caches: continue during a cache miss (1994)
  - Cache-aware combos: computers, compilers, code writers; prefetching: instruction to bring data into cache early
Low Power and Low Energy System Design
[Figure: design levels and their low-power techniques, with an arrow marking higher impact and more options]
System level: design partitioning, power down
Algorithm level: complexity, concurrency, locality, regularity, data representation
Architecture level: voltage scaling, parallelism, instruction set, signal correlations
Circuit level: transistor sizing, logic optimization, activity-driven power down, low-swing logic, adiabatic switching
Process/device level: threshold reduction, multi-threshold
The design of low-power circuits can be tackled at different levels, from system to technology.
Multiple Frequency on the Chip as Technique to Reduce Power
Using multiple frequencies on the chip is a less aggressive approach that is attracting more attention.
This technique is standardly used in VLSI ICs in order to reduce the power dissipation while maintaining the operating
[Figure: a central PLL distributes fCLK to four local PLL/DLL pairs (PLL 1/DLL 1 through PLL 4/DLL 4), each generating local clock frequencies (f11, f12, and so on)]
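Energy Minimization Using Multiple Frequency
[Figure, PLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO, and divider by N in the feedback path; regulated voltage, clock distribution & frequency multiplier, and logic inside the digital system]
[Figure, DLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, and control driving a voltage-controlled delay line (VCDL, delay TVCDL) built from delay cells DC1, DC2, ..., DCn between in and out; outputs f1, 2f1, ..., nf1 to the digital system]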
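The role of the "divider by N" block in the PLL-based scheme can be illustrated with a small numerical sketch. It relies on the standard property that a locked loop drives the fed-back clock (CLKFB, the VCO output divided by N) to equal the reference (CLKREF), so the VCO settles at N times the reference frequency. The function names, the loop gain, and the first-order update rule are illustrative assumptions, not a model of any particular circuit in the slides.

```python
def pll_locked_frequency(f_ref_hz, n):
    """Frequency a locked PLL settles at when the feedback path divides by n."""
    return n * f_ref_hz


def simulate_pll(f_ref_hz, n, gain=0.5, steps=50, f_start_hz=0.0):
    """Toy first-order loop (assumed model): nudge the VCO until CLKFB matches CLKREF."""
    f_vco = f_start_hz
    for _ in range(steps):
        f_fb = f_vco / n              # divider by N produces CLKFB
        error = f_ref_hz - f_fb       # phase detector and pump, reduced to a frequency error
        f_vco += gain * n * error     # loop filter and VCO, reduced to one gain term
    return f_vco


def clock_family(f1_hz, n):
    """The multiples f1, 2*f1, ..., n*f1 labelled in the DLL-based figure."""
    return [k * f1_hz for k in range(1, n + 1)]


if __name__ == "__main__":
    print(simulate_pll(100e6, 4))      # approaches 400 MHz
    print(clock_family(100e6, 3))
```

Each module can then be clocked from the lowest multiple that still meets its timing, which is the point of having several frequencies on one chip.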
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure, clock gating: a D flip-flop (DFF, pins D, C, Q) whose clock input is a gated clock formed from clock and enable; a latch-based variant in which a latch holds enable before it is combined with clock to drive the target flip-flops]
[Figure, clock distribution: a PLL (Clk generator) drives Clk to modules A, B, C, and D through Enable_A, Enable_B, and Enable_C; modules are shown as activated or deactivated]
- The use of gated clocks is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
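A minimal behavioural sketch of the latch-based gate named in the figure labels follows. It assumes the common arrangement in which the latch is transparent while the clock is low and the latched enable is ANDed with the clock, so a change of enable during the high phase cannot produce a glitch. The sampled waveforms and function names are hypothetical.

```python
def gated_clock(clock, enable):
    """Latch-based clock gate: sample enable while clock is low, then AND with clock."""
    latched = 0
    out = []
    for clk, en in zip(clock, enable):
        if clk == 0:
            latched = en          # latch is transparent during the low phase
        out.append(clk & latched)
    return out


def rising_edges(wave):
    """Count 0 -> 1 transitions, i.e. the clock events the flip-flops actually see."""
    return sum(1 for prev, cur in zip(wave, wave[1:]) if prev == 0 and cur == 1)


if __name__ == "__main__":
    clock = [0, 1] * 8                 # eight clock cycles
    enable = [1] * 6 + [0] * 10        # module needed only for the first three cycles
    gated = gated_clock(clock, enable)
    print(rising_edges(clock), rising_edges(gated))
```

In this toy trace the target flip-flops receive three clock edges instead of eight; the suppressed edges are switching activity that the clock network and the flip-flops no longer pay for.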
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip, as a less aggressive approach, is attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
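The saving from a second supply can be estimated with the usual first-order model in which dynamic switching energy scales as C * Vdd^2. The sketch below is only an illustration: the module names, capacitances, and the 1.8 V / 1.2 V pair are made-up values, and level-shifter overhead between the voltage domains is ignored.

```python
def switching_energy_j(cap_f, vdd_v):
    """First-order dynamic energy of one charge/discharge cycle: C * Vdd^2."""
    return cap_f * vdd_v ** 2


def chip_energy_j(modules, vdd_high_v, vdd_low_v):
    """Critical-path modules run at vdd_high_v, all other modules at vdd_low_v."""
    total = 0.0
    for _name, cap_f, on_critical_path in modules:
        vdd = vdd_high_v if on_critical_path else vdd_low_v
        total += switching_energy_j(cap_f, vdd)
    return total


# Hypothetical modules: (name, switched capacitance in farads, on critical path?)
MODULES = [("alu", 2e-12, True), ("control", 1e-12, False), ("io", 1e-12, False)]

single = chip_energy_j(MODULES, 1.8, 1.8)   # one supply for everything
dual = chip_energy_j(MODULES, 1.8, 1.2)     # lower supply off the critical path
saving = 1.0 - dual / single

if __name__ == "__main__":
    print(f"energy saved with two supplies: {saving:.1%}")
```

Because of the quadratic dependence, dropping the noncritical half of the switched capacitance from 1.8 V to 1.2 V saves a little over a quarter of the energy in this example while the critical path keeps its full-voltage timing.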
System Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
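[Figure, power state machine: states RUN (P = 400 mW), IDLE (P = 50 mW), and SLEEP (P = 0.16 mW); transitions labelled ~10 µs, ~90 µs, and 160 ms; IDLE waits for an interrupt, SLEEP waits for a wake-up event]
[Figure, power manager: an observer and a controller exchange workload information; the power manager receives observations from the system and sends commands to it]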
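The decision a power manager faces can be sketched with the state powers from the figure, read here as 400 mW (RUN), 50 mW (IDLE), and 0.16 mW (SLEEP). Which wake-up delay belongs to which state (10 µs from IDLE, 160 ms from SLEEP) and the choice to charge wake-up time at RUN power are assumptions of this sketch, as are all names in it.

```python
# State powers in milliwatts, as read from the figure.
POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}
# Assumed time in seconds to get back to RUN from each state.
WAKEUP_S = {"RUN": 0.0, "IDLE": 10e-6, "SLEEP": 160e-3}


def idle_period_energy_mj(state, idle_s):
    """Energy in mJ to wait idle_s seconds in `state` and then return to RUN.

    Wake-up time is charged at RUN power (an assumption of this sketch).
    """
    return POWER_MW[state] * idle_s + POWER_MW["RUN"] * WAKEUP_S[state]


def best_state(idle_s):
    """State that minimises energy for an idle period of known length."""
    return min(POWER_MW, key=lambda s: idle_period_energy_mj(s, idle_s))


if __name__ == "__main__":
    for idle in (1e-6, 0.1, 10.0):
        print(idle, best_state(idle))
```

Under these assumptions the deepest state only pays off for idle periods longer than roughly 1.3 s, which is why a controller needs workload observations rather than always sleeping as soon as the system goes quiet.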
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown: clock, memory, control, I/O, datapath. Dynamic instruction statistics: compare op, logical op, data move, control flow, arithmetic op, others]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline - Microprocessors’ Generations
• First Generation 1971-78 - Behind the power curve
• Second Generation 1979-85 - Becoming “real” computers
• Third Generation 1985-89 - Challenging the “establishment”
• Fourth Generation 1990- - Architectural and performance leadership
The microprocessor today
When we say “microprocessor” today, we generally mean the shaded area of the figure.
The First Generation 1971-78
Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3,500 transistors
  - First microprocessor-based computer (Micral)
    - Targeted at laboratory instrumentation
    - Mostly sold in Europe
Intel 8080
• Intel’s first architecture with 16-bit addressing (8-bit data)
  - Delivered in 1974
  - 4,800 transistors
  - Performance < 0.2 MIPS
• Used in Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    - $297, or $395 with case
    - 256 bytes of memory, expandable to 64K
    - Keyboard and floppy
    - 100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
    - Homebrew Computer Club
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - “Assembly language” compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
• In 1981, IBM introduces PC
  - Based on 8088 (8-bit bus version of 8086)
Second Generation 1979-85
Becoming “real” computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture (initial 16-bit implementation)
  - First flat 32-bit address; support for paging
  - General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the “establishment”
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation 1990-
Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5 - same basic approach but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985 to present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors: architectural ideas turn transistors into performance
    - Responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction-level parallelism
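To get a feel for what the yearly growth factor implies, the short calculation below compounds it over a decade and derives the doubling time. It reads the slide's figure as 1.6x per year. Treating "about half" as an even multiplicative split between technology and architecture (each contributing the square root of the yearly factor) is one possible interpretation and is an assumption of this sketch.

```python
import math


def compound(rate_per_year, years):
    """Total speedup after `years` of growth at `rate_per_year`."""
    return rate_per_year ** years


def doubling_time_years(rate_per_year):
    """Years needed for performance to double at a constant yearly factor."""
    return math.log(2) / math.log(rate_per_year)


def even_split(rate_per_year):
    """Yearly factor of each of two equal multiplicative contributors (assumed split)."""
    return math.sqrt(rate_per_year)


if __name__ == "__main__":
    print(f"10-year speedup: {compound(1.6, 10):.0f}x")
    print(f"doubling time: {doubling_time_years(1.6):.2f} years")
    print(f"per-contributor yearly factor: {even_split(1.6):.3f}")
```

At this rate performance doubles roughly every year and a half, and ten years compound to about two orders of magnitude.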
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
  - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches: add another level of caching
    - First multilevel cache: 1986
    - Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
    - early restart after cache miss (1992)
    - nonblocking caches: continue during a cache miss (1994)
  - Cache-aware combos: computers, compilers, code writers
    - prefetching: instruction to bring data into cache early
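Why an extra cache level hides DRAM latency can be shown with the standard average memory access time (AMAT) recurrence: each level costs its hit time plus its miss rate times the cost of the level below. The hit times, miss rates, and memory latency used here are invented round numbers for illustration, not measurements from the slides.

```python
def amat(hit_times, miss_rates, memory_time):
    """Average memory access time for a cache hierarchy, in cycles.

    hit_times[i] and miss_rates[i] describe cache level i, closest to the CPU first.
    """
    cost = memory_time
    # Fold from the level nearest memory back up to the level nearest the CPU.
    for hit, miss in zip(reversed(hit_times), reversed(miss_rates)):
        cost = hit + miss * cost
    return cost


if __name__ == "__main__":
    # Hypothetical numbers: L1 hits in 1 cycle and misses 5% of the time,
    # L2 hits in 10 cycles and misses 20% of the time, DRAM takes 100 cycles.
    print(amat([1], [0.05], 100))              # single-level hierarchy
    print(amat([1, 10], [0.05, 0.2], 100))     # with a secondary cache
```

With these example values the secondary cache cuts the average access time from 6 cycles to 2.5 cycles, even though DRAM itself is no faster.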
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock
    - superscalar: uses dynamic issue decision (HW driven)
    - VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  - determine parallelism dynamically
  - execute instructions out-of-order
  - speculative execution depending on branch prediction
• “Off-the-shelf” ILP techniques yielded 20-year path
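How much multiple issue helps depends on the dependences between instructions. The toy scheduler below is a sketch under simplifying assumptions: unit latency, no resource limits other than issue width, out-of-order issue allowed, and a hypothetical six-instruction dependence graph. It counts cycles for different issue widths.

```python
def schedule_cycles(deps, issue_width):
    """Cycles to issue all instructions with a greedy, dependence-respecting issuer.

    deps[i] is the set of earlier instruction indices that instruction i needs.
    Every instruction has unit latency; issue_width must be at least 1.
    """
    done_cycle = {}
    remaining = list(range(len(deps)))
    cycle = 0
    while remaining:
        cycle += 1
        issued = []
        for i in remaining:
            if len(issued) == issue_width:
                break
            # Ready only if every producer finished in an earlier cycle.
            if all(done_cycle.get(d, cycle) < cycle for d in deps[i]):
                issued.append(i)
        for i in issued:
            done_cycle[i] = cycle
        remaining = [i for i in remaining if i not in issued]
    return cycle


# Hypothetical program: i2 needs i0 and i1, i4 needs i3, i5 needs i2 and i4.
DEPS = [set(), set(), {0, 1}, set(), {3}, {2, 4}]

if __name__ == "__main__":
    for width in (1, 2, 4):
        print(width, schedule_cycles(DEPS, width))
```

Going from one to two instructions per clock shortens this example from 6 to 4 cycles, but a width of four only reaches 3 cycles, the length of the longest dependence chain, which is the limit on the available ILP.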
MIPS R4000
• First 64-bit architecture
• Integrated caches
  - On-chip
  - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as attached processor (e.g., graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out-of-order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in-flight)
  - Maintain precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Use compiler-centric approach while avoiding disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today’s Uniprocessor ILP Menu
• Software Techniques
  - Static scheduling
  - Static issue (i.e., VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
  Lower hardware complexity; more longer-range analysis; more machine dependence
• Hardware Techniques
  - Dynamic scheduling
  - Dynamic issue (i.e., superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
  More stable performance; higher complexity; potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: “ILP Mountain” with stages simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation; “Cache Mountain” with stages simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers, and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year (1970 to 2005) for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000; successive eras are labelled bit-level parallelism, instruction-level, and thread-level]
Microprocessors: where they go
Intel: More Transistors
Intel: Faster Devices
Number of Transistors in Intel’s processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a (single) processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today’s processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
Support for 2-4 threads, but expect to get only 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model - continued
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute independent programs
Explicitly parallel programs (when possible)
Speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures

Characteristic                                          | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs                                          | 1           | 1                           | 8
CPU issue width                                         | 12          | 12                          | 2 per CPU
Number of threads                                       | 1           | 8                           | 1 per CPU
Architecture registers (for integer and floating point) | 32          | 32 per thread               | 32 per CPU
Physical registers (for integer and floating point)     | 32 + 256    | 256 + 256                   | 32 + 32 per CPU
Instruction window size                                 | 256         | 256                         | 32 per CPU
Branch predictor table size (entries)                   | 32,768      | 32,768                      | 8 x 4,096
Return stack size                                       | 64 entries  | 64 entries                  | 8 x 8 entries
Instruction (I) and data (D) cache organization         | 1 x 8 banks | 1 x 8 banks                 | 1 bank
I and D cache sizes                                     | 128 Kbytes  | 128 Kbytes                  | 16 Kbytes per CPU
I and D cache associativities                           | 4-way       | 4-way                       | 4-way
I and D cache line sizes (bytes)                        | 32          | 32                          | 32
I and D cache access times (cycles)                     | 2           | 2                           | 1
Secondary cache organization                            | 1 x 8 banks | 1 x 8 banks                 | 1 x 8 banks
Secondary cache size (Mbytes)                           | 8           | 8                           | 8
Secondary cache associativity                           | 4-way       | 4-way                       | 4-way
Secondary cache line size (bytes)                       | 32          | 32                          | 32
Secondary cache access time (cycles)                    | 5           | 5                           | 7
Secondary cache occupancy per access (cycles)           | 1           | 1                           | 1
Memory organization (no. of banks)                      | 4           | 4                           | 4
Memory access time (cycles)                             | 50          | 50                          | 50
Memory occupancy per access (cycles)                    | 13          | 13                          | 13
The microprocessor tomorrow
When we say “microprocessor” tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that “Where you stand depends on where you sit.”
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today’s graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. These two are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
Web-based teaching, distance learning, electronic books, and
interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Microprocessorsrsquo Microprocessorsrsquo GenerationsGenerations
bullFirst generation 1971-78
bullSecond Generation 1979-85
bullThird Generation 1985-89
bullFourth Generation 1990-
ndashBehind the power curve
ndashBecoming ldquorealrdquo computers
ndashChallenging the ldquoestablishmentrdquo
ndashArchitectural and performance leadership
The microprocessor todayThe microprocessor today When we say ldquomicroprocessorrdquo today we generally mean the shaded area of the figure
The First Generation 1971-78The First Generation 1971-78
Getting enough bits and transistors Transistor counts lt 50000 Performance lt 05 MIPS Architecture 8-16 bits
ndash Narrow datapaths (= slow performance)ndash Awkward architecturesndash Assembly language + some BASIC
Processorsndash Intel 4004 8008 8080 8086ndash Zilog Z-80ndash Motorola 6800 6502
Intel 4004Intel 4004 First general-purpose
single-chip microprocessor Shipped in 1971 8-bit architecture 4-bit
implementation 2300 transistors Performance lt 01 MIPS 8008 8-bit implementation
in 1972ndash 3500 transistorsndash First microprocessor-based
computer (Micral) Targeted at laboratory
instrumentation Mostly sold in Europe
Intel 8080Intel 8080 Intelrsquos first 16-bit architecture
ndash Delivered in 1974ndash 4800 transistorsndash Performance lt 02 MIPS
Used in Altair 8800 system ndash Kit form (advertised in Popular
Electronics) in 1975 $297 or $395 with case 256 bytes of memory
expandable to 64K Keyboard and floppy 100-line bus becomes S-100
first microcomputer busndash Gates amp Allen write BASICndash Wozniak builds one
Homebrew Computer Club
Intel 8086Intel 8086
Introduced in 1978ndash Performance lt 05 MIPS
New 16-bit architecturendash ldquoAssembly languagerdquo
compatible with 8080ndash 29000 transistorsndash Includes memory protection
support for FP coprocessor In 1981 IBM introduces
PC ndash Based on 8088--8-bit bus
version of 8086
Second Generation 1979-85Second Generation 1979-85 Becoming ldquorealrdquo computers
ndash First 32-bit architecture (68000)ndash First virtual memory supportndash Workstations Macs and PCs based on microprocessors
Transistors gt50000 Performance lt= 1 MIPS Processors
ndash Motorola 68000 68020ndash Intel 80286 80386
Motorola 68000Motorola 68000 Major architectural step in
microprocessorsndash First 32-bit architecture
initial 16-bit implementation
ndash First flat 32-bit address Support for paging
ndash General-purpose register architecture
Loosely based on PDP-11
First implementation in 1979ndash 68000 transistorsndash lt 1 MIPS
Used inndash Apple Macndash Sun Silicon Graphics amp Apollo
workstations
Third Generation 1985-89Third Generation 1985-89 Challenging the ldquoestablishmentrdquo
ndash Microprocessors surpass minicomputers in performance rival mainframes
ndash Implementation technology of choice all new architectures are microprocessors
ndash RISC architecture techniques take hold Transistors lt 500K Performance gt 5 MIPS Processors
ndash MIPS R2000 R3000ndash Sun SPARCndash HP PA-RISC
MIPS R2000MIPS R2000 Several firsts
ndash First RISC microprocessor
ndash First microprocessor to provide integrated support for instruction amp data cache
ndash First pipelined microprocessor (sustains 1 instructionclock)
Implemented in 1985ndash 125000 transistorsndash 5-8 MIPS
Fourth Generation 1990-Fourth Generation 1990- Architectural and performance leadership
ndash First 64-bit architecturendash First multiple-issue machinendash First multilevel caches
Transistors gt1M Clock ratesgt 100MHz Performance gt 50 MIPS Processors
ndash Intel i860 Pentium MIPS R4000 MIPS R1000 DEC Alpha Sun UltraSPARC HP PA-RISC PowerPC
Generation 45 ndash same basic approach but faster clock rates amp wider issuendash Alpha 21264 Pentium III amp 4 Intel Itanium
Key Architectural TrendsKey Architectural Trends Increase performance at 16x per year
ndash True from 1985-present Combination of technology and architectural
enhancementsndash Technology provides faster transistors and more of themndash Faster transistors leads to high clock ratesndash More transistors
Architectural ideas turn transistors into performancendash Responsible for about half the yearly performance growth
Two key architectural directionsndash Sophisticated memory hierarchiesndash Exploiting instruction level parallelism
Memory HierarchiesMemory Hierarchies Caches hide latency of DRAM and increase BW
ndash CPU-DRAM access gap has grown by a factor of 30-50 Trend 1 Increasingly large caches
ndash On-chip from 128 bytes (1984) to 100K+ bytesndash Multilevel caches add another level of caching
First multilevel cache1986 Secondary cache sizes today 128KB to 4-16 MB
Trend 2 Advances in caching techniquesndash Reduce or hide cache miss latencies
early restart after cache miss (1992) nonblocking caches continue during a cache miss (1994)
ndash Cache aware combos computers compilers code writers prefetching instruction to bring data into cache early
Exploiting ILPExploiting ILP ILP is the implicit parallelism among instructions Exploited by
ndash Overlapping execution in a pipeline ndash Issuing multiple instruction per clock
superscalar uses dynamic issue decision (HW driven) VLIW uses static issue decision (SW driven)
1985 simple microprocessor pipeline (1 instrclock) 1990 first static multiple issue microprocessors 1995 sophisticated dynamic schemes
ndash determine parallelism dynamicallyndash execute instructions out-of-orderndash speculative execution depending on branch prediction
ldquoOff-the-shelfrdquo ILP techniques yielded 20 year path
MIPS R4000MIPS R4000 First 64-bit architecture Integrated caches
ndash On-chipndash Support for off-chip secondary
cache Integrated floating point Implemented in 1991
ndash Deep pipelinendash 14M transistorsndash Initially 100MHzndash gt 50 MIPS
Intel i860Intel i860
First multiple issue microprocessor ndash 2 instructionsclockndash Dual issue mode ndash Novel push pipelinendash Novel cache bypass
Implemented in 1991ndash 13M transistorsndash 50 mips
Used primarily as attached processor (eg graphics)
MIPS R10000MIPS R10000
First speculative processorndash Instruction scheduled and
executed out-of-orderndash Up to 4 instructions can
complete per clockndash Window of 32 instructions
(up to 32 in-flight)ndash Maintain precise state by
completing instructions in order
Implemented in 1996ndash 68M transistorsndash 200 MHz
Intel IA-64 and ItaniumIntel IA-64 and Itanium
EPIC architecturendash Use compiler centric approach
while avoiding disadvantagesndash Parallelism demarcated by the
compilerndash Many special instruction amp
features for exploiting ILP in the compiler
Itaniumndash First implementation (2001) ndash 25 M transistorsndash 800 MHzndash 130 Watts
Breakdown of tasks between Breakdown of tasks between compiler and runtime hardwarecompiler and runtime hardware
Todayrsquos Uniprocessor ILP Todayrsquos Uniprocessor ILP MenuMenu
Software Techniquesndash Static schedulingndash Static issue (ie VLIW)ndash Static branch predictionndash Aliaspointer analysisndash Static speculation
Hardware Techniquesndash Dynamic schedulingndash Dynamic issue (ie superscalar)ndash Dynamic branch predictionndash Dynamic disambiguationndash Dynamic speculation
Wide variety of approaches both hardware and compiler intensive
Lower hardware complexity
More longer range analysis
More machine dependence
Lower hardware complexity
More longer range analysis
More machine dependence
More stable performance
Higher complexity
Potential clock rate impact
More stable performance
Higher complexity
Potential clock rate impact
No clear cut winners at the present
Big Picture: ILP and Memory Systems
[Figure: "ILP mountain" (simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation) and "cache mountain" (simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching)]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (0.1 to 100, log scale) versus year, 1965-1995, for supercomputers, mainframes, minicomputers, and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (1,000 to 100,000,000, log scale) versus year, 1970-2005, spanning eras of bit-level, instruction-level, and thread-level (?) parallelism; data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4).
They support 2-4 threads, but expect only about a 1.3x improvement in throughput.
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa-2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute
  - independent programs
  - explicitly parallel programs (when possible)
  - speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic                                            | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs                                            | 1           | 1                           | 8
CPU issue width                                           | 12          | 12                          | 2 per CPU
Number of threads                                         | 1           | 8                           | 1 per CPU
Architecture registers (for integer and floating point)   | 32          | 32 per thread               | 32 per CPU
Physical registers (for integer and floating point)       | 32 + 256    | 256 + 256                   | 32 + 32 per CPU
Instruction window size                                   | 256         | 256                         | 32 per CPU
Branch predictor table size (entries)                     | 32768       | 32768                       | 8x4096
Return stack size                                         | 64 entries  | 64 entries                  | 8x8 entries
Instruction (I) and data (D) cache organization           | 1x8 banks   | 1x8 banks                   | 1 bank
I and D cache sizes                                       | 128 kbytes  | 128 kbytes                  | 16 kbytes per CPU
I and D cache associativities                             | 4-way       | 4-way                       | 4-way
I and D cache line sizes (bytes)                          | 32          | 32                          | 32
I and D cache access times (cycles)                       | 2           | 2                           | 1
Secondary cache organization                              | 1x8 banks   | 1x8 banks                   | 1x8 banks
Secondary cache size (Mbytes)                             | 8           | 8                           | 8
Secondary cache associativity                             | 4-way       | 4-way                       | 4-way
Secondary cache line size (bytes)                         | 32          | 32                          | 32
Secondary cache access time (cycles)                      | 5           | 5                           | 7
Secondary cache occupancy per access (cycles)             | 1           | 1                           | 1
Memory organization (no. of banks)                        | 4           | 4                           | 4
Memory access time (cycles)                               | 50          | 50                          | 50
Memory occupancy per access (cycles)                      | 13          | 13                          | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduate for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
  - Web-based teaching
  - distance learning
  - electronic books and
  - interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Energy Minimization Using Multiple Frequency
[Figure: two clock-generation schemes feeding a digital system. PLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, VCO, divider by N, regulated voltage, clock distribution & frequency multiplier, logic. DLL based: phase detector (inputs CLKREF and CLKFB, outputs up/down), current pump, loop filter, control, VCDL (delay TVCDL) with stages DC1, DC2, ..., DCn between in and out, producing f1, 2f1, ..., nf1.]
Clock Gating & Clock Distribution as Techniques to Reduce Power
[Figure: clock gating (enable and clock combined through a latch into a gated clock that drives the target D flip-flops, shown in activated and deactivated phases) and clock distribution (a PLL clock generator driving Clk to modules A, B, C, and D, with Enable_A, Enable_B, and Enable_C)]
- The use of a gated clock is the most common approach to reducing energy. Unused modules are turned off by suppressing the clock to the module.
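As a rough illustration of why suppressing the clock saves energy, the sketch below compares clock energy with and without gating under a simple switched-capacitance model (energy per clocked cycle taken as C·V²). The module names echo the A, B, C labels of the clock-distribution figure; the capacitances, supply voltage, cycle count, and enable fractions are hypothetical values chosen only for illustration.

```python
# Minimal sketch of clock-gating savings under a switched-capacitance model.
# All numbers below are hypothetical; none are taken from the slides.

def clock_energy(modules, vdd, cycles, gated):
    """Return clock energy in joules for a set of modules.

    modules: dict mapping name -> (clock_load_farads, enabled_fraction)
    gated:   if True, a module's clock toggles only while the module is enabled;
             if False, every module is clocked on every cycle.
    """
    total = 0.0
    for load, enabled_fraction in modules.values():
        clocked_cycles = cycles * (enabled_fraction if gated else 1.0)
        total += load * vdd ** 2 * clocked_cycles  # C * V^2 per clocked cycle
    return total


MODULES = {
    "A": (10e-12, 1.00),  # always busy
    "B": (10e-12, 0.25),  # busy a quarter of the time
    "C": (20e-12, 0.50),  # busy half of the time
}

ungated = clock_energy(MODULES, vdd=1.8, cycles=1_000_000, gated=False)
gated = clock_energy(MODULES, vdd=1.8, cycles=1_000_000, gated=True)
savings = 1.0 - gated / ungated  # fraction of clock energy saved by gating

print(f"ungated: {ungated:.3e} J  gated: {gated:.3e} J  saved: {savings:.1%}")
```

The model ignores the energy of the gating latch itself and any leakage, so it gives an upper bound on the saving rather than a prediction.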
Energy Minimization Using Multiple Supply Voltage
• Using multiple supply voltages on the chip, as a less aggressive approach, is attracting attention.
• This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption).
• This scheme tends to result in smaller area overhead compared to parallel architectures.
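The benefit of moving noncritical modules to a lower supply can be estimated with the usual assumption that switching energy scales with the square of the supply voltage. The sketch below is only a first-order estimate: the function name, the 30% critical-path share, and the 1.8 V / 1.2 V supplies are hypothetical, and the cost of level converters between voltage domains is ignored.

```python
# First-order estimate of dual-supply energy, assuming switching energy ~ C * V^2.
# The example voltages and critical-path share are hypothetical.

def relative_energy(fraction_critical, v_high, v_low):
    """Energy of a dual-supply design relative to running everything at v_high.

    fraction_critical: share of switched capacitance that must stay at v_high.
    """
    if not 0.0 <= fraction_critical <= 1.0:
        raise ValueError("fraction_critical must be between 0 and 1")
    if not 0.0 < v_low <= v_high:
        raise ValueError("expected 0 < v_low <= v_high")
    noncritical = 1.0 - fraction_critical
    return fraction_critical + noncritical * (v_low / v_high) ** 2


# 30% of the switched capacitance on critical paths at 1.8 V, the rest at 1.2 V.
ratio = relative_energy(0.30, v_high=1.8, v_low=1.2)
print(f"dual-supply energy is {ratio:.1%} of the single-supply energy")
```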
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure: power manager, made of an observer and a controller, which receives workload information and observations from the system and issues commands to it]
[Figure: power state machine with states RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt), and SLEEP (P = 0.16 mW, wait for wake-up event); transition times are labelled ~10 µs, ~90 µs, and 160 ms]
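A power manager has to decide whether an idle period is long enough to justify entering SLEEP. The sketch below works this out for the three state powers shown in the power state machine. Several things are assumptions rather than statements from the slides: the "~90" label is read as microseconds and attached to the transition into SLEEP, 160 ms is taken as the wake-up time, and transitions are assumed to draw RUN power. The function names are hypothetical.

```python
# Sketch of a sleep-versus-idle decision for a RUN/IDLE/SLEEP power state machine.
P_RUN = 0.400      # W, state power from the figure
P_IDLE = 0.050     # W, state power from the figure
P_SLEEP = 0.00016  # W, state power from the figure (0.16 mW)

T_ENTER_SLEEP = 90e-6  # s, ASSUMED: "~90" read as microseconds, entering SLEEP
T_WAKE = 160e-3        # s, ASSUMED: 160 ms read as the SLEEP -> RUN wake-up time
T_TRANSITION = T_ENTER_SLEEP + T_WAKE


def energy_if_idle(period_s):
    """Energy in joules when the system simply waits in IDLE for the whole period."""
    return P_IDLE * period_s


def energy_if_sleep(period_s, p_transition=P_RUN):
    """Energy in joules when the system goes to SLEEP and wakes up again.

    ASSUMPTION: both transitions draw p_transition (RUN power by default).
    The full transition cost is paid even if the period is shorter than it.
    """
    resident = max(period_s - T_TRANSITION, 0.0)  # time actually spent asleep
    return p_transition * T_TRANSITION + P_SLEEP * resident


def break_even_period(p_transition=P_RUN):
    """Idle period in seconds above which sleeping costs less energy than idling."""
    return T_TRANSITION * (p_transition - P_SLEEP) / (P_IDLE - P_SLEEP)


t_be = break_even_period()
print(f"break-even idle period: {t_be:.3f} s")
for period in (0.5, 10.0):
    better = "SLEEP" if energy_if_sleep(period) < energy_if_idle(period) else "IDLE"
    print(f"idle for {period:4.1f} s -> prefer {better}")
```

Under these assumptions the long wake-up time dominates, so SLEEP pays off only for idle periods much longer than the wake-up latency itself.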
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown by clock, memory, control, I/O, and datapath. Dynamic instruction statistics by arithmetic op, data move, control flow, logical op, compare op, and others.]
Architecture Trade-offs - Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Microprocessors' Generations
• First generation: 1971-78
  - Behind the power curve
• Second generation: 1979-85
  - Becoming "real" computers
• Third generation: 1985-89
  - Challenging the "establishment"
• Fourth generation: 1990-
  - Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation: 1971-78
Getting enough bits and transistors
Transistor counts < 50,000
Performance < 0.5 MIPS
Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, 6502
Intel 4004
First general-purpose single-chip microprocessor
Shipped in 1971
8-bit architecture, 4-bit implementation
2300 transistors
Performance < 0.1 MIPS
8008: 8-bit implementation in 1972
  - 3500 transistors
  - First microprocessor-based computer (Micral)
      Targeted at laboratory instrumentation
      Mostly sold in Europe
Intel 8080
Intel's first 16-bit architecture
  - Delivered in 1974
  - 4800 transistors
  - Performance < 0.2 MIPS
Used in Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
      $297, or $395 with case
      256 bytes of memory, expandable to 64K
      Keyboard and floppy
      100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one
      Homebrew Computer Club
Intel 8086
Introduced in 1978
  - Performance < 0.5 MIPS
New 16-bit architecture
  - "Assembly language" compatible with 8080
  - 29,000 transistors
  - Includes memory protection, support for FP coprocessor
In 1981, IBM introduces the PC
  - Based on the 8088, an 8-bit bus version of the 8086
Second Generation: 1979-85
Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs, and PCs based on microprocessors
Transistors > 50,000
Performance <= 1 MIPS
Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
Major architectural step in microprocessors
  - First 32-bit architecture
      initial 16-bit implementation
  - First flat 32-bit address
      Support for paging
  - General-purpose register architecture
      Loosely based on PDP-11
First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation: 1985-89
Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
Transistors < 500K
Performance > 5 MIPS
Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data cache
  - First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation: 1990-
Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
Transistors > 1M
Clock rates > 100 MHz
Performance > 50 MIPS
Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
Generation 4.5: same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
Increase performance at 1.6x per year
  - True from 1985-present
Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to high clock rates
  - More transistors
Architectural ideas turn transistors into performance
  - Responsible for about half the yearly performance growth
Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction-level parallelism
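To get a feel for what a 1.6x yearly improvement means when compounded, the short sketch below computes the cumulative speedup and the implied doubling time. The last helper rests on an assumption: it reads "about half the yearly performance growth" as an equal multiplicative split between technology and architecture, which is one possible interpretation rather than something the slide states. All function names are hypothetical.

```python
# Compound-growth arithmetic for a 1.6x-per-year performance trend.
import math

ANNUAL_GROWTH = 1.6  # yearly performance factor quoted on the slide


def cumulative_speedup(years, annual=ANNUAL_GROWTH):
    """Total performance factor after the given number of years."""
    return annual ** years


def doubling_time_years(annual=ANNUAL_GROWTH):
    """Years needed for performance to double at a constant yearly factor."""
    return math.log(2.0) / math.log(annual)


def equal_split_factor(annual=ANNUAL_GROWTH):
    """Yearly factor per contributor if two contributors share growth equally.

    ASSUMPTION: "about half" is read as an equal split of the logarithmic growth,
    so each contributor multiplies performance by sqrt(annual) per year.
    """
    return math.sqrt(annual)


print(f"speedup over 10 years: {cumulative_speedup(10):.1f}x")
print(f"doubling time: {doubling_time_years():.2f} years")
print(f"per-contributor yearly factor under an equal split: {equal_split_factor():.3f}x")
```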
Memory Hierarchies
Caches hide latency of DRAM and increase BW
  - CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches: add another level of caching
      First multilevel cache: 1986
      Secondary cache sizes today: 128 KB to 4-16 MB
Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
      early restart after cache miss (1992)
      nonblocking caches: continue during a cache miss (1994)
  - Cache-aware combos: computers, compilers, code writers
      prefetching: instruction to bring data into cache early
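The way an extra cache level hides DRAM latency can be illustrated with the standard average memory access time (AMAT) recurrence, in which each level contributes its hit time plus its miss rate times the cost of going one level further out. The sketch below is illustrative only: the hit times, miss rates, and memory latency are hypothetical values, not measurements from the slides.

```python
# Average memory access time (AMAT) for a multilevel cache hierarchy.
# Hit times, miss rates, and memory latency below are hypothetical.

def amat(levels, memory_latency):
    """Return the average access time in cycles.

    levels: list of (hit_time_cycles, local_miss_rate), nearest cache first.
    memory_latency: cycles to reach main memory after the last cache misses.
    """
    penalty = memory_latency
    # Fold from the outermost cache inwards: each level adds its own hit time
    # and pays the cost of the next level only on a miss.
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty


one_level = amat([(2, 0.05)], memory_latency=50)
two_level = amat([(2, 0.05), (5, 0.20)], memory_latency=50)
print(f"L1 only: {one_level:.2f} cycles, L1 + L2: {two_level:.2f} cycles")
```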
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock
      superscalar: uses dynamic issue decision (HW driven)
      VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
  - determine parallelism dynamically
  - execute instructions out of order
  - speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded a 20-year path
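The two mechanisms listed above, overlapping execution in a pipeline and issuing several instructions per clock, can be compared with an idealized cycle count. The sketch below assumes no stalls, hazards, or branch mispredictions, so it gives a best-case bound only; the function names and the example figures (100 instructions, 5 stages) are hypothetical.

```python
# Idealized cycle counts for unpipelined, pipelined, and multiple-issue execution.
# Assumes no stalls, hazards, or mispredictions (best case only).
import math


def unpipelined_cycles(n_instructions, stages):
    """Each instruction occupies the whole datapath for `stages` cycles."""
    return n_instructions * stages


def ideal_pipelined_cycles(n_instructions, stages, issue_width=1):
    """Cycles to finish when up to `issue_width` instructions enter per clock."""
    if n_instructions == 0:
        return 0
    issue_slots = math.ceil(n_instructions / issue_width)
    # The first group needs `stages` cycles; each later group finishes one cycle after.
    return stages + issue_slots - 1


n, k = 100, 5
print("unpipelined:      ", unpipelined_cycles(n, k))
print("pipelined:        ", ideal_pipelined_cycles(n, k))
print("dual-issue pipe:  ", ideal_pipelined_cycles(n, k, issue_width=2))
```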
MIPS R4000
First 64-bit architecture
Integrated caches
  - On-chip
  - Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
First multiple-issue microprocessor
  - 2 instructions/clock
  - Dual-issue mode
  - Novel push pipeline
  - Novel cache bypass
Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
Used primarily as an attached processor (e.g. graphics)
Clock GatingClock Gating amp Clock Distribution as amp Clock Distribution as Techniques to Reduce PowerTechniques to Reduce Power
DFF
D
C
Q
enable
clock
gated-clock
target flip-flops
latch
clock
enable
gated-clock
activated deactivated
D C
BA
Enable_BEnable_A
Enable_C
PLL(Clk generator)
Clk
Clock distributionClock gating
- The use of gated clock is the most common approach to reduce energy Unused modules are turned off by suppressing the clock to the module
Energy Minimazation Using Energy Minimazation Using Multiple Supply VoltageMultiple Supply Voltage
bull Multiple supply voltage on the chip as less aggressive approach is attracting attention
bull This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
bull This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management System Level Dynamic Power Management as another Techniques to Reduce Poweras another Techniques to Reduce Power
Dynamic power management is design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
RUN
SLEEPIDLE~90s
~10s 160ms
Wait for interrupt Wait for wake-up event
P=400mW
~90s~10s
P=50mW P=016mW
OBSERVER CONTROLLER
Workloadinformation
Power Manager
SYSTEM
Observations Commands
Power Manager Power State Machine
Power Breakdown in High-Performance Power Breakdown in High-Performance CPU and Dynamic Instruction StatisticsCPU and Dynamic Instruction Statistics
12
16
25
43
Clock
Memory
Control IODatapath
1513
51
23
43
Compare op
Logical op
Others
Data Move
Control Flow
Arithmetic op
Power breakdown Dynamic instruction statistics
Architecture Trade-offs ndash Architecture Trade-offs ndash Reference DatapathReference Datapath
Parallel DatapathParallel Datapath
The More Parallel the BetterThe More Parallel the Better
Pipeline DatapathPipeline Datapath
Architecture Summary for a SimpleArchitecture Summary for a Simple
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Microprocessorsrsquo Microprocessorsrsquo GenerationsGenerations
bullFirst generation 1971-78
bullSecond Generation 1979-85
bullThird Generation 1985-89
bullFourth Generation 1990-
ndashBehind the power curve
ndashBecoming ldquorealrdquo computers
ndashChallenging the ldquoestablishmentrdquo
ndashArchitectural and performance leadership
The microprocessor todayThe microprocessor today When we say ldquomicroprocessorrdquo today we generally mean the shaded area of the figure
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  – Narrow datapaths (= slow performance)
  – Awkward architectures
  – Assembly language + some BASIC
• Processors
  – Intel 4004, 8008, 8080, 8086
  – Zilog Z-80
  – Motorola 6800; MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  – 3,500 transistors
  – First microprocessor-based computer (Micral)
    – Targeted at laboratory instrumentation
    – Mostly sold in Europe
Intel 8080
• 8-bit architecture with 16-bit addressing
  – Delivered in 1974
  – 4,800 transistors
  – Performance < 0.2 MIPS
• Used in Altair 8800 system
  – Kit form (advertised in Popular Electronics) in 1975
    – $297, or $395 with case
    – 256 bytes of memory, expandable to 64K
    – Keyboard and floppy
    – 100-line bus becomes S-100, the first microcomputer bus
  – Gates & Allen write BASIC
  – Wozniak builds one (Homebrew Computer Club)
Intel 8086
• Introduced in 1978
  – Performance < 0.5 MIPS
• New 16-bit architecture
  – "Assembly language" compatible with 8080
  – 29,000 transistors
  – Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
  – Based on the 8088, the 8-bit bus version of the 8086
Second Generation 1979-85
• Becoming "real" computers
  – First 32-bit architecture (68000)
  – First virtual memory support
  – Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  – Motorola 68000, 68020
  – Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  – First 32-bit architecture (initial 16-bit implementation)
  – First flat 32-bit address; support for paging
  – General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
  – 68,000 transistors
  – < 1 MIPS
• Used in
  – Apple Mac
  – Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the "establishment"
  – Microprocessors surpass minicomputers in performance, rival mainframes
  – Implementation technology of choice: all new architectures are microprocessors
  – RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  – MIPS R2000, R3000
  – Sun SPARC
  – HP PA-RISC
MIPS R2000
• Several firsts
  – First RISC microprocessor
  – First microprocessor to provide integrated support for instruction & data cache
  – First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  – 125,000 transistors
  – 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
  – First 64-bit architecture
  – First multiple-issue machine
  – First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  – Intel i860, Pentium; MIPS R4000, MIPS R10000; DEC Alpha; Sun UltraSPARC; HP PA-RISC; PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  – Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  – True from 1985 to present
• Combination of technology and architectural enhancements
  – Technology provides faster transistors and more of them
  – Faster transistors lead to high clock rates
  – More transistors: architectural ideas turn transistors into performance
    – Responsible for about half the yearly performance growth
• Two key architectural directions
  – Sophisticated memory hierarchies
  – Exploiting instruction level parallelism
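To get a feel for what a yearly factor of 1.6 means when compounded, a small Python sketch follows. The 1.6x figure is the one quoted on the slide. Reading "about half the yearly performance growth" as two equal multiplicative shares is an assumption made here for illustration, and the function names are hypothetical.

```python
import math

ANNUAL_GROWTH = 1.6  # performance growth factor per year, as quoted on the slide


def cumulative_speedup(years: float, rate: float = ANNUAL_GROWTH) -> float:
    """Total performance gain after `years` of compounding at `rate` per year."""
    return rate ** years


def doubling_time_years(rate: float = ANNUAL_GROWTH) -> float:
    """Years needed for performance to double at a constant yearly `rate`."""
    return math.log(2) / math.log(rate)


def equal_split_factor(rate: float = ANNUAL_GROWTH) -> float:
    """Yearly factor of each contributor, ASSUMING technology and architecture
    contribute equal multiplicative shares (one possible reading of the slide)."""
    return math.sqrt(rate)


if __name__ == "__main__":
    print(f"15 years at 1.6x/year: about {cumulative_speedup(15):.0f}x")
    print(f"doubling time: {doubling_time_years():.2f} years")
    print(f"equal-share factor: {equal_split_factor():.3f}x per year each")
```

At this rate performance doubles roughly every year and a half, and fifteen years of compounding gives a gain of more than three orders of magnitude.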
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
  – CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  – On-chip: from 128 bytes (1984) to 100K+ bytes
  – Multilevel caches add another level of caching
    – First multilevel cache: 1986
    – Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  – Reduce or hide cache miss latencies
    – Early restart after cache miss (1992)
    – Nonblocking caches: continue during a cache miss (1994)
  – Cache-aware combos: computers, compilers, code writers
    – Prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  – Overlapping execution in a pipeline
  – Issuing multiple instructions per clock
    – Superscalar: uses dynamic issue decision (HW driven)
    – VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple issue microprocessors
• 1995: sophisticated dynamic schemes
  – Determine parallelism dynamically
  – Execute instructions out-of-order
  – Speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded a 20-year path
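The "implicit parallelism among instructions" can be made concrete with a toy dependence analysis. The sketch below assumes an idealized machine (unit latency for every instruction, unlimited issue width, perfect branch prediction), which no real processor provides; it only shows the upper bound that data dependences alone impose. The function names and the sample program are hypothetical.

```python
def schedule_levels(deps: list[list[int]]) -> list[int]:
    """Earliest cycle (1-based) in which each instruction can execute.

    deps[i] lists the indices of earlier instructions whose results
    instruction i needs. Unit latency and unlimited issue width are assumed.
    """
    earliest: list[int] = []
    for i, producers in enumerate(deps):
        if any(p >= i for p in producers):
            raise ValueError("dependences must point to earlier instructions")
        earliest.append(1 + max((earliest[p] for p in producers), default=0))
    return earliest


def available_ilp(deps: list[list[int]]) -> float:
    """Instructions divided by the length of the longest dependence chain."""
    if not deps:
        return 0.0
    return len(deps) / max(schedule_levels(deps))


if __name__ == "__main__":
    # i0: a = x + y    i1: b = z * w    i2: c = a + b
    # i3: d = load p   i4: e = d + 1    i5: f = c + e
    program = [[], [], [0, 1], [], [3], [2, 4]]
    print(schedule_levels(program))
    print(available_ilp(program))
```

In this sample six instructions fit into three cycles, so an ideal machine could sustain two instructions per clock. Whether the hardware (superscalar) or the compiler (VLIW) finds that schedule is exactly the dynamic versus static issue decision named above.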
MIPS R4000
• First 64-bit architecture
• Integrated caches
  – On-chip
  – Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  – Deep pipeline
  – 1.4M transistors
  – Initially 100 MHz
  – > 50 MIPS
Intel i860
• First multiple issue microprocessor
  – 2 instructions/clock
  – Dual issue mode
  – Novel push pipeline
  – Novel cache bypass
• Implemented in 1991
  – 1.3M transistors
  – 50 MIPS
• Used primarily as attached processor (e.g., graphics)
MIPS R10000
• First speculative processor
  – Instructions scheduled and executed out-of-order
  – Up to 4 instructions can complete per clock
  – Window of 32 instructions (up to 32 in-flight)
  – Maintains precise state by completing instructions in order
• Implemented in 1996
  – 6.8M transistors
  – 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  – Use compiler-centric approach while avoiding disadvantages
  – Parallelism demarcated by the compiler
  – Many special instructions & features for exploiting ILP in the compiler
• Itanium
  – First implementation (2001)
  – 25M transistors
  – 800 MHz
  – 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software Techniques
  – Static scheduling
  – Static issue (i.e., VLIW)
  – Static branch prediction
  – Alias/pointer analysis
  – Static speculation
  Lower hardware complexity; more longer-range analysis; more machine dependence
• Hardware Techniques
  – Dynamic scheduling
  – Dynamic issue (i.e., superscalar)
  – Dynamic branch prediction
  – Dynamic disambiguation
  – Dynamic speculation
  More stable performance; higher complexity; potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture – ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers, and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970-2005, with eras marked as bit-level parallelism, instruction-level, and thread-level. Data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000.]
Microprocessors: where they go
Intel: more Transistors
Intel: Faster Devices
Number of Transistors in Intel's processors
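As a rough check of the transistor trend, the doubling period can be computed from two data points. The sketch below uses counts quoted elsewhere in this deck for two Intel parts (about 2,300 transistors for the 4004 in 1971 and about 25 million for the first Itanium in 2001); the function name is hypothetical and the result is only as good as those two figures.

```python
import math


def doubling_period_months(count_start: float, year_start: int,
                           count_end: float, year_end: int) -> float:
    """Average number of months per doubling between two (count, year) points,
    assuming steady exponential growth in between."""
    doublings = math.log2(count_end / count_start)
    return 12.0 * (year_end - year_start) / doublings


if __name__ == "__main__":
    months = doubling_period_months(2_300, 1971, 25_000_000, 2001)
    print(f"4004 (1971) to Itanium (2001): one doubling every {months:.1f} months")
```

With these two points the average comes out at a little over two years per doubling.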
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to achieve higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
They support 2-4 threads, but expect only a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
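How close to linear the scaling gets depends on how much of the workload can actually run in parallel. The slides do not give a model for this; the sketch below swaps in Amdahl's law as a simple stand-in, with the parallel fraction as an assumed parameter and a hypothetical function name.

```python
def amdahl_speedup(cores: int, parallel_fraction: float) -> float:
    """Speedup over one core when `parallel_fraction` of the work is spread
    evenly across `cores` and the remainder runs serially (Amdahl's law)."""
    if cores < 1 or not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("need cores >= 1 and 0 <= parallel_fraction <= 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


if __name__ == "__main__":
    for cores in (2, 4, 8):
        s = amdahl_speedup(cores, parallel_fraction=0.95)  # 0.95 is an assumption
        print(f"{cores} cores: {s:.2f}x speedup, {s / cores:.0%} efficiency")
```

With 95% of the work parallel, four cores give roughly 3.5x and eight cores roughly 5.9x, which illustrates why the near-linear claim is stated for small-scale CMPs. Independent programs, one per core, behave like a parallel fraction close to 1.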
Chip multiprocessor (CMP) platform model – continued
CMP is an attractive option when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
• 4-8 general-purpose processing engines on chip, used to execute
  – Independent programs
  – Explicitly parallel programs (when possible)
  – Speculatively parallel threads
• Special-purpose processing units (e.g., DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures

Characteristic                                            Superscalar    Simultaneous multithreading    Chip multiprocessor
Number of CPUs                                            1              1                              8
CPU issue width                                           12             12                             2 per CPU
Number of threads                                         1              8                              1 per CPU
Architecture registers (for integer and floating point)   32             32 per thread                  32 per CPU
Physical registers (for integer and floating point)       32 + 256       256 + 256                      32 + 32 per CPU
Instruction window size                                   256            256                            32 per CPU
Branch predictor table size (entries)                     32,768         32,768                         8 x 4,096
Return stack size                                         64 entries     64 entries                     8 x 8 entries
Instruction (I) and data (D) cache organization           1 x 8 banks    1 x 8 banks                    1 bank
I and D cache sizes                                       128 Kbytes     128 Kbytes                     16 Kbytes per CPU
I and D cache associativities                             4-way          4-way                          4-way
I and D cache line sizes (bytes)                          32             32                             32
I and D cache access times (cycles)                       2              2                              1
Secondary cache organization                              1 x 8 banks    1 x 8 banks                    1 x 8 banks
Secondary cache size (Mbytes)                             8              8                              8
Secondary cache associativity                             4-way          4-way                          4-way
Secondary cache line size (bytes)                         32             32                             32
Secondary cache access time (cycles)                      5              5                              7
Secondary cache occupancy per access (cycles)             1              1                              1
Memory organization (no. of banks)                        4              4                              4
Memory access time (cycles)                               50             50                             50
Memory occupancy per access (cycles)                      13             13                             13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education – not just the undergraduate curriculum – organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change – so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. They too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that Web-based teaching, distance learning, electronic books, and interactive learning environments will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Energy Minimazation Using Energy Minimazation Using Multiple Supply VoltageMultiple Supply Voltage
bull Multiple supply voltage on the chip as less aggressive approach is attracting attention
bull This has the advantage of allowing modules on the critical paths to use the highest voltage level (thus meeting the required timing constraints) while allowing modules on noncritical paths to use lower voltages (thus reducing the energy consumption)
bull This scheme tends to result in smaller area overhead compared to parallel architectures
System Level Dynamic Power Management System Level Dynamic Power Management as another Techniques to Reduce Poweras another Techniques to Reduce Power
Dynamic power management is design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components
RUN
SLEEPIDLE~90s
~10s 160ms
Wait for interrupt Wait for wake-up event
P=400mW
~90s~10s
P=50mW P=016mW
OBSERVER CONTROLLER
Workloadinformation
Power Manager
SYSTEM
Observations Commands
Power Manager Power State Machine
Power Breakdown in High-Performance Power Breakdown in High-Performance CPU and Dynamic Instruction StatisticsCPU and Dynamic Instruction Statistics
12
16
25
43
Clock
Memory
Control IODatapath
1513
51
23
43
Compare op
Logical op
Others
Data Move
Control Flow
Arithmetic op
Power breakdown Dynamic instruction statistics
Architecture Trade-offs ndash Architecture Trade-offs ndash Reference DatapathReference Datapath
Parallel DatapathParallel Datapath
The More Parallel the BetterThe More Parallel the Better
Pipeline DatapathPipeline Datapath
Architecture Summary for a SimpleArchitecture Summary for a Simple
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Microprocessorsrsquo Microprocessorsrsquo GenerationsGenerations
bullFirst generation 1971-78
bullSecond Generation 1979-85
bullThird Generation 1985-89
bullFourth Generation 1990-
ndashBehind the power curve
ndashBecoming ldquorealrdquo computers
ndashChallenging the ldquoestablishmentrdquo
ndashArchitectural and performance leadership
The microprocessor todayThe microprocessor today When we say ldquomicroprocessorrdquo today we generally mean the shaded area of the figure
The First Generation 1971-78The First Generation 1971-78
Getting enough bits and transistors Transistor counts lt 50000 Performance lt 05 MIPS Architecture 8-16 bits
ndash Narrow datapaths (= slow performance)ndash Awkward architecturesndash Assembly language + some BASIC
Processorsndash Intel 4004 8008 8080 8086ndash Zilog Z-80ndash Motorola 6800 6502
Intel 4004Intel 4004 First general-purpose
single-chip microprocessor Shipped in 1971 8-bit architecture 4-bit
implementation 2300 transistors Performance lt 01 MIPS 8008 8-bit implementation
in 1972ndash 3500 transistorsndash First microprocessor-based
computer (Micral) Targeted at laboratory
instrumentation Mostly sold in Europe
Intel 8080Intel 8080 Intelrsquos first 16-bit architecture
ndash Delivered in 1974ndash 4800 transistorsndash Performance lt 02 MIPS
Used in Altair 8800 system ndash Kit form (advertised in Popular
Electronics) in 1975 $297 or $395 with case 256 bytes of memory
expandable to 64K Keyboard and floppy 100-line bus becomes S-100
first microcomputer busndash Gates amp Allen write BASICndash Wozniak builds one
Homebrew Computer Club
Intel 8086Intel 8086
Introduced in 1978ndash Performance lt 05 MIPS
New 16-bit architecturendash ldquoAssembly languagerdquo
compatible with 8080ndash 29000 transistorsndash Includes memory protection
support for FP coprocessor In 1981 IBM introduces
PC ndash Based on 8088--8-bit bus
version of 8086
Second Generation 1979-85Second Generation 1979-85 Becoming ldquorealrdquo computers
ndash First 32-bit architecture (68000)ndash First virtual memory supportndash Workstations Macs and PCs based on microprocessors
Transistors gt50000 Performance lt= 1 MIPS Processors
ndash Motorola 68000 68020ndash Intel 80286 80386
Motorola 68000Motorola 68000 Major architectural step in
microprocessorsndash First 32-bit architecture
initial 16-bit implementation
ndash First flat 32-bit address Support for paging
ndash General-purpose register architecture
Loosely based on PDP-11
First implementation in 1979ndash 68000 transistorsndash lt 1 MIPS
Used inndash Apple Macndash Sun Silicon Graphics amp Apollo
workstations
Third Generation 1985-89Third Generation 1985-89 Challenging the ldquoestablishmentrdquo
ndash Microprocessors surpass minicomputers in performance rival mainframes
ndash Implementation technology of choice all new architectures are microprocessors
ndash RISC architecture techniques take hold Transistors lt 500K Performance gt 5 MIPS Processors
ndash MIPS R2000 R3000ndash Sun SPARCndash HP PA-RISC
MIPS R2000MIPS R2000 Several firsts
ndash First RISC microprocessor
ndash First microprocessor to provide integrated support for instruction amp data cache
ndash First pipelined microprocessor (sustains 1 instructionclock)
Implemented in 1985ndash 125000 transistorsndash 5-8 MIPS
Fourth Generation 1990-Fourth Generation 1990- Architectural and performance leadership
ndash First 64-bit architecturendash First multiple-issue machinendash First multilevel caches
Transistors gt1M Clock ratesgt 100MHz Performance gt 50 MIPS Processors
ndash Intel i860 Pentium MIPS R4000 MIPS R1000 DEC Alpha Sun UltraSPARC HP PA-RISC PowerPC
Generation 45 ndash same basic approach but faster clock rates amp wider issuendash Alpha 21264 Pentium III amp 4 Intel Itanium
Key Architectural TrendsKey Architectural Trends Increase performance at 16x per year
ndash True from 1985-present Combination of technology and architectural
enhancementsndash Technology provides faster transistors and more of themndash Faster transistors leads to high clock ratesndash More transistors
Architectural ideas turn transistors into performancendash Responsible for about half the yearly performance growth
Two key architectural directionsndash Sophisticated memory hierarchiesndash Exploiting instruction level parallelism
Memory HierarchiesMemory Hierarchies Caches hide latency of DRAM and increase BW
ndash CPU-DRAM access gap has grown by a factor of 30-50 Trend 1 Increasingly large caches
ndash On-chip from 128 bytes (1984) to 100K+ bytesndash Multilevel caches add another level of caching
First multilevel cache1986 Secondary cache sizes today 128KB to 4-16 MB
Trend 2 Advances in caching techniquesndash Reduce or hide cache miss latencies
early restart after cache miss (1992) nonblocking caches continue during a cache miss (1994)
ndash Cache aware combos computers compilers code writers prefetching instruction to bring data into cache early
Exploiting ILPExploiting ILP ILP is the implicit parallelism among instructions Exploited by
ndash Overlapping execution in a pipeline ndash Issuing multiple instruction per clock
superscalar uses dynamic issue decision (HW driven) VLIW uses static issue decision (SW driven)
1985 simple microprocessor pipeline (1 instrclock) 1990 first static multiple issue microprocessors 1995 sophisticated dynamic schemes
ndash determine parallelism dynamicallyndash execute instructions out-of-orderndash speculative execution depending on branch prediction
ldquoOff-the-shelfrdquo ILP techniques yielded 20 year path
MIPS R4000MIPS R4000 First 64-bit architecture Integrated caches
ndash On-chipndash Support for off-chip secondary
cache Integrated floating point Implemented in 1991
ndash Deep pipelinendash 14M transistorsndash Initially 100MHzndash gt 50 MIPS
Intel i860Intel i860
First multiple issue microprocessor ndash 2 instructionsclockndash Dual issue mode ndash Novel push pipelinendash Novel cache bypass
Implemented in 1991ndash 13M transistorsndash 50 mips
Used primarily as attached processor (eg graphics)
MIPS R10000MIPS R10000
First speculative processorndash Instruction scheduled and
executed out-of-orderndash Up to 4 instructions can
complete per clockndash Window of 32 instructions
(up to 32 in-flight)ndash Maintain precise state by
completing instructions in order
Implemented in 1996ndash 68M transistorsndash 200 MHz
Intel IA-64 and ItaniumIntel IA-64 and Itanium
EPIC architecturendash Use compiler centric approach
while avoiding disadvantagesndash Parallelism demarcated by the
compilerndash Many special instruction amp
features for exploiting ILP in the compiler
Itaniumndash First implementation (2001) ndash 25 M transistorsndash 800 MHzndash 130 Watts
Breakdown of tasks between Breakdown of tasks between compiler and runtime hardwarecompiler and runtime hardware
Todayrsquos Uniprocessor ILP Todayrsquos Uniprocessor ILP MenuMenu
Software Techniquesndash Static schedulingndash Static issue (ie VLIW)ndash Static branch predictionndash Aliaspointer analysisndash Static speculation
Hardware Techniquesndash Dynamic schedulingndash Dynamic issue (ie superscalar)ndash Dynamic branch predictionndash Dynamic disambiguationndash Dynamic speculation
Wide variety of approaches both hardware and compiler intensive
Lower hardware complexity
More longer range analysis
More machine dependence
Lower hardware complexity
More longer range analysis
More machine dependence
More stable performance
Higher complexity
Potential clock rate impact
More stable performance
Higher complexity
Potential clock rate impact
No clear cut winners at the present
Big Picture--ILP and Memory Big Picture--ILP and Memory SystemsSystems
Simplepipelining
Scheduledpipelines
Multipleissue
Dynamicschedulin
g
Speculation
My view
bullNo performance wall but steeper slopes ahead
bullEasier territory is behind us
bullIndustry-research gap vanished
bullEnergy efficiency may be key limit
ILPMountai
n
Multilevelcaches amp buffers
Critical word amp early restart
Compilerprefetchin
g
Multipathprefetchin
g
Simplecaches
CacheMountai
n
Per
form
ance
01
1
10
100
1965 1970 1975 1980 1985 1990 1995
Supercomputers
Minicomputers
Mainframes
Microprocessors
Microprocessors today where Microprocessors today where they are and what can dothey are and what can do
Tran
sist
ors
1000
10000
100000
1000000
10000000
100000000
1970 1975 1980 1985 1990 1995 2000 2005
Bit-level parallelism Instruction-level Thread-level ()
i4004
i8008i8080
i8086
i80286
i80386
R2000
Pentium
R10000
R3000
Microprocessors where they goMicroprocessors where they go
Intel more TransistorIntel more Transistor
Intel Faster DevicesIntel Faster Devices
Number of Transistors in Number of Transistors in Intelrsquos processorsIntelrsquos processors
Higher level parallelismHigher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The more prononuced are
a) simultaneons multithreaded (SMT) processor and
b) chip multiprocessors (CMT)
MultithreadingMultithreading
Microprocessor can execute multiple operations at a time4 or 6 operations per cycle
Hard to achieve this level of parallelism from single program
Can we run multiple programs (threads) on (single) processor without much effort
Simultaneous multithreading (SMT) or Hyperthreading is a solution
Parallel Thread Sequencing ModelParallel Thread Sequencing Model
Principles of SMTPrinciples of SMT
Multithreading in todayrsquos Multithreading in todayrsquos processorsprocessors
Today many high-end microprocessors are multithreaded (eg Intel Pentium 4)
Support for 2-4 threads but expect to get only 13X improvement in throughput
Chip MultiprocessorChip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model
CMP is a simple very powerful techiniques to obtain more performance in a power-effecient manner
The idea is to put several microprocessors on a single die This type of architecture is reffered also as Multiprocessor System-on-Chip (MPSoC)
The performance of small-scale CMP scales close to linear with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model - continue - continue
CMP is an atractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications we meet in network processors multimedia hubs signal processors etc
MPSoCs are usually implemented as heterogenous systems
CMT and SMT can coexist-a CMP die can integrate several SMT microprocessors
Generic circa 2010 Generic circa 2010 MicroprocessorMicroprocessor
4 ndash 8 general-purpose processing engines on chip used to execute independent programs
Explicitly parallel programs (when possible) Speculatively parallel threads
Special-purpose processing units (eg DSP functionality)
Elaborate memory hierarchyElaborate inter-chip communication facilities
Characteristics of superscalar simultaneous Characteristics of superscalar simultaneous multithreading and chip multiprocessor architecturesmultithreading and chip multiprocessor architectures
Characteristic SuperscalarSimultaneousmultithreading
Chipmultiprocessor
Number of CPUs 1 1 8CPU issue width 12 12 2 per CPUNumber of threads 1 8 1 per CPUArchitecture registers (for integer and floating point) 32 32 per thread 32 per CPUPhysical registers (for integer and floating point 32 + 256 256 + 256 32 + 32 per CPUInstruction window size 256 256 32 per CPUBranch predictor table size (entries) 32768 32768 8x4096Return stack size 64 entries 64 entries 8x8 entriesInstruction (I) and data (D) cache organization 1x8 banks 1x8 banks 1 bankI and D cache sizes 128 kbytes 128 kbytes 16 kbytes per CPUI and D cache associativities 4-way 4-way 4-wayI and 0 cache line sizes (bytes) 32 32 32I and P cache access times (cycles) 2 2 1Secondary cache organization (Mbytes) 1x8 banks 1x8 banks 1x8 banksSecondary cache size (bytes) 8 8 8Secondary cache associativity 4-way 4-way 4-waySecondary cache line size (bytes) 32 32 32Secondary cache access time (cycles) 5 5 7Secondary cache occupancy per access (cycles) 1 1 1Memory organization (no of banks) 4 4 4Memory access time (cycles) 50 50 50Memory occupancy per access (cycles) 13 13 13
The microprocessor tomorrowThe microprocessor tomorrow When we say ldquomicroprocessorrdquo tomorrow we generally mean the shaded area of the figure
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Challenges in EducationChallenges in Education
bullChanges in curriculaChanges in curriculabullFundamentalsFundamentalsbullA sort of the challenge we should acceptA sort of the challenge we should accept
ChalengeChalengess in Education in Education
It has often said that Where you stand depends on where you sit
In this context starting from our positions an experiences this is our view concerning the theme How shall we satisfy the long-term educational needs of engineers
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduate for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and in industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now – some examples
• Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
• Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
• Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental.
• Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. These, too, are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching,
• distance learning,
• electronic books, and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
System-Level Dynamic Power Management as Another Technique to Reduce Power
Dynamic power management is a design methodology that dynamically reconfigures an electronic system to provide the requested services and performance levels with a minimum number of active components or a minimum load on such components.
[Figure: power state machine with three states: RUN (P = 400 mW), IDLE (P = 50 mW, wait for interrupt), and SLEEP (P = 0.16 mW, wait for wake-up event). Transition labels: ~10 µs, ~90 µs, 160 ms.]
[Figure: power manager. An observer collects workload information and observations from the system; a controller issues commands to the system.]
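The power state machine above suggests a simple selection policy for the power manager. The following is a minimal sketch under stated assumptions, not the authors' power manager: the three state powers come from the figure, while the wake-up latencies, the rule that a wake-up is charged at RUN power, and the names `energy_mj` and `choose_idle_state` are assumptions introduced for illustration.

```python
# Power drawn in each state, in milliwatts (values from the power state machine figure).
STATE_POWER_MW = {"RUN": 400.0, "IDLE": 50.0, "SLEEP": 0.16}

# Assumed time to return to RUN from each low-power state, in seconds.
# These latencies are illustrative placeholders, not measured data.
WAKEUP_LATENCY_S = {"IDLE": 10e-6, "SLEEP": 160e-3}


def energy_mj(state, seconds):
    """Energy in millijoules spent by staying in `state` for `seconds`."""
    return STATE_POWER_MW[state] * seconds


def choose_idle_state(expected_idle_s):
    """Return the state that minimizes energy over an idle period.

    Assumption: the system must be back in RUN when the period ends, and
    the wake-up transition itself is charged at RUN power.
    """
    best_state = "RUN"
    best_energy = energy_mj("RUN", expected_idle_s)
    for state, latency in WAKEUP_LATENCY_S.items():
        if latency >= expected_idle_s:
            continue  # cannot wake up in time, so this state is not usable
        energy = (energy_mj(state, expected_idle_s - latency)
                  + energy_mj("RUN", latency))
        if energy < best_energy:
            best_state, best_energy = state, energy
    return best_state


if __name__ == "__main__":
    for idle in (5e-6, 1e-3, 0.2, 10.0):
        print(f"idle for {idle:g} s -> {choose_idle_state(idle)}")
```

Under these assumptions, short idle periods favor IDLE and only long ones justify SLEEP, because the long wake-up has to be amortized over the time spent asleep.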
Power Breakdown in High-Performance CPU and Dynamic Instruction Statistics
[Figure: two pie charts. Power breakdown by block: clock, memory, control, I/O, datapath. Dynamic instruction statistics by class: arithmetic op, data move, control flow, compare op, logical op, others.]
Architecture Trade-offs – Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline – Microprocessors' Generations
• First Generation, 1971-78: behind the power curve
• Second Generation, 1979-85: becoming "real" computers
• Third Generation, 1985-89: challenging the "establishment"
• Fourth Generation, 1990-: architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation, 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  – Narrow datapaths (= slow performance)
  – Awkward architectures
  – Assembly language + some BASIC
• Processors
  – Intel 4004, 8008, 8080, 8086
  – Zilog Z-80
  – Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  – 3,500 transistors
  – First microprocessor-based computer (Micral)
    – Targeted at laboratory instrumentation
    – Mostly sold in Europe
Intel 8080
• 8-bit architecture with 16-bit addressing
  – Delivered in 1974
  – 4,800 transistors
  – Performance < 0.2 MIPS
• Used in the Altair 8800 system
  – Kit form (advertised in Popular Electronics) in 1975
    – $297, or $395 with case
    – 256 bytes of memory, expandable to 64K
    – Keyboard and floppy
    – 100-line bus becomes S-100, the first microcomputer bus
  – Gates & Allen write BASIC
  – Wozniak builds one
    – Homebrew Computer Club
Intel 8086
• Introduced in 1978
  – Performance < 0.5 MIPS
• New 16-bit architecture
  – "Assembly language" compatible with 8080
  – 29,000 transistors
  – Includes memory protection, support for FP coprocessor
• In 1981, IBM introduces the PC
  – Based on the 8088, an 8-bit bus version of the 8086
Second Generation, 1979-85
• Becoming "real" computers
  – First 32-bit architecture (68000)
  – First virtual memory support
  – Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  – Motorola 68000, 68020
  – Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  – First 32-bit architecture, initial 16-bit implementation
  – First flat 32-bit address; support for paging
  – General-purpose register architecture, loosely based on PDP-11
• First implementation in 1979
  – 68,000 transistors
  – < 1 MIPS
• Used in
  – Apple Mac
  – Sun, Silicon Graphics & Apollo workstations
Third Generation, 1985-89
• Challenging the "establishment"
  – Microprocessors surpass minicomputers in performance, rival mainframes
  – Implementation technology of choice: all new architectures are microprocessors
  – RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  – MIPS R2000, R3000
  – Sun SPARC
  – HP PA-RISC
MIPS R2000
• Several firsts
  – First RISC microprocessor
  – First microprocessor to provide integrated support for instruction & data cache
  – First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  – 125,000 transistors
  – 5-8 MIPS
Fourth Generation, 1990-
• Architectural and performance leadership
  – First 64-bit architecture
  – First multiple-issue machine
  – First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  – Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  – Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Performance increases at 1.6x per year
  – True from 1985 to the present
• Combination of technology and architectural enhancements
  – Technology provides faster transistors and more of them
  – Faster transistors lead to higher clock rates
  – More transistors
• Architectural ideas turn transistors into performance
  – Responsible for about half the yearly performance growth
• Two key architectural directions
  – Sophisticated memory hierarchies
  – Exploiting instruction-level parallelism
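To get a feel for what a yearly factor of about 1.6 means when compounded, here is a short sketch. The function names are hypothetical, and the even split between technology and architecture is only one possible reading of "about half the yearly performance growth" (it assumes the two contributions multiply and are equal).

```python
import math

ANNUAL_GROWTH = 1.6  # yearly performance factor, read from the slide as 1.6x


def cumulative_speedup(years, annual_growth=ANNUAL_GROWTH):
    """Total speedup after compounding the yearly factor for `years` years."""
    return annual_growth ** years


def even_split_factor(annual_growth=ANNUAL_GROWTH):
    """Yearly factor attributed to each of two equal multiplicative contributors.

    Assumption: technology and architecture each contribute the same factor,
    so each one is the square root of the combined yearly growth.
    """
    return math.sqrt(annual_growth)


if __name__ == "__main__":
    for years in (5, 10, 15):
        print(f"{years:2d} years -> {cumulative_speedup(years):8.1f}x")
    print(f"per-contributor yearly factor: {even_split_factor():.3f}x")
```

Fifteen years of compounding at this rate gives a speedup of roughly three orders of magnitude, which is why the slide treats the combination of technology and architecture as the central trend.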
Memory Hierarchies
• Caches hide latency of DRAM and increase bandwidth
  – CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  – On-chip: from 128 bytes (1984) to 100K+ bytes
  – Multilevel caches add another level of caching
    – First multilevel cache: 1986
    – Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  – Reduce or hide cache miss latencies
    – Early restart after cache miss (1992)
    – Nonblocking caches: continue during a cache miss (1994)
  – Cache-aware combos: computers, compilers, code writers
    – Prefetching: instruction to bring data into cache early
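The benefit of a multilevel cache can be made concrete with the usual average-access-time calculation. This is a sketch only: the function name is hypothetical, and the latencies and miss rates in the example are assumed values chosen for illustration, not measurements from any processor in this deck.

```python
def average_access_time(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, memory):
    """Average memory access time, in cycles, for a two-level cache hierarchy.

    Every access pays the L1 hit time; L1 misses additionally pay the L2 hit
    time; L2 misses additionally pay the main-memory access time.
    """
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * memory)


if __name__ == "__main__":
    # Assumed example: 2-cycle L1, 5-cycle L2, 50-cycle memory,
    # 5% L1 miss rate, 20% local L2 miss rate.
    with_l2 = average_access_time(2, 0.05, 5, 0.20, 50)
    # Same L1 backed directly by memory (the "L2" here is the memory itself).
    without_l2 = average_access_time(2, 0.05, 50, 0.0, 0)
    print(f"with secondary cache:    {with_l2:.2f} cycles")
    print(f"without secondary cache: {without_l2:.2f} cycles")
```

With these assumed numbers the secondary cache hides most of the DRAM latency from the processor, which is the effect the "Trend 1" bullets describe.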
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  – Overlapping execution in a pipeline
  – Issuing multiple instructions per clock
    – Superscalar: uses dynamic issue decision (HW driven)
    – VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  – Determine parallelism dynamically
  – Execute instructions out of order
  – Speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded a 20-year path
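How much multiple issue helps depends on the dependences among instructions. The sketch below is a toy model, not a description of any real processor: it assumes unit-latency instructions, no resource limits other than issue width, and a greedy issue rule that may skip over instructions whose operands are not ready. The function name and the example dependence graph are hypothetical.

```python
def schedule_cycles(deps, issue_width):
    """Count cycles needed to issue all instructions.

    `deps` maps an instruction id to the ids it depends on (an acyclic graph).
    Each cycle, up to `issue_width` ready instructions are issued in id order;
    an instruction is ready once all of its producers issued in earlier cycles.
    """
    issued_at = {}            # instruction id -> cycle in which it issued
    remaining = sorted(deps)  # instructions not yet issued, in program order
    cycle = 0
    while remaining:
        cycle += 1
        issuing = []
        for instr in remaining:
            if len(issuing) == issue_width:
                break
            # A producer not yet issued defaults to the current cycle,
            # which makes the consumer not ready.
            if all(issued_at.get(d, cycle) < cycle for d in deps[instr]):
                issuing.append(instr)
        for instr in issuing:
            issued_at[instr] = cycle
            remaining.remove(instr)
    return cycle


if __name__ == "__main__":
    # Hypothetical six-instruction block: 2 needs 0 and 1, 4 needs 3, 5 needs 2 and 4.
    block = {0: [], 1: [], 2: [0, 1], 3: [], 4: [3], 5: [2, 4]}
    for width in (1, 2, 4):
        print(f"issue width {width}: {schedule_cycles(block, width)} cycles")
```

In this toy block, widening issue beyond the length of the longest dependence chain brings no further gain, which illustrates why parallelism is hard to extract from a single program.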
MIPS R4000
• First 64-bit architecture
• Integrated caches
  – On-chip
  – Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  – Deep pipeline
  – 1.4M transistors
  – Initially 100 MHz
  – > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  – 2 instructions/clock
  – Dual issue mode
  – Novel push pipeline
  – Novel cache bypass
• Implemented in 1991
  – 1.3M transistors
  – 50 MIPS
• Used primarily as attached processor (e.g., graphics)
MIPS R10000
• First speculative processor
  – Instructions scheduled and executed out of order
  – Up to 4 instructions can complete per clock
  – Window of 32 instructions (up to 32 in flight)
  – Maintains precise state by completing instructions in order
• Implemented in 1996
  – 6.8M transistors
  – 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  – Uses a compiler-centric approach while avoiding its disadvantages
  – Parallelism demarcated by the compiler
  – Many special instructions & features for exploiting ILP in the compiler
• Itanium
  – First implementation (2001)
  – 25M transistors
  – 800 MHz
  – 130 watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques (lower hardware complexity, more longer-range analysis, more machine dependence)
  – Static scheduling
  – Static issue (i.e., VLIW)
  – Static branch prediction
  – Alias/pointer analysis
  – Static speculation
• Hardware techniques (more stable performance, higher complexity, potential clock rate impact)
  – Dynamic scheduling
  – Dynamic issue (i.e., superscalar)
  – Dynamic branch prediction
  – Dynamic disambiguation
  – Dynamic speculation
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers, and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970-2005, spanning the eras of bit-level, instruction-level, and thread-level parallelism. Data points: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000.]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of Transistors in Intel's processors
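The transistor counts quoted for individual processors in this deck can be turned into an implied doubling period. The sketch below uses two of those data points (Intel 4004: 2,300 transistors in 1971; MIPS R2000: 125,000 transistors in 1985); the function name is hypothetical, and fitting an exponential through just two points is an assumption made for illustration.

```python
import math


def doubling_time_years(count_start, year_start, count_end, year_end):
    """Years per doubling implied by exponential growth between two data points."""
    doublings = math.log2(count_end / count_start)
    return (year_end - year_start) / doublings


if __name__ == "__main__":
    years = doubling_time_years(2300, 1971, 125000, 1985)
    print(f"implied doubling time: {years:.2f} years ({years * 12:.0f} months)")
```

Any other pair of processors from the generation slides can be substituted to see how the implied doubling period shifts between eras.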
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to achieve higher performance (throughput) at better energy efficiency.
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
They support 2-4 threads, but expect only about a 1.3X improvement in throughput.
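The throughput figure above also has a latency side. As a rough sketch, taking the quoted gain as 1.3x and assuming the threads share the core equally (an assumption; real sharing depends on the workload), the speed seen by each individual thread can be estimated as follows. The function name is hypothetical.

```python
def per_thread_speed(threads, throughput_gain):
    """Speed of one thread relative to running alone on the same core.

    Assumption: total throughput is `throughput_gain` times the single-thread
    throughput and is divided equally among the threads.
    """
    return throughput_gain / threads


if __name__ == "__main__":
    for threads in (2, 4):
        speed = per_thread_speed(threads, 1.3)
        print(f"{threads} threads: each runs at about {speed:.0%} of single-thread speed")
```

The model makes the trade-off explicit: the processor completes more work overall, while each thread individually runs slower than it would alone.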
Chip Multiprocessor
• Several processor cores on one die
• Shared L2 caches
• Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa-2010 Microprocessor
• 4-8 general-purpose processing engines on chip, used to execute
  – Independent programs
  – Explicitly parallel programs (when possible)
  – Speculatively parallel threads
• Special-purpose processing units (e.g., DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Microprocessorsrsquo Microprocessorsrsquo GenerationsGenerations
bullFirst generation 1971-78
bullSecond Generation 1979-85
bullThird Generation 1985-89
bullFourth Generation 1990-
ndashBehind the power curve
ndashBecoming ldquorealrdquo computers
ndashChallenging the ldquoestablishmentrdquo
ndashArchitectural and performance leadership
The microprocessor todayThe microprocessor today When we say ldquomicroprocessorrdquo today we generally mean the shaded area of the figure
The First Generation 1971-78The First Generation 1971-78
Getting enough bits and transistors Transistor counts lt 50000 Performance lt 05 MIPS Architecture 8-16 bits
ndash Narrow datapaths (= slow performance)ndash Awkward architecturesndash Assembly language + some BASIC
Processorsndash Intel 4004 8008 8080 8086ndash Zilog Z-80ndash Motorola 6800 6502
Intel 4004Intel 4004 First general-purpose
single-chip microprocessor Shipped in 1971 8-bit architecture 4-bit
implementation 2300 transistors Performance lt 01 MIPS 8008 8-bit implementation
in 1972ndash 3500 transistorsndash First microprocessor-based
computer (Micral) Targeted at laboratory
instrumentation Mostly sold in Europe
Intel 8080Intel 8080 Intelrsquos first 16-bit architecture
ndash Delivered in 1974ndash 4800 transistorsndash Performance lt 02 MIPS
Used in Altair 8800 system ndash Kit form (advertised in Popular
Electronics) in 1975 $297 or $395 with case 256 bytes of memory
expandable to 64K Keyboard and floppy 100-line bus becomes S-100
first microcomputer busndash Gates amp Allen write BASICndash Wozniak builds one
Homebrew Computer Club
Intel 8086Intel 8086
Introduced in 1978ndash Performance lt 05 MIPS
New 16-bit architecturendash ldquoAssembly languagerdquo
compatible with 8080ndash 29000 transistorsndash Includes memory protection
support for FP coprocessor In 1981 IBM introduces
PC ndash Based on 8088--8-bit bus
version of 8086
Second Generation 1979-85Second Generation 1979-85 Becoming ldquorealrdquo computers
ndash First 32-bit architecture (68000)ndash First virtual memory supportndash Workstations Macs and PCs based on microprocessors
Transistors gt50000 Performance lt= 1 MIPS Processors
ndash Motorola 68000 68020ndash Intel 80286 80386
Motorola 68000Motorola 68000 Major architectural step in
microprocessorsndash First 32-bit architecture
initial 16-bit implementation
ndash First flat 32-bit address Support for paging
ndash General-purpose register architecture
Loosely based on PDP-11
First implementation in 1979ndash 68000 transistorsndash lt 1 MIPS
Used inndash Apple Macndash Sun Silicon Graphics amp Apollo
workstations
Third Generation 1985-89Third Generation 1985-89 Challenging the ldquoestablishmentrdquo
ndash Microprocessors surpass minicomputers in performance rival mainframes
ndash Implementation technology of choice all new architectures are microprocessors
ndash RISC architecture techniques take hold Transistors lt 500K Performance gt 5 MIPS Processors
ndash MIPS R2000 R3000ndash Sun SPARCndash HP PA-RISC
MIPS R2000MIPS R2000 Several firsts
ndash First RISC microprocessor
ndash First microprocessor to provide integrated support for instruction amp data cache
ndash First pipelined microprocessor (sustains 1 instructionclock)
Implemented in 1985ndash 125000 transistorsndash 5-8 MIPS
Fourth Generation 1990-Fourth Generation 1990- Architectural and performance leadership
ndash First 64-bit architecturendash First multiple-issue machinendash First multilevel caches
Transistors gt1M Clock ratesgt 100MHz Performance gt 50 MIPS Processors
ndash Intel i860 Pentium MIPS R4000 MIPS R1000 DEC Alpha Sun UltraSPARC HP PA-RISC PowerPC
Generation 45 ndash same basic approach but faster clock rates amp wider issuendash Alpha 21264 Pentium III amp 4 Intel Itanium
Key Architectural TrendsKey Architectural Trends Increase performance at 16x per year
ndash True from 1985-present Combination of technology and architectural
enhancementsndash Technology provides faster transistors and more of themndash Faster transistors leads to high clock ratesndash More transistors
Architectural ideas turn transistors into performancendash Responsible for about half the yearly performance growth
Two key architectural directionsndash Sophisticated memory hierarchiesndash Exploiting instruction level parallelism
Memory HierarchiesMemory Hierarchies Caches hide latency of DRAM and increase BW
ndash CPU-DRAM access gap has grown by a factor of 30-50 Trend 1 Increasingly large caches
ndash On-chip from 128 bytes (1984) to 100K+ bytesndash Multilevel caches add another level of caching
First multilevel cache1986 Secondary cache sizes today 128KB to 4-16 MB
Trend 2 Advances in caching techniquesndash Reduce or hide cache miss latencies
early restart after cache miss (1992) nonblocking caches continue during a cache miss (1994)
ndash Cache aware combos computers compilers code writers prefetching instruction to bring data into cache early
Exploiting ILPExploiting ILP ILP is the implicit parallelism among instructions Exploited by
ndash Overlapping execution in a pipeline ndash Issuing multiple instruction per clock
superscalar uses dynamic issue decision (HW driven) VLIW uses static issue decision (SW driven)
1985 simple microprocessor pipeline (1 instrclock) 1990 first static multiple issue microprocessors 1995 sophisticated dynamic schemes
ndash determine parallelism dynamicallyndash execute instructions out-of-orderndash speculative execution depending on branch prediction
ldquoOff-the-shelfrdquo ILP techniques yielded 20 year path
MIPS R4000MIPS R4000 First 64-bit architecture Integrated caches
ndash On-chipndash Support for off-chip secondary
cache Integrated floating point Implemented in 1991
ndash Deep pipelinendash 14M transistorsndash Initially 100MHzndash gt 50 MIPS
Intel i860Intel i860
First multiple issue microprocessor ndash 2 instructionsclockndash Dual issue mode ndash Novel push pipelinendash Novel cache bypass
Implemented in 1991ndash 13M transistorsndash 50 mips
Used primarily as attached processor (eg graphics)
MIPS R10000MIPS R10000
First speculative processorndash Instruction scheduled and
executed out-of-orderndash Up to 4 instructions can
complete per clockndash Window of 32 instructions
(up to 32 in-flight)ndash Maintain precise state by
completing instructions in order
Implemented in 1996ndash 68M transistorsndash 200 MHz
Intel IA-64 and ItaniumIntel IA-64 and Itanium
EPIC architecturendash Use compiler centric approach
while avoiding disadvantagesndash Parallelism demarcated by the
compilerndash Many special instruction amp
features for exploiting ILP in the compiler
Itaniumndash First implementation (2001) ndash 25 M transistorsndash 800 MHzndash 130 Watts
Breakdown of tasks between Breakdown of tasks between compiler and runtime hardwarecompiler and runtime hardware
Todayrsquos Uniprocessor ILP Todayrsquos Uniprocessor ILP MenuMenu
Software Techniquesndash Static schedulingndash Static issue (ie VLIW)ndash Static branch predictionndash Aliaspointer analysisndash Static speculation
Hardware Techniquesndash Dynamic schedulingndash Dynamic issue (ie superscalar)ndash Dynamic branch predictionndash Dynamic disambiguationndash Dynamic speculation
Wide variety of approaches both hardware and compiler intensive
Lower hardware complexity
More longer range analysis
More machine dependence
Lower hardware complexity
More longer range analysis
More machine dependence
More stable performance
Higher complexity
Potential clock rate impact
More stable performance
Higher complexity
Potential clock rate impact
No clear cut winners at the present
Big Picture--ILP and Memory Big Picture--ILP and Memory SystemsSystems
Simplepipelining
Scheduledpipelines
Multipleissue
Dynamicschedulin
g
Speculation
My view
bullNo performance wall but steeper slopes ahead
bullEasier territory is behind us
bullIndustry-research gap vanished
bullEnergy efficiency may be key limit
ILPMountai
n
Multilevelcaches amp buffers
Critical word amp early restart
Compilerprefetchin
g
Multipathprefetchin
g
Simplecaches
CacheMountai
n
Per
form
ance
01
1
10
100
1965 1970 1975 1980 1985 1990 1995
Supercomputers
Minicomputers
Mainframes
Microprocessors
Microprocessors today where Microprocessors today where they are and what can dothey are and what can do
Tran
sist
ors
1000
10000
100000
1000000
10000000
100000000
1970 1975 1980 1985 1990 1995 2000 2005
Bit-level parallelism Instruction-level Thread-level ()
i4004
i8008i8080
i8086
i80286
i80386
R2000
Pentium
R10000
R3000
Microprocessors where they goMicroprocessors where they go
Intel more TransistorIntel more Transistor
Intel Faster DevicesIntel Faster Devices
Number of Transistors in Number of Transistors in Intelrsquos processorsIntelrsquos processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP).
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyper-Threading, is a solution.
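The idea of filling otherwise idle issue slots with operations from a second thread can be sketched as follows. This is an idealized model, not the authors' own: the per-cycle counts of ready operations are hypothetical, and structural limits such as shared caches and fetch bandwidth are ignored.

```python
def issued_ops(ready_per_cycle, issue_width):
    """Total operations issued when each cycle can issue at most `issue_width` ops."""
    return sum(min(ready, issue_width) for ready in ready_per_cycle)


ISSUE_WIDTH = 4  # a 4-wide machine, matching the "4 or 6 operations per cycle" range

# Hypothetical numbers of independent, ready operations per cycle for two threads.
thread_a = [1, 2, 0, 3, 1, 2]
thread_b = [2, 1, 3, 0, 2, 1]

# Single-threaded: only thread A can use the issue slots.
single = issued_ops(thread_a, ISSUE_WIDTH)

# SMT: ready operations from both threads share the same issue slots each cycle.
combined = [a + b for a, b in zip(thread_a, thread_b)]
smt = issued_ops(combined, ISSUE_WIDTH)

slots = ISSUE_WIDTH * len(thread_a)
print(f"single thread: {single}/{slots} issue slots used")
print(f"two threads:   {smt}/{slots} issue slots used")
```

Because the threads contend for caches, fetch bandwidth, and other shared resources, real throughput gains are smaller than this slot count suggests.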
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
They support 2-4 threads, but can be expected to deliver only about a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute:
– Independent programs
– Explicitly parallel programs (when possible)
– Speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now: some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. They too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
– Web-based teaching,
– distance learning,
– electronic books, and
– interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he had never seen a process that could not be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Architecture Trade-offs: Reference Datapath
Parallel Datapath
The More Parallel the Better
Pipeline Datapath
Architecture Summary for a Simple
Outline: Microprocessors' Generations
• First Generation 1971-78
– Behind the power curve
• Second Generation 1979-85
– Becoming "real" computers
• Third Generation 1985-89
– Challenging the "establishment"
• Fourth Generation 1990-
– Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation 1971-78
Getting enough bits and transistors
Transistor counts < 50,000
Performance < 0.5 MIPS
Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
First general-purpose single-chip microprocessor
Shipped in 1971
8-bit architecture, 4-bit implementation
2,300 transistors
Performance < 0.1 MIPS
8008: 8-bit implementation in 1972
– 3,500 transistors
– First microprocessor-based computer (Micral)
  - Targeted at laboratory instrumentation
  - Mostly sold in Europe
Intel 8080
8-bit architecture with 16-bit addressing
– Delivered in 1974
– 4,800 transistors
– Performance < 0.2 MIPS
Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
  - $297, or $395 with case
  - 256 bytes of memory, expandable to 64K
  - Keyboard and floppy
  - 100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
  - Homebrew Computer Club
Intel 8086
Introduced in 1978
– Performance < 0.5 MIPS
New 16-bit architecture
– "Assembly language" compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
In 1981 IBM introduces the PC
– Based on the 8088, an 8-bit bus version of the 8086
Second Generation 1979-85
Becoming "real" computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs, and PCs based on microprocessors
Transistors > 50,000
Performance <= 1 MIPS
Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
Major architectural step in microprocessors
– First 32-bit architecture
  - initial 16-bit implementation
– First flat 32-bit address
  - Support for paging
– General-purpose register architecture
  - Loosely based on PDP-11
First implementation in 1979
– 68,000 transistors
– < 1 MIPS
Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the "establishment"
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
Transistors < 500K
Performance > 5 MIPS
Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation 1990-
Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
Transistors > 1M
Clock rates > 100 MHz
Performance > 50 MIPS
Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
Generation 4.5: same basic approach but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
Increase performance at 1.6x per year
– True from 1985 to the present
Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to high clock rates
– More transistors: architectural ideas turn transistors into performance
  - Responsible for about half the yearly performance growth
Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism
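A short worked calculation shows what a compounding yearly growth rate implies. It takes the quoted rate to be 1.6x per year (an assumption about the intended figure), and the even multiplicative split between technology and architecture is only one possible reading of "about half".

```python
import math

ANNUAL_GROWTH = 1.6  # assumed yearly performance multiplier


def cumulative_speedup(years, annual=ANNUAL_GROWTH):
    """Total speedup after `years` of compounding growth."""
    return annual ** years


def doubling_time_years(annual=ANNUAL_GROWTH):
    """Years needed for performance to double at the given yearly multiplier."""
    return math.log(2) / math.log(annual)


# Assumed reading of "about half": technology and architecture each contribute
# an equal multiplicative factor, so each factor is the square root of 1.6.
per_source_factor = math.sqrt(ANNUAL_GROWTH)

print(f"speedup after 10 years: {cumulative_speedup(10):.1f}x")
print(f"doubling time: {doubling_time_years():.2f} years")
print(f"per-source yearly factor: {per_source_factor:.3f}x")
```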
Memory Hierarchies
Caches hide latency of DRAM and increase BW
– CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches: add another level of caching
  - First multilevel cache: 1986
  - Secondary cache sizes today: 128KB to 4-16 MB
Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
  - early restart after cache miss (1992)
  - nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
  - prefetching: instruction to bring data into cache early
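How a second cache level hides DRAM latency can be illustrated with the standard average memory access time (AMAT) formula. In this sketch the access times (2, 5, and 50 cycles) are the superscalar values from the deck's characteristics table, while the miss rates are hypothetical numbers chosen only for illustration.

```python
def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_latency):
    """Average memory access time (cycles) with primary and secondary caches."""
    l1_miss_penalty = l2_hit + l2_miss_rate * mem_latency
    return l1_hit + l1_miss_rate * l1_miss_penalty


def amat_one_level(l1_hit, l1_miss_rate, mem_latency):
    """Average memory access time (cycles) with a primary cache only."""
    return l1_hit + l1_miss_rate * mem_latency


L1_HIT, L2_HIT, MEM = 2, 5, 50        # cycles, from the characteristics table
L1_MISS, L2_MISS = 0.05, 0.20         # hypothetical miss rates

print("L1 only:", amat_one_level(L1_HIT, L1_MISS, MEM), "cycles")
print("L1 + L2:", amat_two_level(L1_HIT, L1_MISS, L2_HIT, L2_MISS, MEM), "cycles")
```

With these assumed miss rates, most primary-cache misses are served by the secondary cache instead of main memory, which is why the average drops.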
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
  - superscalar: uses dynamic issue decision (HW driven)
  - VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out-of-order
– speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded 20 year path
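The limit that data dependences place on multiple issue can be sketched with a toy scheduler. This is an illustrative model, not any real processor's issue logic: every instruction has unit latency, and the six-instruction program and its dependence sets are hypothetical.

```python
def cycles_to_issue(deps, width):
    """Cycles needed to issue all instructions with at most `width` issues per cycle.

    deps[i] is the set of earlier instruction indices whose results instruction i needs.
    An instruction may issue only after all of its producers issued in an earlier cycle.
    """
    done_cycle = {}
    remaining = list(range(len(deps)))
    cycle = 0
    while remaining:
        cycle += 1
        issued = []
        for i in remaining:
            if len(issued) == width:
                break
            if all(d in done_cycle and done_cycle[d] < cycle for d in deps[i]):
                issued.append(i)
        for i in issued:
            done_cycle[i] = cycle
            remaining.remove(i)
    return cycle


# Hypothetical program: f = (load x + load y) + (load z * 2)
program = [
    set(),      # 0: a = load x
    set(),      # 1: b = load y
    {0, 1},     # 2: c = a + b
    set(),      # 3: d = load z
    {3},        # 4: e = d * 2
    {2, 4},     # 5: f = c + e
]

for width in (1, 2, 4):
    print(f"issue width {width}: {cycles_to_issue(program, width)} cycles")
```

Widening the machine helps only until the longest dependence chain (here three instructions long) becomes the bottleneck, which is one reason wider issue gives diminishing returns.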
MIPS R4000
First 64-bit architecture
Integrated caches
– On-chip
– Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
Implemented in 1991
– 1.3M transistors
– 50 MIPS
Used primarily as attached processor (e.g., graphics)
MIPS R10000
First speculative processor
– Instructions scheduled and executed out-of-order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in-flight)
– Maintain precise state by completing instructions in order
Implemented in 1996
– 6.8M transistors
– 200 MHz
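The combination of out-of-order execution with in-order completion can be sketched with a reorder-buffer model. This is a generic illustration, not the R10000's actual design: the finish cycles are hypothetical, and only the limit of 4 completions per clock comes from the slide.

```python
from collections import deque


def retire_order(finish_cycle, retire_width=4):
    """Return (cycle, [instruction ids]) groups; instructions retire in program order.

    finish_cycle[i] is the (possibly out-of-order) cycle in which instruction i
    finishes executing. At most `retire_width` instructions retire per cycle.
    """
    rob = deque(range(len(finish_cycle)))  # reorder buffer, oldest instruction first
    retired = []
    cycle = 0
    while rob:
        cycle += 1
        group = []
        # Only the head may retire: a finished younger instruction waits for older ones,
        # so architectural state always reflects a precise point in the program.
        while rob and len(group) < retire_width and finish_cycle[rob[0]] <= cycle:
            group.append(rob.popleft())
        if group:
            retired.append((cycle, group))
    return retired


# Hypothetical finish cycles: instructions 1 and 2 finish before the older instruction 0.
print(retire_order([3, 1, 2, 5, 4, 4]))
```

Because results become architecturally visible only at retirement, an exception or mispredicted branch can discard the younger instructions without corrupting state.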
Intel IA-64 and Itanium
EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Parallel DatapathParallel Datapath
The More Parallel the BetterThe More Parallel the Better
Pipeline DatapathPipeline Datapath
Architecture Summary for a SimpleArchitecture Summary for a Simple
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Microprocessorsrsquo Microprocessorsrsquo GenerationsGenerations
bullFirst generation 1971-78
bullSecond Generation 1979-85
bullThird Generation 1985-89
bullFourth Generation 1990-
ndashBehind the power curve
ndashBecoming ldquorealrdquo computers
ndashChallenging the ldquoestablishmentrdquo
ndashArchitectural and performance leadership
The microprocessor todayThe microprocessor today When we say ldquomicroprocessorrdquo today we generally mean the shaded area of the figure
The First Generation 1971-78The First Generation 1971-78
Getting enough bits and transistors Transistor counts lt 50000 Performance lt 05 MIPS Architecture 8-16 bits
ndash Narrow datapaths (= slow performance)ndash Awkward architecturesndash Assembly language + some BASIC
Processorsndash Intel 4004 8008 8080 8086ndash Zilog Z-80ndash Motorola 6800 6502
Intel 4004Intel 4004 First general-purpose
single-chip microprocessor Shipped in 1971 8-bit architecture 4-bit
implementation 2300 transistors Performance lt 01 MIPS 8008 8-bit implementation
in 1972ndash 3500 transistorsndash First microprocessor-based
computer (Micral) Targeted at laboratory
instrumentation Mostly sold in Europe
Intel 8080Intel 8080 Intelrsquos first 16-bit architecture
ndash Delivered in 1974ndash 4800 transistorsndash Performance lt 02 MIPS
Used in Altair 8800 system ndash Kit form (advertised in Popular
Electronics) in 1975 $297 or $395 with case 256 bytes of memory
expandable to 64K Keyboard and floppy 100-line bus becomes S-100
first microcomputer busndash Gates amp Allen write BASICndash Wozniak builds one
Homebrew Computer Club
Intel 8086Intel 8086
Introduced in 1978ndash Performance lt 05 MIPS
New 16-bit architecturendash ldquoAssembly languagerdquo
compatible with 8080ndash 29000 transistorsndash Includes memory protection
support for FP coprocessor In 1981 IBM introduces
PC ndash Based on 8088--8-bit bus
version of 8086
Second Generation 1979-85Second Generation 1979-85 Becoming ldquorealrdquo computers
ndash First 32-bit architecture (68000)ndash First virtual memory supportndash Workstations Macs and PCs based on microprocessors
Transistors gt50000 Performance lt= 1 MIPS Processors
ndash Motorola 68000 68020ndash Intel 80286 80386
Motorola 68000Motorola 68000 Major architectural step in
microprocessorsndash First 32-bit architecture
initial 16-bit implementation
ndash First flat 32-bit address Support for paging
ndash General-purpose register architecture
Loosely based on PDP-11
First implementation in 1979ndash 68000 transistorsndash lt 1 MIPS
Used inndash Apple Macndash Sun Silicon Graphics amp Apollo
workstations
Third Generation 1985-89Third Generation 1985-89 Challenging the ldquoestablishmentrdquo
ndash Microprocessors surpass minicomputers in performance rival mainframes
ndash Implementation technology of choice all new architectures are microprocessors
ndash RISC architecture techniques take hold Transistors lt 500K Performance gt 5 MIPS Processors
ndash MIPS R2000 R3000ndash Sun SPARCndash HP PA-RISC
MIPS R2000MIPS R2000 Several firsts
ndash First RISC microprocessor
ndash First microprocessor to provide integrated support for instruction amp data cache
ndash First pipelined microprocessor (sustains 1 instructionclock)
Implemented in 1985ndash 125000 transistorsndash 5-8 MIPS
Fourth Generation 1990-Fourth Generation 1990- Architectural and performance leadership
ndash First 64-bit architecturendash First multiple-issue machinendash First multilevel caches
Transistors gt1M Clock ratesgt 100MHz Performance gt 50 MIPS Processors
ndash Intel i860 Pentium MIPS R4000 MIPS R1000 DEC Alpha Sun UltraSPARC HP PA-RISC PowerPC
Generation 45 ndash same basic approach but faster clock rates amp wider issuendash Alpha 21264 Pentium III amp 4 Intel Itanium
Key Architectural TrendsKey Architectural Trends Increase performance at 16x per year
ndash True from 1985-present Combination of technology and architectural
enhancementsndash Technology provides faster transistors and more of themndash Faster transistors leads to high clock ratesndash More transistors
Architectural ideas turn transistors into performancendash Responsible for about half the yearly performance growth
Two key architectural directionsndash Sophisticated memory hierarchiesndash Exploiting instruction level parallelism
Memory HierarchiesMemory Hierarchies Caches hide latency of DRAM and increase BW
ndash CPU-DRAM access gap has grown by a factor of 30-50 Trend 1 Increasingly large caches
ndash On-chip from 128 bytes (1984) to 100K+ bytesndash Multilevel caches add another level of caching
First multilevel cache1986 Secondary cache sizes today 128KB to 4-16 MB
Trend 2 Advances in caching techniquesndash Reduce or hide cache miss latencies
early restart after cache miss (1992) nonblocking caches continue during a cache miss (1994)
ndash Cache aware combos computers compilers code writers prefetching instruction to bring data into cache early
Exploiting ILPExploiting ILP ILP is the implicit parallelism among instructions Exploited by
ndash Overlapping execution in a pipeline ndash Issuing multiple instruction per clock
superscalar uses dynamic issue decision (HW driven) VLIW uses static issue decision (SW driven)
1985 simple microprocessor pipeline (1 instrclock) 1990 first static multiple issue microprocessors 1995 sophisticated dynamic schemes
ndash determine parallelism dynamicallyndash execute instructions out-of-orderndash speculative execution depending on branch prediction
ldquoOff-the-shelfrdquo ILP techniques yielded 20 year path
MIPS R4000MIPS R4000 First 64-bit architecture Integrated caches
ndash On-chipndash Support for off-chip secondary
cache Integrated floating point Implemented in 1991
ndash Deep pipelinendash 14M transistorsndash Initially 100MHzndash gt 50 MIPS
Intel i860
First multiple-issue microprocessor
- 2 instructions/clock
- Dual issue mode
- Novel push pipeline
- Novel cache bypass
Implemented in 1991
- 1.3M transistors
- 50 MIPS
Used primarily as attached processor (e.g., graphics)
MIPS R10000
First speculative processor
- Instructions scheduled and executed out of order
- Up to 4 instructions can complete per clock
- Window of 32 instructions (up to 32 in flight)
- Maintains precise state by completing instructions in order
Implemented in 1996
- 6.8M transistors
- 200 MHz
Intel IA-64 and Itanium
EPIC architecture
- Uses a compiler-centric approach while avoiding its disadvantages
- Parallelism demarcated by the compiler
- Many special instructions & features for exploiting ILP in the compiler
Itanium
- First implementation (2001)
- 25M transistors
- 800 MHz
- 130 Watts
[Figure: Breakdown of tasks between compiler and runtime hardware]
Today's Uniprocessor ILP Menu
Software Techniques
- Static scheduling
- Static issue (i.e., VLIW)
- Static branch prediction
- Alias/pointer analysis
- Static speculation
Hardware Techniques
- Dynamic scheduling
- Dynamic issue (i.e., superscalar)
- Dynamic branch prediction
- Dynamic disambiguation
- Dynamic speculation
Wide variety of approaches, both hardware and compiler intensive
Software-intensive approaches: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware-intensive approaches: more stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: "ILP Mountain" (simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation) and "Cache Mountain" (simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching)]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: Performance (0.1 to 100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: Transistors per chip (1,000 to 100,000,000) versus year, 1970-2005, across the eras of bit-level, instruction-level and thread-level (?) parallelism; plotted processors: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]
Microprocessors: where they go
[Figure: Intel: more transistors]
[Figure: Intel: faster devices]
[Figure: Number of transistors in Intel's processors]
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to achieve higher performance (throughput) at better energy efficiency
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyper-Threading, is a solution
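The idea that a second thread can fill issue slots the first one leaves empty can be shown with a toy model. This is only a sketch under strong assumptions: the per-cycle counts of ready operations are invented, operations that do not fit in a cycle are dropped rather than carried over, and the names `issue_slots_used` and `utilization` are hypothetical.

```python
def issue_slots_used(ready_per_cycle, width):
    """Operations issued each cycle by a core with `width` issue slots.

    ready_per_cycle -- one entry per cycle; each entry lists how many
                       operations each hardware thread has ready that cycle.
    Operations that do not fit are dropped in this toy model
    (a real core would carry them over to a later cycle).
    """
    return [min(width, sum(ready)) for ready in ready_per_cycle]


def utilization(ready_per_cycle, width):
    """Fraction of all issue slots that were filled over the whole run."""
    used = issue_slots_used(ready_per_cycle, width)
    return sum(used) / (width * len(used))


WIDTH = 4  # the slide mentions cores issuing 4 or 6 operations per cycle

# Hypothetical ready-operation counts for two threads over four cycles.
thread_a = [2, 1, 3, 0]
thread_b = [1, 2, 0, 3]

single = utilization([[a] for a in thread_a], WIDTH)
smt = utilization([[a, b] for a, b in zip(thread_a, thread_b)], WIDTH)

if __name__ == "__main__":
    print(f"one thread : {single:.3f} of issue slots used")
    print(f"two threads: {smt:.3f} of issue slots used")
```

With these made-up numbers the two-thread case fills twice as many slots. Real SMT gains are smaller because threads also compete for caches, registers and execution units.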
[Figure: Parallel thread sequencing model]
[Figure: Principles of SMT]
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4)
Support for 2-4 threads, but expect to get only a 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute
- independent programs
- explicitly parallel programs (when possible)
- speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
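A few rows of the comparison table above can be totalled per chip to see how the three designs spend a similar hardware budget. The sketch below is illustrative only: the `Design` class and its field names are hypothetical, and only the numeric values are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    name: str
    cpus: int
    issue_width_per_cpu: int
    window_per_cpu: int
    l1_kbytes_per_cpu: int          # size of each of the I and D caches
    predictor_entries_per_cpu: int

    def totals(self):
        """Chip-wide totals obtained by multiplying per-CPU figures by CPU count."""
        return {
            "issue width": self.cpus * self.issue_width_per_cpu,
            "instruction window": self.cpus * self.window_per_cpu,
            "L1 kbytes (each of I, D)": self.cpus * self.l1_kbytes_per_cpu,
            "predictor entries": self.cpus * self.predictor_entries_per_cpu,
        }


# Values copied from the table; the grouping into a class is an assumption.
DESIGNS = [
    Design("superscalar", 1, 12, 256, 128, 32768),
    Design("SMT", 1, 12, 256, 128, 32768),
    Design("CMP", 8, 2, 32, 16, 4096),
]

if __name__ == "__main__":
    for design in DESIGNS:
        print(design.name, design.totals())
```

Read this way, the eight small CMP cores together match the wide cores in window size, first-level cache and predictor capacity, and slightly exceed them in total issue width.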
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view concerning the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education (not just the undergraduate curriculum) organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
But, as we said earlier, engineering is changing
What kinds of fundamentals we need now: some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT
Discrete mathematics, not continuous math, is the underpinning of IT
It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast
Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields
More knowledge of the full spectrum will be fundamental
Engineering is global and is performed in a holistic business context
The engineer must design under constraints that include global, cultural and business contexts, and so must understand them
These too are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
- Web-based teaching
- distance learning
- electronic books and
- interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai from the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
[Figure: The More Parallel the Better]
[Figure: Pipeline Datapath]
[Figure: Architecture Summary for a Simple]
Outline: Microprocessors' Generations
• First Generation 1971-78: Behind the power curve
• Second Generation 1979-85: Becoming "real" computers
• Third Generation 1985-89: Challenging the "establishment"
• Fourth Generation 1990-: Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation 1971-78
Getting enough bits and transistors
Transistor counts < 50,000
Performance < 0.5 MIPS
Architecture: 8-16 bits
- Narrow datapaths (= slow performance)
- Awkward architectures
- Assembly language + some BASIC
Processors
- Intel 4004, 8008, 8080, 8086
- Zilog Z-80
- Motorola 6800, MOS Technology 6502
Intel 4004
First general-purpose single-chip microprocessor
Shipped in 1971
4-bit architecture, 4-bit implementation
2300 transistors
Performance < 0.1 MIPS
8008: 8-bit implementation in 1972
- 3500 transistors
- First microprocessor-based computer (Micral)
  Targeted at laboratory instrumentation
  Mostly sold in Europe
Intel 8080
8-bit architecture with a 16-bit address space
- Delivered in 1974
- 4800 transistors
- Performance < 0.2 MIPS
Used in Altair 8800 system
- Kit form (advertised in Popular Electronics) in 1975
  $297, or $395 with case
  256 bytes of memory, expandable to 64K
  Keyboard and floppy
  100-line bus becomes S-100, the first microcomputer bus
- Gates & Allen write BASIC
- Wozniak builds one
Homebrew Computer Club
Intel 8086
Introduced in 1978
- Performance < 0.5 MIPS
New 16-bit architecture
- "Assembly language" compatible with 8080
- 29,000 transistors
- Includes memory protection, support for FP coprocessor
In 1981 IBM introduces the PC
- Based on 8088, an 8-bit bus version of 8086
Second Generation 1979-85
Becoming "real" computers
- First 32-bit architecture (68000)
- First virtual memory support
- Workstations, Macs and PCs based on microprocessors
Transistors > 50,000
Performance <= 1 MIPS
Processors
- Motorola 68000, 68020
- Intel 80286, 80386
Motorola 68000
Major architectural step in microprocessors
- First 32-bit architecture
  initial 16-bit implementation
- First flat 32-bit address
  Support for paging
- General-purpose register architecture
  Loosely based on PDP-11
First implementation in 1979
- 68,000 transistors
- < 1 MIPS
Used in
- Apple Mac
- Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
Challenging the "establishment"
- Microprocessors surpass minicomputers in performance, rival mainframes
- Implementation technology of choice: all new architectures are microprocessors
- RISC architecture techniques take hold
Transistors < 500K
Performance > 5 MIPS
Processors
- MIPS R2000, R3000
- Sun SPARC
- HP PA-RISC
MIPS R2000
Several firsts
- First RISC microprocessor
- First microprocessor to provide integrated support for instruction & data cache
- First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
- 125,000 transistors
- 5-8 MIPS
Fourth Generation 1990-
Architectural and performance leadership
- First 64-bit architecture
- First multiple-issue machine
- First multilevel caches
Transistors > 1M
Clock rates > 100 MHz
Performance > 50 MIPS
Processors
- Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
Generation 4.5: same basic approach, but faster clock rates & wider issue
- Alpha 21264, Pentium III & 4, Intel Itanium
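The transistor counts quoted on the generation slides can be turned into a rough doubling period. The sketch below is a back-of-the-envelope calculation only: the helper names `doubling_period_years` and `report` are hypothetical, and the data points are just the four counts that appear unambiguously on the slides (2300, 29000, 68000 and 125000 transistors).

```python
import math


def doubling_period_years(count_a, year_a, count_b, year_b):
    """Years per doubling, assuming steady exponential growth between two points."""
    doublings = math.log2(count_b / count_a)
    return (year_b - year_a) / doublings


# (processor, year, transistor count) as quoted on the generation slides.
SLIDE_DATA = [
    ("Intel 4004", 1971, 2300),
    ("Intel 8086", 1978, 29000),
    ("Motorola 68000", 1979, 68000),
    ("MIPS R2000", 1985, 125000),
]


def report(data):
    """Doubling period of every later processor relative to the first entry."""
    _, first_year, first_count = data[0]
    return [
        (name, doubling_period_years(first_count, first_year, count, year))
        for name, year, count in data[1:]
    ]


if __name__ == "__main__":
    for name, years in report(SLIDE_DATA):
        print(f"4004 -> {name}: one doubling every {years:.2f} years")
```

The result depends strongly on which two chips are compared, since the listed processors differ in design style as well as in process generation, so it should be read as an order-of-magnitude check rather than a measurement of Moore's law.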
Key Architectural TrendsKey Architectural Trends Increase performance at 16x per year
ndash True from 1985-present Combination of technology and architectural
enhancementsndash Technology provides faster transistors and more of themndash Faster transistors leads to high clock ratesndash More transistors
Architectural ideas turn transistors into performancendash Responsible for about half the yearly performance growth
Two key architectural directionsndash Sophisticated memory hierarchiesndash Exploiting instruction level parallelism
Memory HierarchiesMemory Hierarchies Caches hide latency of DRAM and increase BW
ndash CPU-DRAM access gap has grown by a factor of 30-50 Trend 1 Increasingly large caches
ndash On-chip from 128 bytes (1984) to 100K+ bytesndash Multilevel caches add another level of caching
First multilevel cache1986 Secondary cache sizes today 128KB to 4-16 MB
Trend 2 Advances in caching techniquesndash Reduce or hide cache miss latencies
early restart after cache miss (1992) nonblocking caches continue during a cache miss (1994)
ndash Cache aware combos computers compilers code writers prefetching instruction to bring data into cache early
Exploiting ILPExploiting ILP ILP is the implicit parallelism among instructions Exploited by
ndash Overlapping execution in a pipeline ndash Issuing multiple instruction per clock
superscalar uses dynamic issue decision (HW driven) VLIW uses static issue decision (SW driven)
1985 simple microprocessor pipeline (1 instrclock) 1990 first static multiple issue microprocessors 1995 sophisticated dynamic schemes
ndash determine parallelism dynamicallyndash execute instructions out-of-orderndash speculative execution depending on branch prediction
ldquoOff-the-shelfrdquo ILP techniques yielded 20 year path
MIPS R4000MIPS R4000 First 64-bit architecture Integrated caches
ndash On-chipndash Support for off-chip secondary
cache Integrated floating point Implemented in 1991
ndash Deep pipelinendash 14M transistorsndash Initially 100MHzndash gt 50 MIPS
Intel i860Intel i860
First multiple issue microprocessor ndash 2 instructionsclockndash Dual issue mode ndash Novel push pipelinendash Novel cache bypass
Implemented in 1991ndash 13M transistorsndash 50 mips
Used primarily as attached processor (eg graphics)
MIPS R10000MIPS R10000
First speculative processorndash Instruction scheduled and
executed out-of-orderndash Up to 4 instructions can
complete per clockndash Window of 32 instructions
(up to 32 in-flight)ndash Maintain precise state by
completing instructions in order
Implemented in 1996ndash 68M transistorsndash 200 MHz
Intel IA-64 and ItaniumIntel IA-64 and Itanium
EPIC architecturendash Use compiler centric approach
while avoiding disadvantagesndash Parallelism demarcated by the
compilerndash Many special instruction amp
features for exploiting ILP in the compiler
Itaniumndash First implementation (2001) ndash 25 M transistorsndash 800 MHzndash 130 Watts
Breakdown of tasks between Breakdown of tasks between compiler and runtime hardwarecompiler and runtime hardware
Todayrsquos Uniprocessor ILP Todayrsquos Uniprocessor ILP MenuMenu
Software Techniquesndash Static schedulingndash Static issue (ie VLIW)ndash Static branch predictionndash Aliaspointer analysisndash Static speculation
Hardware Techniquesndash Dynamic schedulingndash Dynamic issue (ie superscalar)ndash Dynamic branch predictionndash Dynamic disambiguationndash Dynamic speculation
Wide variety of approaches both hardware and compiler intensive
Lower hardware complexity
More longer range analysis
More machine dependence
Lower hardware complexity
More longer range analysis
More machine dependence
More stable performance
Higher complexity
Potential clock rate impact
More stable performance
Higher complexity
Potential clock rate impact
No clear cut winners at the present
Big Picture--ILP and Memory Big Picture--ILP and Memory SystemsSystems
Simplepipelining
Scheduledpipelines
Multipleissue
Dynamicschedulin
g
Speculation
My view
bullNo performance wall but steeper slopes ahead
bullEasier territory is behind us
bullIndustry-research gap vanished
bullEnergy efficiency may be key limit
ILPMountai
n
Multilevelcaches amp buffers
Critical word amp early restart
Compilerprefetchin
g
Multipathprefetchin
g
Simplecaches
CacheMountai
n
Per
form
ance
01
1
10
100
1965 1970 1975 1980 1985 1990 1995
Supercomputers
Minicomputers
Mainframes
Microprocessors
Microprocessors today where Microprocessors today where they are and what can dothey are and what can do
Tran
sist
ors
1000
10000
100000
1000000
10000000
100000000
1970 1975 1980 1985 1990 1995 2000 2005
Bit-level parallelism Instruction-level Thread-level ()
i4004
i8008i8080
i8086
i80286
i80386
R2000
Pentium
R10000
R3000
Microprocessors where they goMicroprocessors where they go
Intel more TransistorIntel more Transistor
Intel Faster DevicesIntel Faster Devices
Number of Transistors in Number of Transistors in Intelrsquos processorsIntelrsquos processors
Higher level parallelismHigher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The more prononuced are
a) simultaneons multithreaded (SMT) processor and
b) chip multiprocessors (CMT)
MultithreadingMultithreading
Microprocessor can execute multiple operations at a time4 or 6 operations per cycle
Hard to achieve this level of parallelism from single program
Can we run multiple programs (threads) on (single) processor without much effort
Simultaneous multithreading (SMT) or Hyperthreading is a solution
Parallel Thread Sequencing ModelParallel Thread Sequencing Model
Principles of SMTPrinciples of SMT
Multithreading in todayrsquos Multithreading in todayrsquos processorsprocessors
Today many high-end microprocessors are multithreaded (eg Intel Pentium 4)
Support for 2-4 threads but expect to get only 13X improvement in throughput
Chip MultiprocessorChip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model
CMP is a simple very powerful techiniques to obtain more performance in a power-effecient manner
The idea is to put several microprocessors on a single die This type of architecture is reffered also as Multiprocessor System-on-Chip (MPSoC)
The performance of small-scale CMP scales close to linear with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model - continue - continue
CMP is an atractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications we meet in network processors multimedia hubs signal processors etc
MPSoCs are usually implemented as heterogenous systems
CMT and SMT can coexist-a CMP die can integrate several SMT microprocessors
Generic circa 2010 Generic circa 2010 MicroprocessorMicroprocessor
4 ndash 8 general-purpose processing engines on chip used to execute independent programs
Explicitly parallel programs (when possible) Speculatively parallel threads
Special-purpose processing units (eg DSP functionality)
Elaborate memory hierarchyElaborate inter-chip communication facilities
Characteristics of superscalar simultaneous Characteristics of superscalar simultaneous multithreading and chip multiprocessor architecturesmultithreading and chip multiprocessor architectures
Characteristic SuperscalarSimultaneousmultithreading
Chipmultiprocessor
Number of CPUs 1 1 8CPU issue width 12 12 2 per CPUNumber of threads 1 8 1 per CPUArchitecture registers (for integer and floating point) 32 32 per thread 32 per CPUPhysical registers (for integer and floating point 32 + 256 256 + 256 32 + 32 per CPUInstruction window size 256 256 32 per CPUBranch predictor table size (entries) 32768 32768 8x4096Return stack size 64 entries 64 entries 8x8 entriesInstruction (I) and data (D) cache organization 1x8 banks 1x8 banks 1 bankI and D cache sizes 128 kbytes 128 kbytes 16 kbytes per CPUI and D cache associativities 4-way 4-way 4-wayI and 0 cache line sizes (bytes) 32 32 32I and P cache access times (cycles) 2 2 1Secondary cache organization (Mbytes) 1x8 banks 1x8 banks 1x8 banksSecondary cache size (bytes) 8 8 8Secondary cache associativity 4-way 4-way 4-waySecondary cache line size (bytes) 32 32 32Secondary cache access time (cycles) 5 5 7Secondary cache occupancy per access (cycles) 1 1 1Memory organization (no of banks) 4 4 4Memory access time (cycles) 50 50 50Memory occupancy per access (cycles) 13 13 13
The microprocessor tomorrowThe microprocessor tomorrow When we say ldquomicroprocessorrdquo tomorrow we generally mean the shaded area of the figure
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Challenges in EducationChallenges in Education
bullChanges in curriculaChanges in curriculabullFundamentalsFundamentalsbullA sort of the challenge we should acceptA sort of the challenge we should accept
ChalengeChalengess in Education in Education
It has often said that Where you stand depends on where you sit
In this context starting from our positions an experiences this is our view concerning the theme How shall we satisfy the long-term educational needs of engineers
How to organize a training of new How to organize a training of new engineers engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then
Is the whole system of engineering education ndash not just the undergraduate curriculum ndash organized to support todayrsquos graduate for the next 40 years
We think not on both counts
Our view amp our experienceOur view amp our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academy and industry) which admittedly has changed more rapidly than same other fields
Changes in curriculaChanges in curricula
It is almost a clicheacute to talk about change ndash so mach so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentalsWhat are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we What kinds of fundamentals we need now ndash some examplesneed now ndash some examples
Information technology (IT) will be embedded in virtually engineered product and process in the future ndash ie the design space for all engineers will include ITDiscrete mathematics not continuous math is the underpinning of ITIt is a new fundamental
Biological materials and process are a bit behind IT in their impact on engineering but they a closing fastThus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of FundamentalsKinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fieldsMore knowledge of the full spectrum will be the fundamental
Engineering is global and is performed in a holistic business contextThe engineer must design under constraints that include global cultural and business contexts and so must understand themThey two are new fundamentals
How to add these new fundamentalsHow to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of What will the character and essence of electrical and computer engineering electrical and computer engineering
education look like in the future education look like in the future
It is difficult to predict the future with any accuracy but it is safe to say that
Web-based teachingdistance learningelectronic books and
interactive learning environments
will play increasingly significant roles in shaping what we teach how we teach and how students learn
A sort of challenge we should acceptA sort of challenge we should accept
During one visit at our faculty Prof Krishna Shenai from University of Illinois of Chicago director of Micro Systems Research Center says to us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Pipeline DatapathPipeline Datapath
Architecture Summary for a SimpleArchitecture Summary for a Simple
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Microprocessorsrsquo Microprocessorsrsquo GenerationsGenerations
bullFirst generation 1971-78
bullSecond Generation 1979-85
bullThird Generation 1985-89
bullFourth Generation 1990-
ndashBehind the power curve
ndashBecoming ldquorealrdquo computers
ndashChallenging the ldquoestablishmentrdquo
ndashArchitectural and performance leadership
The microprocessor todayThe microprocessor today When we say ldquomicroprocessorrdquo today we generally mean the shaded area of the figure
The First Generation 1971-78The First Generation 1971-78
Getting enough bits and transistors Transistor counts lt 50000 Performance lt 05 MIPS Architecture 8-16 bits
ndash Narrow datapaths (= slow performance)ndash Awkward architecturesndash Assembly language + some BASIC
Processorsndash Intel 4004 8008 8080 8086ndash Zilog Z-80ndash Motorola 6800 6502
Intel 4004Intel 4004 First general-purpose
single-chip microprocessor Shipped in 1971 8-bit architecture 4-bit
implementation 2300 transistors Performance lt 01 MIPS 8008 8-bit implementation
in 1972ndash 3500 transistorsndash First microprocessor-based
computer (Micral) Targeted at laboratory
instrumentation Mostly sold in Europe
Intel 8080Intel 8080 Intelrsquos first 16-bit architecture
ndash Delivered in 1974ndash 4800 transistorsndash Performance lt 02 MIPS
Used in Altair 8800 system ndash Kit form (advertised in Popular
Electronics) in 1975 $297 or $395 with case 256 bytes of memory
expandable to 64K Keyboard and floppy 100-line bus becomes S-100
first microcomputer busndash Gates amp Allen write BASICndash Wozniak builds one
Homebrew Computer Club
Intel 8086Intel 8086
Introduced in 1978ndash Performance lt 05 MIPS
New 16-bit architecturendash ldquoAssembly languagerdquo
compatible with 8080ndash 29000 transistorsndash Includes memory protection
support for FP coprocessor In 1981 IBM introduces
PC ndash Based on 8088--8-bit bus
version of 8086
Second Generation 1979-85Second Generation 1979-85 Becoming ldquorealrdquo computers
ndash First 32-bit architecture (68000)ndash First virtual memory supportndash Workstations Macs and PCs based on microprocessors
Transistors gt50000 Performance lt= 1 MIPS Processors
ndash Motorola 68000 68020ndash Intel 80286 80386
Motorola 68000Motorola 68000 Major architectural step in
microprocessorsndash First 32-bit architecture
initial 16-bit implementation
ndash First flat 32-bit address Support for paging
ndash General-purpose register architecture
Loosely based on PDP-11
First implementation in 1979ndash 68000 transistorsndash lt 1 MIPS
Used inndash Apple Macndash Sun Silicon Graphics amp Apollo
workstations
Third Generation 1985-89Third Generation 1985-89 Challenging the ldquoestablishmentrdquo
ndash Microprocessors surpass minicomputers in performance rival mainframes
ndash Implementation technology of choice all new architectures are microprocessors
ndash RISC architecture techniques take hold Transistors lt 500K Performance gt 5 MIPS Processors
ndash MIPS R2000 R3000ndash Sun SPARCndash HP PA-RISC
MIPS R2000
Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
– 125000 transistors
– 5-8 MIPS
Fourth Generation 1990-
Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
Transistors > 1M
Clock rates > 100 MHz
Performance > 50 MIPS
Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
Generation 4.5: same basic approach, but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
Performance increases at 1.6x per year
– True from 1985 to the present
Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to high clock rates
– More transistors: architectural ideas turn transistors into performance
– Architecture is responsible for about half the yearly performance growth
Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism
Memory Hierarchies
Caches hide the latency of DRAM and increase bandwidth
– CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching
– First multilevel cache: 1986
– Secondary cache sizes today: 128 KB to 4-16 MB
Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
– Early restart after cache miss (1992)
– Nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
– Prefetching: instruction to bring data into cache early
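The benefit of adding a second cache level can be illustrated with the usual average memory access time (AMAT) formula. This is a minimal sketch: the hit times, miss rates and memory latency below are hypothetical values chosen for the example, not figures from the slides.

```python
def amat_one_level(hit_time: float, miss_rate: float, memory_time: float) -> float:
    """AMAT in cycles with a single cache in front of memory."""
    return hit_time + miss_rate * memory_time


def amat_two_level(l1_hit: float, l1_miss_rate: float,
                   l2_hit: float, l2_local_miss_rate: float,
                   memory_time: float) -> float:
    """AMAT in cycles with a secondary cache between L1 and memory.

    l2_local_miss_rate is the fraction of L1 misses that also miss in L2.
    """
    l1_miss_penalty = l2_hit + l2_local_miss_rate * memory_time
    return l1_hit + l1_miss_rate * l1_miss_penalty


if __name__ == "__main__":
    # Hypothetical parameters: 1-cycle L1, 5% L1 misses, 10-cycle L2,
    # 20% of L1 misses also miss in L2, 100-cycle memory.
    print(amat_one_level(1, 0.05, 100))
    print(amat_two_level(1, 0.05, 10, 0.2, 100))
```

With these made-up numbers the secondary cache cuts the average access time from about 6 cycles to about 2.5 cycles, which is the sense in which multilevel caches hide DRAM latency.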
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
– Superscalar: uses dynamic issue decision (HW driven)
– VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out-of-order
– speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded 20 year path
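The interplay between issue width and data dependences can be sketched with a toy scheduler. The model below is an illustration only: it assumes unit latency for every instruction, considers only true (read-after-write) dependences as if registers were renamed, and lets instructions issue out of program order. The instruction encoding and the sample program are hypothetical.

```python
from typing import Dict, List, Tuple

# (destination register, source registers); a hypothetical encoding for this sketch
Instr = Tuple[str, Tuple[str, ...]]


def issue_cycles(program: List[Instr], issue_width: int) -> List[int]:
    """Earliest issue cycle of each instruction under the toy model above."""
    ready_at: Dict[str, int] = {}   # register -> first cycle its value is available
    used: Dict[int, int] = {}       # cycle -> number of issue slots already taken
    cycles: List[int] = []
    for dest, srcs in program:
        # An instruction cannot issue before all of its operands are ready.
        cycle = max((ready_at.get(s, 0) for s in srcs), default=0)
        # It also needs a free issue slot in that cycle.
        while used.get(cycle, 0) >= issue_width:
            cycle += 1
        used[cycle] = used.get(cycle, 0) + 1
        ready_at[dest] = cycle + 1  # unit latency
        cycles.append(cycle)
    return cycles


def ipc(program: List[Instr], issue_width: int) -> float:
    """Instructions per clock achieved by the schedule."""
    return len(program) / (max(issue_cycles(program, issue_width)) + 1)


PROGRAM: List[Instr] = [
    ("r1", ()),            # load a
    ("r2", ()),            # load b
    ("r3", ("r1", "r2")),  # r3 = r1 + r2
    ("r4", ()),            # load c
    ("r5", ("r3", "r4")),  # r5 = r3 * r4
    ("r6", ()),            # load d
]

if __name__ == "__main__":
    for width in (1, 2, 4):
        print(width, issue_cycles(PROGRAM, width), ipc(PROGRAM, width))
```

In this sample, going from one to two issue slots doubles the instructions per clock, while going from two to four gains nothing, because the chain r1/r2 -> r3 -> r5 fixes a minimum of three cycles. That is the parallelism limit that wider issue alone cannot remove.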
MIPS R4000
First 64-bit architecture
Integrated caches
– On-chip
– Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
Implemented in 1991
– 1.3M transistors
– 50 MIPS
Used primarily as attached processor (e.g. graphics)
MIPS R10000
First speculative processor
– Instructions scheduled and executed out-of-order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in-flight)
– Maintains precise state by completing instructions in order
Implemented in 1996
– 6.8M transistors
– 200 MHz
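The idea of executing out of order while completing in order can be sketched with a generic reorder buffer. This is a simplified illustration and not a description of the R10000's actual hardware; the class name, method names and default sizes (32 entries, 4 completions per clock, taken from the figures above) are choices made for the example.

```python
from collections import deque
from typing import Deque, Hashable, List, Set


class ReorderBuffer:
    """Toy reorder buffer: instructions may finish in any order but commit in program order."""

    def __init__(self, capacity: int = 32, commit_width: int = 4) -> None:
        self.capacity = capacity
        self.commit_width = commit_width
        self.entries: Deque[Hashable] = deque()  # in-flight instructions, oldest first
        self.done: Set[Hashable] = set()         # in-flight instructions that have executed

    def dispatch(self, tag: Hashable) -> bool:
        """Add an instruction in program order; returns False if the window is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(tag)
        return True

    def finish(self, tag: Hashable) -> None:
        """Mark an in-flight instruction as executed (possibly out of order)."""
        self.done.add(tag)

    def commit(self) -> List[Hashable]:
        """Retire up to commit_width executed instructions from the head, in order."""
        committed: List[Hashable] = []
        while (self.entries and len(committed) < self.commit_width
               and self.entries[0] in self.done):
            tag = self.entries.popleft()
            self.done.discard(tag)
            committed.append(tag)
        return committed


if __name__ == "__main__":
    rob = ReorderBuffer()
    for tag in range(6):
        rob.dispatch(tag)
    rob.finish(2)
    rob.finish(1)
    print(rob.commit())  # nothing retires while instruction 0 is still executing
    rob.finish(0)
    print(rob.commit())  # instructions 0, 1 and 2 retire together, in program order
```

Because nothing retires past an unfinished older instruction, the architectural state at any commit point corresponds to a sequential execution, which is what makes exceptions and mispredicted branches recoverable.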
Intel IA-64 and Itanium
EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software techniques
– Static scheduling
– Static issue (i.e. VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
Characteristics: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware techniques
– Dynamic scheduling
– Dynamic issue (i.e. superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
Characteristics: more stable performance, higher complexity, potential clock rate impact
Wide variety of approaches, both hardware and compiler intensive
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1000 to 100000000) versus year, 1970 to 2005, for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000; successive eras are labeled bit-level, instruction-level and thread-level parallelism.]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
They support 2-4 threads, but expect only about a 1.3x improvement in throughput
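Why a second thread raises throughput, but by less than 2x, can be seen in a toy issue-slot model. Everything here is an assumption made for illustration: the per-cycle counts of ready operations are invented, operations that fail to issue are not carried over to later cycles, and the threads share one 4-wide issue stage.

```python
from typing import List


def slots_used_single(ready_ops: List[int], issue_width: int) -> int:
    """Issue slots filled over the trace when one thread runs alone."""
    return sum(min(ops, issue_width) for ops in ready_ops)


def slots_used_smt(ready_a: List[int], ready_b: List[int], issue_width: int) -> int:
    """Issue slots filled when two threads share the same issue stage each cycle."""
    return sum(min(a + b, issue_width) for a, b in zip(ready_a, ready_b))


if __name__ == "__main__":
    WIDTH = 4
    thread_a = [4, 1, 0, 2]  # hypothetical ready operations per cycle
    thread_b = [2, 0, 3, 1]
    alone = slots_used_single(thread_a, WIDTH)
    shared = slots_used_smt(thread_a, thread_b, WIDTH)
    total = WIDTH * len(thread_a)
    print(f"thread A alone: {alone}/{total} slots")
    print(f"A and B under SMT: {shared}/{total} slots")
```

The second thread fills slots that the first leaves idle, yet in cycles where both threads have work they compete for the same slots, so the combined throughput stays below the sum of the two threads run separately. Real gains also depend on shared caches and other resources that this sketch ignores.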
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute
– Independent programs
– Explicitly parallel programs (when possible)
– Speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
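One way to read the table is that the chip multiprocessor column splits roughly the same hardware budget eight ways. The short sketch below multiplies the per-CPU figures by the CPU count and compares them with the superscalar column; the dictionary keys are names chosen for this example, and the values are copied from the table.

```python
from typing import Dict

CMP_CPUS = 8

# Per-CPU values from the chip multiprocessor column of the table.
CMP_PER_CPU: Dict[str, int] = {
    "issue_width": 2,
    "instruction_window": 32,
    "branch_predictor_entries": 4096,
    "return_stack_entries": 8,
    "l1_cache_kbytes": 16,
}

# Whole-chip values from the superscalar column of the table.
SUPERSCALAR: Dict[str, int] = {
    "issue_width": 12,
    "instruction_window": 256,
    "branch_predictor_entries": 32768,
    "return_stack_entries": 64,
    "l1_cache_kbytes": 128,
}


def chip_totals(per_cpu: Dict[str, int], cpus: int) -> Dict[str, int]:
    """Aggregate a per-CPU resource budget over all CPUs on the die."""
    return {name: value * cpus for name, value in per_cpu.items()}


if __name__ == "__main__":
    totals = chip_totals(CMP_PER_CPU, CMP_CPUS)
    for name, value in totals.items():
        print(f"{name}: CMP total {value}, superscalar {SUPERSCALAR[name]}")
```

The window, predictor, return stack and first-level cache totals match the superscalar design exactly, and only the aggregate issue width differs (16 against 12), so the comparison is essentially between one wide core and eight narrow ones built from the same resources.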
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It is often said that "Where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view concerning the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
But, as we said earlier, engineering is changing
What kinds of fundamentals we need now: some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT
Discrete mathematics, not continuous math, is the underpinning of IT
It is a new fundamental
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast
Thus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields
More knowledge of the full spectrum will be a fundamental
Engineering is global and is performed in a holistic business context
The engineer must design under constraints that include global, cultural and business contexts, and so must understand them
These too are new fundamentals
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current, cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future
It is difficult to predict the future with any accuracy, but it is safe to say that
– Web-based teaching
– distance learning
– electronic books and
– interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Architecture Summary for a Simple
Outline: Microprocessors' Generations
• First Generation 1971-78: Behind the power curve
• Second Generation 1979-85: Becoming "real" computers
• Third Generation 1985-89: Challenging the "establishment"
• Fourth Generation 1990-: Architectural and performance leadership
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure
The First Generation 1971-78
Getting enough bits and transistors
Transistor counts < 50000
Performance < 0.5 MIPS
Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
First general-purpose, single-chip microprocessor
Shipped in 1971
8-bit architecture, 4-bit implementation
2300 transistors
Performance < 0.1 MIPS
8008: 8-bit implementation in 1972
– 3500 transistors
– First microprocessor-based computer (Micral)
Targeted at laboratory instrumentation
Mostly sold in Europe
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Microprocessorsrsquo Microprocessorsrsquo GenerationsGenerations
bullFirst generation 1971-78
bullSecond Generation 1979-85
bullThird Generation 1985-89
bullFourth Generation 1990-
ndashBehind the power curve
ndashBecoming ldquorealrdquo computers
ndashChallenging the ldquoestablishmentrdquo
ndashArchitectural and performance leadership
The microprocessor todayThe microprocessor today When we say ldquomicroprocessorrdquo today we generally mean the shaded area of the figure
The First Generation, 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
– Narrow datapaths (= slow performance)
– Awkward architectures
– Assembly language + some BASIC
• Processors
– Intel 4004, 8008, 8080, 8086
– Zilog Z-80
– Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
– 3500 transistors
– First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• 8-bit architecture with a 16-bit address bus
– Delivered in 1974
– 4800 transistors
– Performance < 0.2 MIPS
• Used in Altair 8800 system
– Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, the first microcomputer bus
– Gates & Allen write BASIC
– Wozniak builds one
    Homebrew Computer Club
Intel 8086
• Introduced in 1978
– Performance < 0.5 MIPS
• New 16-bit architecture
– "Assembly language" compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
– Based on 8088, the 8-bit bus version of 8086
Second Generation, 1979-85
• Becoming "real" computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs, and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
– First 32-bit architecture
    Initial 16-bit implementation
– First flat 32-bit address
    Support for paging
– General-purpose register architecture
    Loosely based on PDP-11
• First implementation in 1979
– 68,000 transistors
– < 1 MIPS
• Used in
– Apple Mac
– Sun, Silicon Graphics, & Apollo workstations
Third Generation, 1985-89
• Challenging the "establishment"
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
• Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation, 1990-
• Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5 – same basic approach but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
– True from 1985 to the present
• Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to higher clock rates
– More transistors
• Architectural ideas turn transistors into performance
– Responsible for about half the yearly performance growth
• Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism
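The growth rate quoted on this slide can be turned into a doubling time with a little arithmetic. The sketch below is only an illustration: it takes the annual factor as 1.6x (the slide's figure read with its decimal point), and the function names are hypothetical, not from the presentation.

```python
import math


def doubling_time_years(annual_factor: float) -> float:
    """Years needed for performance to double at a constant annual growth factor."""
    return math.log(2) / math.log(annual_factor)


def cumulative_growth(annual_factor: float, years: float) -> float:
    """Total performance multiplier after the given number of years."""
    return annual_factor ** years


if __name__ == "__main__":
    factor = 1.6  # assumed reading of the slide's yearly improvement
    print(f"doubling time: {doubling_time_years(factor):.2f} years")
    for span in (5, 10, 15):
        print(f"after {span} years: {cumulative_growth(factor, span):.0f}x")
```

At 1.6x per year the doubling time works out to roughly a year and a half, which is why a sustained rate like this compounds into several orders of magnitude over the period the slide covers.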
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
– CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
    Early restart after cache miss (1992)
    Nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
    Prefetching: instruction to bring data into cache early
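How much latency a cache hierarchy hides can be estimated with the standard average-memory-access-time recurrence (hit time plus miss rate times the cost of going one level further out). The following is a minimal sketch under stated assumptions: the latencies of 2, 5, and 50 cycles echo the comparison table later in this deck, while the miss rates of 5% and 20% are purely hypothetical values chosen for illustration.

```python
from typing import List, Tuple


def average_access_time(levels: List[Tuple[float, float]], memory_cycles: float) -> float:
    """Average memory access time in cycles.

    levels lists (hit_cycles, miss_rate) pairs from the cache closest to the
    CPU outward; memory_cycles is the DRAM access time behind the last cache.
    """
    time = float(memory_cycles)
    # Work from the outermost level inward: each level pays its own hit time
    # and, on a miss, the average time of everything behind it.
    for hit_cycles, miss_rate in reversed(levels):
        time = hit_cycles + miss_rate * time
    return time


if __name__ == "__main__":
    # Hypothetical miss rates; latencies follow the deck's comparison table.
    two_level = [(2, 0.05), (5, 0.20)]
    print(average_access_time(two_level, 50))  # two cache levels in front of DRAM
    print(average_access_time([], 50))         # no caches: every access pays DRAM latency
```

Under these assumed miss rates the average access costs only a few cycles instead of the full DRAM latency, which is the effect the slide summarizes as "caches hide latency of DRAM".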
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
    Superscalar: uses dynamic issue decision (HW driven)
    VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
– Determine parallelism dynamically
– Execute instructions out-of-order
– Speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded a 20-year path
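The "implicit parallelism among instructions" can be made concrete by comparing the number of instructions in a block with the length of its longest dependence chain. The sketch below is an idealized illustration, not a model of any processor named in this deck: it assumes unit latency for every instruction and unlimited issue width, and the instruction names and dependence graph are hypothetical.

```python
from typing import Dict, List


def critical_path_length(deps: Dict[str, List[str]]) -> int:
    """Length of the longest dependence chain, counted in instructions.

    deps maps each instruction to the instructions whose results it needs.
    """
    depth: Dict[str, int] = {}

    def visit(instr: str) -> int:
        if instr not in depth:
            # An instruction can start one step after its slowest producer.
            depth[instr] = 1 + max((visit(d) for d in deps[instr]), default=0)
        return depth[instr]

    return max(visit(i) for i in deps)


def available_ilp(deps: Dict[str, List[str]]) -> float:
    """Instructions per cycle an ideal machine could sustain on this block."""
    return len(deps) / critical_path_length(deps)


# Hypothetical six-instruction block: two independent chains that merge at i6.
EXAMPLE = {
    "i1": [],
    "i2": [],
    "i3": ["i1", "i2"],
    "i4": [],
    "i5": ["i4"],
    "i6": ["i3", "i5"],
}

if __name__ == "__main__":
    print(critical_path_length(EXAMPLE), available_ilp(EXAMPLE))
```

A single-issue pipeline would need at least six cycles for this block, while a machine that issues two or more instructions per clock could finish the same work in three, whether the grouping is found by hardware (superscalar) or by the compiler (VLIW).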
MIPS R4000
• First 64-bit architecture
• Integrated caches
– On-chip
– Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
• First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
• Implemented in 1991
– 1.3M transistors
– 50 MIPS
• Used primarily as attached processor (e.g., graphics)
MIPS R10000
• First speculative processor
– Instructions scheduled and executed out-of-order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in-flight)
– Maintain precise state by completing instructions in order
• Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
• Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
[Figure: breakdown of tasks between compiler and runtime hardware]
Today's Uniprocessor ILP Menu
• Software Techniques
– Static scheduling
– Static issue (i.e., VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
    Lower hardware complexity
    More longer-range analysis
    More machine dependence
• Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e., superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
    More stable performance
    Higher complexity
    Potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture -- ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching, simple caches.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (0.1 to 100, log scale) versus year, 1965-1995, for supercomputers, mainframes, minicomputers, and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (1,000 to 100,000,000, log scale) versus year, 1970-2005, marking the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000, with eras labeled bit-level parallelism, instruction-level parallelism, and thread-level parallelism (?).]
Microprocessors: where they go
[Figure: Intel, more transistors]
[Figure: Intel, faster devices]
[Figure: number of transistors in Intel's processors]
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to achieve higher performance (throughput) at better energy efficiency.
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
• A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
• It is hard to achieve this level of parallelism from a single program
• Can we run multiple programs (threads) on a (single) processor without much effort?
• Simultaneous multithreading (SMT), or Hyperthreading, is a solution
[Figure: Parallel Thread Sequencing Model]
[Figure: Principles of SMT]
Multithreading in today's processors
• Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4)
• Support for 2-4 threads, but expect to get only a 1.3X improvement in throughput
Chip Multiprocessor
• Several processor cores in one die
• Shared L2 caches
• Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
• CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner
• The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
• The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model - continued
• CMP is an attractive option to use when moving to a new process technology such as SoC
• Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
• MPSoCs are usually implemented as heterogeneous systems
• CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
• 4 – 8 general-purpose processing engines on chip, used to execute independent programs
• Explicitly parallel programs (when possible)
• Speculatively parallel threads
• Special-purpose processing units (e.g., DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
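One way to read the comparison is to total the per-CPU resources for each design. The sketch below does that for a few rows, using only the numbers given in the table above; the `Design` class and its field names are hypothetical helpers introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """A few rows of the comparison table, expressed per CPU."""
    name: str
    cpus: int
    issue_width_per_cpu: int
    threads_per_cpu: int
    l1_kbytes_per_cpu: int  # the "I and D cache sizes" row

    def total_issue_width(self) -> int:
        return self.cpus * self.issue_width_per_cpu

    def total_threads(self) -> int:
        return self.cpus * self.threads_per_cpu

    def total_l1_kbytes(self) -> int:
        return self.cpus * self.l1_kbytes_per_cpu


DESIGNS = [
    Design("superscalar", cpus=1, issue_width_per_cpu=12, threads_per_cpu=1, l1_kbytes_per_cpu=128),
    Design("SMT", cpus=1, issue_width_per_cpu=12, threads_per_cpu=8, l1_kbytes_per_cpu=128),
    Design("CMP", cpus=8, issue_width_per_cpu=2, threads_per_cpu=1, l1_kbytes_per_cpu=16),
]

if __name__ == "__main__":
    for d in DESIGNS:
        print(d.name, d.total_issue_width(), d.total_threads(), d.total_l1_kbytes())
```

Summed this way, the chip multiprocessor carries the same total first-level cache capacity and the same number of hardware threads as the SMT design, but splits them across eight narrow cores rather than one wide one.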
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education – not just the undergraduate curriculum – organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change – so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now – some examples
• Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
• Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
• Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
• Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. They, too, are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that Web-based teaching, distance learning, electronic books, and interactive learning environments will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
The microprocessor today
When we say "microprocessor" today, we generally mean the shaded area of the figure.
The First Generation 1971-78
• Getting enough bits and transistors
• Transistor counts < 50,000
• Performance < 0.5 MIPS
• Architecture: 8-16 bits
  - Narrow datapaths (= slow performance)
  - Awkward architectures
  - Assembly language + some BASIC
• Processors
  - Intel 4004, 8008, 8080, 8086
  - Zilog Z-80
  - Motorola 6800, MOS Technology 6502
Intel 4004
• First general-purpose single-chip microprocessor
• Shipped in 1971
• 4-bit architecture (4-bit data path, 8-bit instruction words)
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  - 3,500 transistors
  - First microprocessor-based computer (Micral)
    Targeted at laboratory instrumentation
    Mostly sold in Europe
Intel 8080
• 8-bit architecture with 16-bit addressing
  - Delivered in 1974
  - 4,800 transistors
  - Performance < 0.2 MIPS
• Used in the Altair 8800 system
  - Kit form (advertised in Popular Electronics) in 1975
    $297, or $395 with case
    256 bytes of memory, expandable to 64K
    Keyboard and floppy
    100-line bus becomes S-100, the first microcomputer bus
  - Gates & Allen write BASIC
  - Wozniak builds one (Homebrew Computer Club)
Intel 8086
• Introduced in 1978
  - Performance < 0.5 MIPS
• New 16-bit architecture
  - "Assembly language" compatible with the 8080
  - 29,000 transistors
  - Segmented memory addressing (hardware memory protection arrived later, with the 80286)
  - Support for an FP coprocessor
• In 1981 IBM introduces the PC
  - Based on the 8088, an 8-bit bus version of the 8086
Second Generation 1979-85
• Becoming "real" computers
  - First 32-bit architecture (68000)
  - First virtual memory support
  - Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  - Motorola 68000, 68020
  - Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  - First 32-bit architecture, with an initial 16-bit implementation
  - First flat 32-bit address; support for paging
  - General-purpose register architecture, loosely based on the PDP-11
• First implementation in 1979
  - 68,000 transistors
  - < 1 MIPS
• Used in
  - Apple Mac
  - Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the "establishment"
  - Microprocessors surpass minicomputers in performance, rival mainframes
  - Implementation technology of choice: all new architectures are microprocessors
  - RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  - MIPS R2000, R3000
  - Sun SPARC
  - HP PA-RISC
MIPS R2000
• Several firsts
  - First RISC microprocessor
  - First microprocessor to provide integrated support for instruction & data caches
  - First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  - 125,000 transistors
  - 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
  - First 64-bit architecture
  - First multiple-issue machine
  - First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  - Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  - Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  - True from 1985 to present
• Combination of technology and architectural enhancements
  - Technology provides faster transistors and more of them
  - Faster transistors lead to higher clock rates
  - More transistors: architectural ideas turn transistors into performance
  - Architecture is responsible for about half the yearly performance growth
• Two key architectural directions
  - Sophisticated memory hierarchies
  - Exploiting instruction-level parallelism
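To get a feel for what a steady yearly growth factor means, the short sketch below compounds a rate of 1.6x per year over several time spans. It also splits the yearly factor into two equal multiplicative shares, which is one possible reading (an assumption, not something the slide states) of "about half the yearly performance growth" coming from architecture. The function names are illustrative only.

```python
def cumulative_speedup(rate_per_year: float, years: int) -> float:
    """Total speedup after compounding a yearly growth factor."""
    return rate_per_year ** years


def equal_share(rate_per_year: float, sources: int = 2) -> float:
    """Yearly factor per source if growth is split equally in multiplicative terms.

    Assumption: "half the growth" is interpreted as half of the logarithm,
    so two sources each contribute sqrt(rate) per year.
    """
    return rate_per_year ** (1.0 / sources)


if __name__ == "__main__":
    rate = 1.6  # yearly performance growth factor taken from the slide
    for years in (1, 5, 10, 15):
        print(f"{years:2d} years: {cumulative_speedup(rate, years):8.1f}x")
    print(f"technology or architecture alone: {equal_share(rate):.3f}x per year")
```

Under this reading, technology and architecture would each have to deliver roughly a quarter more performance every year to sustain the combined rate.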
Memory Hierarchies
• Caches hide the latency of DRAM and increase bandwidth
  - CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  - On-chip: from 128 bytes (1984) to 100K+ bytes
  - Multilevel caches add another level of caching
    First multilevel cache: 1986
    Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  - Reduce or hide cache miss latencies
    Early restart after cache miss (1992)
    Nonblocking caches: continue during a cache miss (1994)
  - Cache-aware combos: computers, compilers, code writers
    Prefetching: instruction to bring data into cache early
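The benefit of adding a second cache level can be illustrated with the standard textbook average memory access time (AMAT) model. The slide itself does not give this formula, and every latency and miss rate below is a hypothetical value chosen only for the arithmetic.

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time: hit time plus the expected miss cost."""
    return hit_time + miss_rate * miss_penalty


def two_level_amat(l1_hit: float, l1_miss_rate: float,
                   l2_hit: float, l2_local_miss_rate: float,
                   mem_time: float) -> float:
    """AMAT when an L1 miss is served by an L2 cache backed by DRAM."""
    l1_miss_penalty = amat(l2_hit, l2_local_miss_rate, mem_time)
    return amat(l1_hit, l1_miss_rate, l1_miss_penalty)


if __name__ == "__main__":
    # Hypothetical numbers (cycles and miss rates), for illustration only.
    single = amat(hit_time=1, miss_rate=0.05, miss_penalty=100)
    double = two_level_amat(l1_hit=1, l1_miss_rate=0.05,
                            l2_hit=10, l2_local_miss_rate=0.20,
                            mem_time=100)
    print(f"L1 only : {single:.2f} cycles per access")
    print(f"L1 + L2 : {double:.2f} cycles per access")
```

With these assumed values the second level cuts the average access time by more than half, which is the effect the "multilevel caches" trend relies on.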
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  - Overlapping execution in a pipeline
  - Issuing multiple instructions per clock
    Superscalar: uses dynamic issue decision (HW driven)
    VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  - Determine parallelism dynamically
  - Execute instructions out-of-order
  - Speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded a 20-year path
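How much multiple issue helps depends on the dependences between instructions. The toy model below is a simplification assumed for illustration (unit latency, perfect out-of-order issue, no branches or memory stalls), not a description of any real processor. It counts the cycles needed to run a small instruction sequence at different issue widths; the dependence list is a made-up example.

```python
def cycles_needed(deps: list[list[int]], issue_width: int) -> int:
    """Cycles to execute a sequence, issuing the oldest ready instructions first.

    deps[i] lists the earlier instructions whose results instruction i needs.
    Every instruction takes one cycle; a result is usable in the next cycle.
    """
    if issue_width < 1:
        raise ValueError("issue_width must be at least 1")
    done_cycle: dict[int, int] = {}
    remaining = list(range(len(deps)))
    cycle = 0
    while remaining:
        cycle += 1
        # Ready = all producers finished in an earlier cycle.
        ready = [i for i in remaining
                 if all(done_cycle.get(d, cycle) < cycle for d in deps[i])]
        for i in ready[:issue_width]:
            done_cycle[i] = cycle
            remaining.remove(i)
    return cycle


if __name__ == "__main__":
    # Hypothetical six-instruction sequence with a critical path of length 3.
    deps = [[], [], [0, 1], [], [3], [2, 4]]
    for width in (1, 2, 4):
        cycles = cycles_needed(deps, width)
        print(f"issue width {width}: {cycles} cycles, IPC = {len(deps) / cycles:.2f}")
```

Once the issue width exceeds the parallelism available in the code, the dependence chain sets the limit, which is why extracting this level of parallelism from a single program becomes hard.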
MIPS R4000
• First 64-bit architecture
• Integrated caches
  - On-chip
  - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as an attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out-of-order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in-flight)
  - Maintains precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  - Uses a compiler-centric approach while avoiding its disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software Techniques
  - Static scheduling
  - Static issue (i.e. VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
• Hardware Techniques
  - Dynamic scheduling
  - Dynamic issue (i.e. superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
• Wide variety of approaches, both hardware and compiler intensive
• Software side: lower hardware complexity, more longer-range analysis, more machine dependence
• Hardware side: more stable performance, higher complexity, potential clock rate impact
• No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: "ILP Mountain" with the labels simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling and speculation; "Cache Mountain" with the labels simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching and multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (0.1 to 100, log scale) versus year (1965-1995) for supercomputers, mainframes, minicomputers and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistor count (1,000 to 100,000,000, log scale) versus year (1970-2005), with regions labelled bit-level parallelism, instruction-level and thread-level (). Processors plotted: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000.]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4).
They support 2-4 threads, but expect only about a 1.3X improvement in throughput.
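A throughput gain of about 1.3X is worth unpacking, because total work per second goes up while each individual thread slows down. The small calculation below assumes, purely for illustration, that the threads share the processor equally; the function name is hypothetical.

```python
def per_thread_speed(throughput_gain: float, threads: int) -> float:
    """Speed of each thread relative to running alone, assuming equal sharing."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return throughput_gain / threads


if __name__ == "__main__":
    gain = 1.3  # overall throughput improvement quoted on the slide
    for threads in (2, 4):
        speed = per_thread_speed(gain, threads)
        print(f"{threads} threads: each runs at {speed:.3f} of its solo speed")
```

So SMT in this form trades single-thread latency for throughput: the processor finishes more work overall, but every thread takes longer than it would alone.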
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
• 4-8 general-purpose processing engines on chip, used to execute
  - Independent programs
  - Explicitly parallel programs (when possible)
  - Speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
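The three configurations in the table are easier to compare once the per-CPU figures are multiplied out to chip totals. The sketch below does that for issue width, thread count and first-level cache size. The dictionary keys and the `chip_totals` helper are illustrative names, and the cache size is scaled exactly as listed in the table, without assuming whether it covers one cache or both.

```python
# Per-CPU values copied from the table; the key names are illustrative.
CONFIGS = {
    "superscalar": {"cpus": 1, "issue_width": 12, "threads": 1, "l1_kbytes": 128},
    "smt":         {"cpus": 1, "issue_width": 12, "threads": 8, "l1_kbytes": 128},
    "cmp":         {"cpus": 8, "issue_width": 2,  "threads": 1, "l1_kbytes": 16},
}


def chip_totals(cfg: dict[str, int]) -> dict[str, int]:
    """Scale per-CPU resources by the number of CPUs on the die."""
    n = cfg["cpus"]
    return {
        "issue_width": n * cfg["issue_width"],
        "threads": n * cfg["threads"],
        "l1_kbytes": n * cfg["l1_kbytes"],
    }


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        t = chip_totals(cfg)
        print(f"{name:12s} issue={t['issue_width']:2d} "
              f"threads={t['threads']} L1={t['l1_kbytes']} kbytes")
```

Read this way, the designs are given comparable total resources: the chip multiprocessor offers a slightly higher aggregate issue width and the same thread count as SMT, but splits everything into eight small, independent cores.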
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we will generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges: Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view on the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education (not just the undergraduate curriculum) organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (in both academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now: some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that web-based teaching, distance learning, electronic books and interactive learning environments will play increasingly significant roles in shaping what we teach, how we teach and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
The First Generation 1971-78The First Generation 1971-78
Getting enough bits and transistors Transistor counts lt 50000 Performance lt 05 MIPS Architecture 8-16 bits
ndash Narrow datapaths (= slow performance)ndash Awkward architecturesndash Assembly language + some BASIC
Processorsndash Intel 4004 8008 8080 8086ndash Zilog Z-80ndash Motorola 6800 6502
Intel 4004Intel 4004 First general-purpose
single-chip microprocessor Shipped in 1971 8-bit architecture 4-bit
implementation 2300 transistors Performance lt 01 MIPS 8008 8-bit implementation
in 1972ndash 3500 transistorsndash First microprocessor-based
computer (Micral) Targeted at laboratory
instrumentation Mostly sold in Europe
Intel 8080Intel 8080 Intelrsquos first 16-bit architecture
ndash Delivered in 1974ndash 4800 transistorsndash Performance lt 02 MIPS
Used in Altair 8800 system ndash Kit form (advertised in Popular
Electronics) in 1975 $297 or $395 with case 256 bytes of memory
expandable to 64K Keyboard and floppy 100-line bus becomes S-100
first microcomputer busndash Gates amp Allen write BASICndash Wozniak builds one
Homebrew Computer Club
Intel 8086Intel 8086
Introduced in 1978ndash Performance lt 05 MIPS
New 16-bit architecturendash ldquoAssembly languagerdquo
compatible with 8080ndash 29000 transistorsndash Includes memory protection
support for FP coprocessor In 1981 IBM introduces
PC ndash Based on 8088--8-bit bus
version of 8086
Second Generation 1979-85Second Generation 1979-85 Becoming ldquorealrdquo computers
ndash First 32-bit architecture (68000)ndash First virtual memory supportndash Workstations Macs and PCs based on microprocessors
Transistors gt50000 Performance lt= 1 MIPS Processors
ndash Motorola 68000 68020ndash Intel 80286 80386
Motorola 68000Motorola 68000 Major architectural step in
microprocessorsndash First 32-bit architecture
initial 16-bit implementation
ndash First flat 32-bit address Support for paging
ndash General-purpose register architecture
Loosely based on PDP-11
First implementation in 1979ndash 68000 transistorsndash lt 1 MIPS
Used inndash Apple Macndash Sun Silicon Graphics amp Apollo
workstations
Third Generation 1985-89Third Generation 1985-89 Challenging the ldquoestablishmentrdquo
ndash Microprocessors surpass minicomputers in performance rival mainframes
ndash Implementation technology of choice all new architectures are microprocessors
ndash RISC architecture techniques take hold Transistors lt 500K Performance gt 5 MIPS Processors
ndash MIPS R2000 R3000ndash Sun SPARCndash HP PA-RISC
MIPS R2000MIPS R2000 Several firsts
ndash First RISC microprocessor
ndash First microprocessor to provide integrated support for instruction amp data cache
ndash First pipelined microprocessor (sustains 1 instructionclock)
Implemented in 1985ndash 125000 transistorsndash 5-8 MIPS
Fourth Generation 1990-Fourth Generation 1990- Architectural and performance leadership
ndash First 64-bit architecturendash First multiple-issue machinendash First multilevel caches
Transistors gt1M Clock ratesgt 100MHz Performance gt 50 MIPS Processors
ndash Intel i860 Pentium MIPS R4000 MIPS R1000 DEC Alpha Sun UltraSPARC HP PA-RISC PowerPC
Generation 45 ndash same basic approach but faster clock rates amp wider issuendash Alpha 21264 Pentium III amp 4 Intel Itanium
Key Architectural TrendsKey Architectural Trends Increase performance at 16x per year
ndash True from 1985-present Combination of technology and architectural
enhancementsndash Technology provides faster transistors and more of themndash Faster transistors leads to high clock ratesndash More transistors
Architectural ideas turn transistors into performancendash Responsible for about half the yearly performance growth
Two key architectural directionsndash Sophisticated memory hierarchiesndash Exploiting instruction level parallelism
Memory HierarchiesMemory Hierarchies Caches hide latency of DRAM and increase BW
ndash CPU-DRAM access gap has grown by a factor of 30-50 Trend 1 Increasingly large caches
ndash On-chip from 128 bytes (1984) to 100K+ bytesndash Multilevel caches add another level of caching
First multilevel cache1986 Secondary cache sizes today 128KB to 4-16 MB
Trend 2 Advances in caching techniquesndash Reduce or hide cache miss latencies
early restart after cache miss (1992) nonblocking caches continue during a cache miss (1994)
ndash Cache aware combos computers compilers code writers prefetching instruction to bring data into cache early
Exploiting ILPExploiting ILP ILP is the implicit parallelism among instructions Exploited by
ndash Overlapping execution in a pipeline ndash Issuing multiple instruction per clock
superscalar uses dynamic issue decision (HW driven) VLIW uses static issue decision (SW driven)
1985 simple microprocessor pipeline (1 instrclock) 1990 first static multiple issue microprocessors 1995 sophisticated dynamic schemes
ndash determine parallelism dynamicallyndash execute instructions out-of-orderndash speculative execution depending on branch prediction
ldquoOff-the-shelfrdquo ILP techniques yielded 20 year path
MIPS R4000MIPS R4000 First 64-bit architecture Integrated caches
ndash On-chipndash Support for off-chip secondary
cache Integrated floating point Implemented in 1991
ndash Deep pipelinendash 14M transistorsndash Initially 100MHzndash gt 50 MIPS
Intel i860Intel i860
First multiple issue microprocessor ndash 2 instructionsclockndash Dual issue mode ndash Novel push pipelinendash Novel cache bypass
Implemented in 1991ndash 13M transistorsndash 50 mips
Used primarily as attached processor (eg graphics)
MIPS R10000MIPS R10000
First speculative processorndash Instruction scheduled and
executed out-of-orderndash Up to 4 instructions can
complete per clockndash Window of 32 instructions
(up to 32 in-flight)ndash Maintain precise state by
completing instructions in order
Implemented in 1996ndash 68M transistorsndash 200 MHz
Intel IA-64 and ItaniumIntel IA-64 and Itanium
EPIC architecturendash Use compiler centric approach
while avoiding disadvantagesndash Parallelism demarcated by the
compilerndash Many special instruction amp
features for exploiting ILP in the compiler
Itaniumndash First implementation (2001) ndash 25 M transistorsndash 800 MHzndash 130 Watts
Breakdown of tasks between Breakdown of tasks between compiler and runtime hardwarecompiler and runtime hardware
Todayrsquos Uniprocessor ILP Todayrsquos Uniprocessor ILP MenuMenu
Software Techniquesndash Static schedulingndash Static issue (ie VLIW)ndash Static branch predictionndash Aliaspointer analysisndash Static speculation
Hardware Techniquesndash Dynamic schedulingndash Dynamic issue (ie superscalar)ndash Dynamic branch predictionndash Dynamic disambiguationndash Dynamic speculation
Wide variety of approaches both hardware and compiler intensive
Lower hardware complexity
More longer range analysis
More machine dependence
Lower hardware complexity
More longer range analysis
More machine dependence
More stable performance
Higher complexity
Potential clock rate impact
More stable performance
Higher complexity
Potential clock rate impact
No clear cut winners at the present
Big Picture--ILP and Memory Big Picture--ILP and Memory SystemsSystems
Simplepipelining
Scheduledpipelines
Multipleissue
Dynamicschedulin
g
Speculation
My view
bullNo performance wall but steeper slopes ahead
bullEasier territory is behind us
bullIndustry-research gap vanished
bullEnergy efficiency may be key limit
ILPMountai
n
Multilevelcaches amp buffers
Critical word amp early restart
Compilerprefetchin
g
Multipathprefetchin
g
Simplecaches
CacheMountai
n
Per
form
ance
01
1
10
100
1965 1970 1975 1980 1985 1990 1995
Supercomputers
Minicomputers
Mainframes
Microprocessors
Microprocessors today where Microprocessors today where they are and what can dothey are and what can do
Tran
sist
ors
1000
10000
100000
1000000
10000000
100000000
1970 1975 1980 1985 1990 1995 2000 2005
Bit-level parallelism Instruction-level Thread-level ()
i4004
i8008i8080
i8086
i80286
i80386
R2000
Pentium
R10000
R3000
Microprocessors where they goMicroprocessors where they go
Intel more TransistorIntel more Transistor
Intel Faster DevicesIntel Faster Devices
Number of Transistors in Number of Transistors in Intelrsquos processorsIntelrsquos processors
Higher level parallelismHigher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The more prononuced are
a) simultaneons multithreaded (SMT) processor and
b) chip multiprocessors (CMT)
MultithreadingMultithreading
Microprocessor can execute multiple operations at a time4 or 6 operations per cycle
Hard to achieve this level of parallelism from single program
Can we run multiple programs (threads) on (single) processor without much effort
Simultaneous multithreading (SMT) or Hyperthreading is a solution
Parallel Thread Sequencing ModelParallel Thread Sequencing Model
Principles of SMTPrinciples of SMT
Multithreading in todayrsquos Multithreading in todayrsquos processorsprocessors
Today many high-end microprocessors are multithreaded (eg Intel Pentium 4)
Support for 2-4 threads but expect to get only 13X improvement in throughput
Chip MultiprocessorChip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model
CMP is a simple very powerful techiniques to obtain more performance in a power-effecient manner
The idea is to put several microprocessors on a single die This type of architecture is reffered also as Multiprocessor System-on-Chip (MPSoC)
The performance of small-scale CMP scales close to linear with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model - continue - continue
CMP is an atractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications we meet in network processors multimedia hubs signal processors etc
MPSoCs are usually implemented as heterogenous systems
CMT and SMT can coexist-a CMP die can integrate several SMT microprocessors
Generic circa 2010 Generic circa 2010 MicroprocessorMicroprocessor
4 ndash 8 general-purpose processing engines on chip used to execute independent programs
Explicitly parallel programs (when possible) Speculatively parallel threads
Special-purpose processing units (eg DSP functionality)
Elaborate memory hierarchyElaborate inter-chip communication facilities
Characteristics of superscalar simultaneous Characteristics of superscalar simultaneous multithreading and chip multiprocessor architecturesmultithreading and chip multiprocessor architectures
Characteristic SuperscalarSimultaneousmultithreading
Chipmultiprocessor
Number of CPUs 1 1 8CPU issue width 12 12 2 per CPUNumber of threads 1 8 1 per CPUArchitecture registers (for integer and floating point) 32 32 per thread 32 per CPUPhysical registers (for integer and floating point 32 + 256 256 + 256 32 + 32 per CPUInstruction window size 256 256 32 per CPUBranch predictor table size (entries) 32768 32768 8x4096Return stack size 64 entries 64 entries 8x8 entriesInstruction (I) and data (D) cache organization 1x8 banks 1x8 banks 1 bankI and D cache sizes 128 kbytes 128 kbytes 16 kbytes per CPUI and D cache associativities 4-way 4-way 4-wayI and 0 cache line sizes (bytes) 32 32 32I and P cache access times (cycles) 2 2 1Secondary cache organization (Mbytes) 1x8 banks 1x8 banks 1x8 banksSecondary cache size (bytes) 8 8 8Secondary cache associativity 4-way 4-way 4-waySecondary cache line size (bytes) 32 32 32Secondary cache access time (cycles) 5 5 7Secondary cache occupancy per access (cycles) 1 1 1Memory organization (no of banks) 4 4 4Memory access time (cycles) 50 50 50Memory occupancy per access (cycles) 13 13 13
The microprocessor tomorrowThe microprocessor tomorrow When we say ldquomicroprocessorrdquo tomorrow we generally mean the shaded area of the figure
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Challenges in EducationChallenges in Education
bullChanges in curriculaChanges in curriculabullFundamentalsFundamentalsbullA sort of the challenge we should acceptA sort of the challenge we should accept
ChalengeChalengess in Education in Education
It has often said that Where you stand depends on where you sit
In this context starting from our positions an experiences this is our view concerning the theme How shall we satisfy the long-term educational needs of engineers
How to organize a training of new How to organize a training of new engineers engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then
Is the whole system of engineering education ndash not just the undergraduate curriculum ndash organized to support todayrsquos graduate for the next 40 years
We think not on both counts
Our view amp our experienceOur view amp our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academy and industry) which admittedly has changed more rapidly than same other fields
Changes in curriculaChanges in curricula
It is almost a clicheacute to talk about change ndash so mach so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals do we need now? Some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global cultural and business contexts, and so must understand them. These, too, are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• web-based teaching,
• distance learning,
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Intel 4004
• First general-purpose, single-chip microprocessor
• Shipped in 1971
• 8-bit architecture, 4-bit implementation
• 2,300 transistors
• Performance < 0.1 MIPS
• 8008: 8-bit implementation in 1972
  – 3,500 transistors
  – First microprocessor-based computer (Micral)
    • Targeted at laboratory instrumentation
    • Mostly sold in Europe
Intel 8080
• 8-bit microprocessor with 16-bit addressing (64 KB)
  – Delivered in 1974
  – 4,800 transistors
  – Performance < 0.2 MIPS
• Used in Altair 8800 system
  – Kit form (advertised in Popular Electronics) in 1975
    • $297, or $395 with case
    • 256 bytes of memory, expandable to 64K
    • Keyboard and floppy
    • 100-line bus becomes S-100, the first microcomputer bus
  – Gates & Allen write BASIC
  – Wozniak builds one
    • Homebrew Computer Club
Intel 8086
• Introduced in 1978
  – Performance < 0.5 MIPS
• New 16-bit architecture
  – "Assembly language" compatible with 8080
  – 29,000 transistors
  – Includes memory protection, support for FP coprocessor
• In 1981, IBM introduces PC
  – Based on 8088, the 8-bit bus version of 8086
Second Generation: 1979-85
• Becoming "real" computers
  – First 32-bit architecture (68000)
  – First virtual memory support
  – Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
  – Motorola 68000, 68020
  – Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
  – First 32-bit architecture
    • Initial 16-bit implementation
  – First flat 32-bit address
    • Support for paging
  – General-purpose register architecture
    • Loosely based on PDP-11
• First implementation in 1979
  – 68,000 transistors
  – < 1 MIPS
• Used in
  – Apple Mac
  – Sun, Silicon Graphics & Apollo workstations
Third Generation: 1985-89
• Challenging the "establishment"
  – Microprocessors surpass minicomputers in performance, rival mainframes
  – Implementation technology of choice: all new architectures are microprocessors
  – RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
  – MIPS R2000, R3000
  – Sun SPARC
  – HP PA-RISC
MIPS R2000
• Several firsts
  – First RISC microprocessor
  – First microprocessor to provide integrated support for instruction & data cache
  – First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
  – 125,000 transistors
  – 5-8 MIPS
Fourth Generation: 1990-
• Architectural and performance leadership
  – First 64-bit architecture
  – First multiple-issue machine
  – First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
  – Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5: same basic approach, but faster clock rates & wider issue
  – Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
  – True from 1985 to present
• Combination of technology and architectural enhancements
  – Technology provides faster transistors and more of them
  – Faster transistors lead to high clock rates
  – More transistors: architectural ideas turn transistors into performance
    • Responsible for about half the yearly performance growth
• Two key architectural directions
  – Sophisticated memory hierarchies
  – Exploiting instruction-level parallelism
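To get a feel for how quickly the yearly growth figure on this slide compounds, here is a minimal sketch. It takes the 1.6x rate from the slide; the function names are hypothetical, and the equal multiplicative split between technology and architecture is our reading of "about half", not something the source states.

```python
import math


def cumulative_speedup(rate_per_year: float, years: int) -> float:
    """Total speedup after compounding a yearly growth factor."""
    return rate_per_year ** years


def equal_split_factor(rate_per_year: float) -> float:
    """Yearly factor per contributor if two contributors share growth equally.

    Assumption: "half the growth" is read multiplicatively, so each
    contributor supplies sqrt(rate) and the two factors multiply back.
    """
    return math.sqrt(rate_per_year)


if __name__ == "__main__":
    rate = 1.6  # yearly performance growth quoted on the slide
    for years in (1, 5, 10, 15):
        print(f"{years:2d} years: {cumulative_speedup(rate, years):8.1f}x")
    print(f"per-contributor yearly factor: {equal_split_factor(rate):.3f}")
```

Under this reading, technology and architecture would each contribute a little under 1.3x per year.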
Memory Hierarchies
• Caches hide latency of DRAM and increase bandwidth (BW)
  – CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  – On-chip: from 128 bytes (1984) to 100K+ bytes
  – Multilevel caches: add another level of caching
    • First multilevel cache: 1986
    • Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  – Reduce or hide cache miss latencies
    • Early restart after cache miss (1992)
    • Nonblocking caches: continue during a cache miss (1994)
  – Cache-aware combos: computers, compilers, code writers
    • Prefetching: instruction to bring data into cache early
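The benefit of adding a second cache level can be illustrated with the standard average memory access time (AMAT) formula, which is not part of the slides. The sketch below is only an illustration: the function names are hypothetical, the access times (2, 5 and 50 cycles) are borrowed from the superscalar column of the architecture comparison table, and the miss rates are made-up example values.

```python
def amat_single_level(l1_hit: float, l1_miss_rate: float, mem_time: float) -> float:
    """Average access time (cycles) with one cache level in front of memory."""
    return l1_hit + l1_miss_rate * mem_time


def amat_two_level(l1_hit: float, l1_miss_rate: float,
                   l2_hit: float, l2_miss_rate: float,
                   mem_time: float) -> float:
    """Average access time (cycles) with a secondary cache behind the primary.

    l2_miss_rate is the local miss rate: the fraction of accesses that
    reach the secondary cache and also miss there.
    """
    l1_miss_penalty = l2_hit + l2_miss_rate * mem_time
    return l1_hit + l1_miss_rate * l1_miss_penalty


if __name__ == "__main__":
    # Access times follow the comparison table; miss rates are assumed.
    one = amat_single_level(l1_hit=2, l1_miss_rate=0.05, mem_time=50)
    two = amat_two_level(l1_hit=2, l1_miss_rate=0.05,
                         l2_hit=5, l2_miss_rate=0.2, mem_time=50)
    print(f"one level : {one:.2f} cycles")
    print(f"two levels: {two:.2f} cycles")
```

With these assumed miss rates, the secondary cache absorbs most primary misses, so the average access time stays close to the primary hit time.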
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  – Overlapping execution in a pipeline
  – Issuing multiple instructions per clock
    • Superscalar: uses dynamic issue decision (HW driven)
    • VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  – Determine parallelism dynamically
  – Execute instructions out of order
  – Speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded a 20-year path
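The limit that data dependences place on multiple issue can be sketched with a toy scheduler. This is an illustrative model, not a description of any real processor: every instruction is assumed to take one cycle, instructions may issue out of order as soon as their operands are ready, and the function name and the example dependence graph are hypothetical.

```python
from typing import List


def schedule_cycles(deps: List[List[int]], issue_width: int) -> int:
    """Cycles needed to issue all instructions on an idealized machine.

    deps[i] lists the indices of earlier instructions that instruction i
    depends on. Each cycle, up to issue_width ready instructions issue,
    oldest first; a result is usable from the following cycle on.
    """
    if issue_width < 1:
        raise ValueError("issue_width must be at least 1")
    issued_in = [0] * len(deps)  # 0 means "not issued yet"
    remaining = set(range(len(deps)))
    cycle = 0
    while remaining:
        cycle += 1
        ready = [i for i in sorted(remaining)
                 if all(0 < issued_in[d] < cycle for d in deps[i])]
        for i in ready[:issue_width]:
            issued_in[i] = cycle
            remaining.discard(i)
    return cycle


if __name__ == "__main__":
    # Six instructions; the longest dependence chain is 0 -> 2 -> 5.
    deps = [[], [], [0, 1], [], [3], [2, 4]]
    for width in (1, 2, 4):
        cycles = schedule_cycles(deps, width)
        print(f"issue width {width}: {cycles} cycles, IPC = {len(deps) / cycles:.2f}")
```

In this example, widening the machine from 2 to 4 issue slots saves only one cycle, because the three-instruction dependence chain sets a floor on execution time.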
MIPS R4000
• First 64-bit architecture
• Integrated caches
  – On-chip
  – Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  – Deep pipeline
  – 1.4M transistors
  – Initially 100 MHz
  – > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  – 2 instructions/clock
  – Dual issue mode
  – Novel push pipeline
  – Novel cache bypass
• Implemented in 1991
  – 1.3M transistors
  – 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
  – Instructions scheduled and executed out of order
  – Up to 4 instructions can complete per clock
  – Window of 32 instructions (up to 32 in flight)
  – Maintains precise state by completing instructions in order
• Implemented in 1996
  – 6.8M transistors
  – 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
  – Uses compiler-centric approach while avoiding disadvantages
  – Parallelism demarcated by the compiler
  – Many special instructions & features for exploiting ILP in the compiler
• Itanium
  – First implementation (2001)
  – 25M transistors
  – 800 MHz
  – 130 watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software techniques
  – Static scheduling
  – Static issue (i.e. VLIW)
  – Static branch prediction
  – Alias/pointer analysis
  – Static speculation
  – Trade-offs: lower hardware complexity, more longer-range analysis, more machine dependence
• Hardware techniques
  – Dynamic scheduling
  – Dynamic issue (i.e. superscalar)
  – Dynamic branch prediction
  – Dynamic disambiguation
  – Dynamic speculation
  – Trade-offs: more stable performance, higher complexity, potential clock rate impact
• Wide variety of approaches, both hardware and compiler intensive
• No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year (1965-1995) for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year (1970-2005) for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000; successive eras are labeled bit-level parallelism, instruction-level parallelism and thread-level parallelism (?)]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
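The transistor counts quoted on the processor slides can be turned into an average doubling period. The sketch below is a rough illustration: it assumes steady exponential growth between two data points, the helper name is hypothetical, and the counts and years are the ones given on the slides for the 4004, 8080, 8086 and Itanium.

```python
import math


def doubling_period_months(count_start: float, year_start: int,
                           count_end: float, year_end: int) -> float:
    """Average months per doubling between two (year, transistor count) points."""
    doublings = math.log2(count_end / count_start)
    return 12.0 * (year_end - year_start) / doublings


if __name__ == "__main__":
    # (name, year, transistors) as quoted on the slides
    chips = [
        ("Intel 4004", 1971, 2_300),
        ("Intel 8080", 1974, 4_800),
        ("Intel 8086", 1978, 29_000),
        ("Intel Itanium", 2001, 25_000_000),
    ]
    first_name, first_year, first_count = chips[0]
    for name, year, count in chips[1:]:
        months = doubling_period_months(first_count, first_year, count, year)
        print(f"{first_name} -> {name}: one doubling every {months:.1f} months")
```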
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to achieve higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyper-Threading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4).
They support 2-4 threads, but expect to get only a 1.3X improvement in throughput.
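The idea behind SMT, filling issue slots that one thread leaves empty with instructions from another, can be sketched with a toy slot-counting model. Everything here is an assumption made for illustration: the function name, the issue width of 4 and the per-cycle counts of ready instructions are invented, and instructions that do not issue are simply dropped rather than carried to the next cycle.

```python
from typing import List


def issued_per_cycle(ready_by_thread: List[List[int]], issue_width: int) -> List[int]:
    """Instructions issued each cycle when threads share the issue slots.

    ready_by_thread[t][c] is the number of instructions thread t has
    ready in cycle c. Each cycle issues at most issue_width of them.
    """
    return [min(issue_width, sum(ready)) for ready in zip(*ready_by_thread)]


if __name__ == "__main__":
    width = 4
    thread_a = [1, 2, 0, 3, 1, 0]  # assumed ready instructions per cycle
    thread_b = [2, 0, 1, 2, 3, 2]
    alone = sum(issued_per_cycle([thread_a], width))
    shared = sum(issued_per_cycle([thread_a, thread_b], width))
    slots = width * len(thread_a)
    print(f"thread A alone: {alone} of {slots} slots used")
    print(f"A and B on SMT: {shared} of {slots} slots used")
```

This toy overstates the benefit compared with the 1.3X figure quoted above, since it ignores contention for caches, registers and other shared resources.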
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
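How close to linear the scaling is depends on how much of the workload can run in parallel. One common way to estimate this is Amdahl's law, which the slides do not mention; the sketch below applies it with a hypothetical function name and assumed parallel fractions.

```python
def amdahl_speedup(cores: int, parallel_fraction: float) -> float:
    """Speedup on `cores` processors when only part of the work parallelizes.

    Amdahl's law: the serial part runs at single-core speed, and the
    parallel part is divided evenly among the cores.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    for fraction in (1.0, 0.95, 0.8):  # assumed parallel fractions
        row = ", ".join(f"{n} cores: {amdahl_speedup(n, fraction):.2f}x"
                        for n in (2, 4, 8))
        print(f"parallel fraction {fraction:.2f} -> {row}")
```

For small core counts and highly parallel workloads the estimate stays near the core count, which is consistent with the near-linear scaling described above.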
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute
  – independent programs
  – explicitly parallel programs (when possible)
  – speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
interactive learning environments
will play increasingly significant roles in shaping what we teach how we teach and how students learn
A sort of challenge we should acceptA sort of challenge we should accept
During one visit at our faculty Prof Krishna Shenai from University of Illinois of Chicago director of Micro Systems Research Center says to us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Intel 8086
• Introduced in 1978
– Performance < 0.5 MIPS
• New 16-bit architecture
– "Assembly language" compatible with 8080
– 29,000 transistors
– Includes memory protection, support for FP coprocessor
• In 1981 IBM introduces the PC
– Based on 8088 (8-bit bus version of 8086)
Second Generation 1979-85
• Becoming "real" computers
– First 32-bit architecture (68000)
– First virtual memory support
– Workstations, Macs and PCs based on microprocessors
• Transistors > 50,000
• Performance <= 1 MIPS
• Processors
– Motorola 68000, 68020
– Intel 80286, 80386
Motorola 68000
• Major architectural step in microprocessors
– First 32-bit architecture (initial 16-bit implementation)
– First flat 32-bit address; support for paging
– General-purpose register architecture (loosely based on PDP-11)
• First implementation in 1979
– 68,000 transistors
– < 1 MIPS
• Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation 1985-89
• Challenging the "establishment"
– Microprocessors surpass minicomputers in performance, rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
• Transistors < 500K
• Performance > 5 MIPS
• Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
• Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data cache
– First pipelined microprocessor (sustains 1 instruction/clock)
• Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation 1990-
• Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
• Transistors > 1M
• Clock rates > 100 MHz
• Performance > 50 MIPS
• Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
• Generation 4.5 – same basic approach but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
• Increase performance at 1.6x per year
– True from 1985 to the present
• Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to high clock rates
– More transistors: architectural ideas turn transistors into performance
– Architecture is responsible for about half the yearly performance growth
• Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism
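As a rough illustration of what this growth rate means, the sketch below compounds the yearly factor over several years. It reads the slide's figure as 1.6x per year, and it assumes that "about half the yearly performance growth" means technology and architecture each contribute an equal multiplicative factor (the square root of 1.6). Both readings, and the function names, are assumptions made for the example, not statements from the slides.

```python
import math


def cumulative_speedup(rate_per_year: float, years: int) -> float:
    """Total performance gain after compounding a yearly growth factor."""
    return rate_per_year ** years


def equal_split_factor(rate_per_year: float) -> float:
    """Yearly factor from each of two equal multiplicative contributors.

    Assumption: technology and architecture each supply half of the
    growth in the multiplicative (logarithmic) sense.
    """
    return math.sqrt(rate_per_year)


RATE = 1.6  # assumed reading of the slide: 1.6x per year

for years in (1, 5, 10):
    print(f"{years:2d} years -> {cumulative_speedup(RATE, years):8.1f}x overall")

print(f"each contributor alone: {equal_split_factor(RATE):.3f}x per year")
```

Under these assumptions a decade of 1.6x yearly growth compounds to roughly a 110-fold gain, of which architecture alone would account for about a 10-fold gain.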
Memory Hierarchies
• Caches hide latency of DRAM and increase bandwidth
– CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches: add another level of caching
    – First multilevel cache: 1986
    – Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
    – early restart after cache miss (1992)
    – nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
    – prefetching: instruction to bring data into cache early
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
    – superscalar: uses dynamic issue decision (HW driven)
    – VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out-of-order
– speculative execution depending on branch prediction
• "Off-the-shelf" ILP techniques yielded a 20-year path
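The effect of issue width on a dependence-limited instruction stream can be sketched with a toy model. The program below is hypothetical: six unit-latency instructions with made-up dependences, issued greedily out of order with at most `width` instructions per cycle. It ignores pipelining depth, branches and memory, so it only illustrates how data dependences cap the benefit of wider issue.

```python
from typing import Dict, List

# Hypothetical program: DEPS[i] lists the earlier instructions whose
# results instruction i needs before it can issue.
DEPS: List[List[int]] = [[], [], [0, 1], [], [3], [2, 4]]


def cycles_needed(deps: List[List[int]], width: int) -> int:
    """Cycles to issue all instructions, at most `width` per cycle.

    An instruction may issue once every instruction it depends on has
    issued in an earlier cycle (unit latency, out-of-order issue).
    """
    issue_cycle: Dict[int, int] = {}
    remaining = list(range(len(deps)))
    cycle = 0
    while remaining:
        issued = []
        for i in remaining:
            if len(issued) == width:
                break
            # issue_cycle holds only instructions from earlier cycles here
            if all(d in issue_cycle for d in deps[i]):
                issued.append(i)
        for i in issued:
            issue_cycle[i] = cycle
        remaining = [i for i in remaining if i not in issue_cycle]
        cycle += 1
    return cycle


for width in (1, 2, 4):
    cycles = cycles_needed(DEPS, width)
    print(f"width {width}: {cycles} cycles, IPC = {len(DEPS) / cycles:.2f}")
```

For this particular dependence graph the model needs 6, 4 and 3 cycles at widths 1, 2 and 4: doubling the issue width from 2 to 4 buys less than the first doubling, because the longest dependence chain is three instructions long.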
MIPS R4000
• First 64-bit architecture
• Integrated caches
– On-chip
– Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
• First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
• Implemented in 1991
– 1.3M transistors
– 50 MIPS
• Used primarily as attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
– Instructions scheduled and executed out-of-order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in-flight)
– Maintains precise state by completing instructions in order
• Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
• EPIC architecture
– Uses a compiler-centric approach while avoiding its disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
• Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software Techniques
– Static scheduling
– Static issue (i.e. VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
• Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e. superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
• Wide variety of approaches, both hardware and compiler intensive
• Software techniques
– Lower hardware complexity
– More longer-range analysis
– More machine dependence
• Hardware techniques
– More stable performance
– Higher complexity
– Potential clock rate impact
• No clear-cut winners at present
Big Picture – ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year (1965-1995) for supercomputers, mainframes, minicomputers and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year (1970-2005), spanning the eras of bit-level, instruction-level and thread-level (?) parallelism; plotted processors: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to obtain higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
• A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
• It is hard to achieve this level of parallelism from a single program
• Can we run multiple programs (threads) on a single processor without much effort?
• Simultaneous multithreading (SMT), or Hyperthreading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
• Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
• Support for 2-4 threads, but expect to get only a 1.3X improvement in throughput
Chip Multiprocessor
• Several processor cores on one die
• Shared L2 caches
• Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
• CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner
• The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC)
• The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
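One way to see when "close to linear" scaling is plausible is Amdahl's law, which is brought in here as an illustrative model and is not taken from the slides. The sketch assumes a workload in which a fraction `parallel_fraction` of the work can be spread evenly across the cores; the value 0.95 is a made-up example.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup over one core when only part of the work parallelizes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


PARALLEL_FRACTION = 0.95  # hypothetical workload, not from the slides

for cores in (2, 4, 8):
    speedup = amdahl_speedup(PARALLEL_FRACTION, cores)
    efficiency = speedup / cores
    print(f"{cores} cores: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

With this assumed workload, efficiency stays high at small core counts and falls off as cores are added, which fits the slide's restriction of the near-linear claim to small-scale CMPs.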
Chip multiprocessor (CMP) platform model – continued
• CMP is an attractive option when moving to a new process technology such as SoC
• Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
• MPSoCs are usually implemented as heterogeneous systems
• CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
• 4 – 8 general-purpose processing engines on chip, used to execute
– Independent programs
– Explicitly parallel programs (when possible)
– Speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures

Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
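The access times in the table can be combined into an average memory access time (AMAT) for each design. The sketch below uses the table's cycle counts for the primary cache, secondary cache and memory. The miss rates are not in the source and are purely hypothetical, and using the same miss rates for a 128-Kbyte and a 16-Kbyte primary cache is a simplification that favours the chip multiprocessor.

```python
def amat(l1_cycles: float, l2_cycles: float, mem_cycles: float,
         l1_miss_rate: float, l2_miss_rate: float) -> float:
    """Average memory access time in cycles for a two-level cache."""
    return l1_cycles + l1_miss_rate * (l2_cycles + l2_miss_rate * mem_cycles)


# Hypothetical miss rates, chosen only for illustration.
L1_MISS = 0.05
L2_MISS = 0.20

# Access times (cycles) taken from the table: primary, secondary, memory.
DESIGNS = {
    "superscalar / SMT": (2, 5, 50),
    "chip multiprocessor": (1, 7, 50),
}

for name, (l1, l2, mem) in DESIGNS.items():
    print(f"{name}: {amat(l1, l2, mem, L1_MISS, L2_MISS):.2f} cycles")
```

With these assumed miss rates the results are about 2.75 cycles for the superscalar and SMT designs and about 1.85 cycles for the chip multiprocessor. A real comparison would use a higher primary miss rate for the smaller per-CPU caches.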
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit".
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education – not just the undergraduate curriculum – organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change – so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now – some examples
• Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
• Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
• Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
• Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These, too, are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Second Generation 1979-85Second Generation 1979-85 Becoming ldquorealrdquo computers
ndash First 32-bit architecture (68000)ndash First virtual memory supportndash Workstations Macs and PCs based on microprocessors
Transistors gt50000 Performance lt= 1 MIPS Processors
ndash Motorola 68000 68020ndash Intel 80286 80386
Motorola 68000Motorola 68000 Major architectural step in
microprocessorsndash First 32-bit architecture
initial 16-bit implementation
ndash First flat 32-bit address Support for paging
ndash General-purpose register architecture
Loosely based on PDP-11
First implementation in 1979ndash 68000 transistorsndash lt 1 MIPS
Used inndash Apple Macndash Sun Silicon Graphics amp Apollo
workstations
Third Generation 1985-89Third Generation 1985-89 Challenging the ldquoestablishmentrdquo
ndash Microprocessors surpass minicomputers in performance rival mainframes
ndash Implementation technology of choice all new architectures are microprocessors
ndash RISC architecture techniques take hold Transistors lt 500K Performance gt 5 MIPS Processors
ndash MIPS R2000 R3000ndash Sun SPARCndash HP PA-RISC
MIPS R2000MIPS R2000 Several firsts
ndash First RISC microprocessor
ndash First microprocessor to provide integrated support for instruction amp data cache
ndash First pipelined microprocessor (sustains 1 instructionclock)
Implemented in 1985ndash 125000 transistorsndash 5-8 MIPS
Fourth Generation 1990-Fourth Generation 1990- Architectural and performance leadership
ndash First 64-bit architecturendash First multiple-issue machinendash First multilevel caches
Transistors gt1M Clock ratesgt 100MHz Performance gt 50 MIPS Processors
ndash Intel i860 Pentium MIPS R4000 MIPS R1000 DEC Alpha Sun UltraSPARC HP PA-RISC PowerPC
Generation 45 ndash same basic approach but faster clock rates amp wider issuendash Alpha 21264 Pentium III amp 4 Intel Itanium
Key Architectural TrendsKey Architectural Trends Increase performance at 16x per year
ndash True from 1985-present Combination of technology and architectural
enhancementsndash Technology provides faster transistors and more of themndash Faster transistors leads to high clock ratesndash More transistors
Architectural ideas turn transistors into performancendash Responsible for about half the yearly performance growth
Two key architectural directionsndash Sophisticated memory hierarchiesndash Exploiting instruction level parallelism
Memory HierarchiesMemory Hierarchies Caches hide latency of DRAM and increase BW
ndash CPU-DRAM access gap has grown by a factor of 30-50 Trend 1 Increasingly large caches
ndash On-chip from 128 bytes (1984) to 100K+ bytesndash Multilevel caches add another level of caching
First multilevel cache1986 Secondary cache sizes today 128KB to 4-16 MB
Trend 2 Advances in caching techniquesndash Reduce or hide cache miss latencies
early restart after cache miss (1992) nonblocking caches continue during a cache miss (1994)
ndash Cache aware combos computers compilers code writers prefetching instruction to bring data into cache early
Exploiting ILPExploiting ILP ILP is the implicit parallelism among instructions Exploited by
ndash Overlapping execution in a pipeline ndash Issuing multiple instruction per clock
superscalar uses dynamic issue decision (HW driven) VLIW uses static issue decision (SW driven)
1985 simple microprocessor pipeline (1 instrclock) 1990 first static multiple issue microprocessors 1995 sophisticated dynamic schemes
ndash determine parallelism dynamicallyndash execute instructions out-of-orderndash speculative execution depending on branch prediction
ldquoOff-the-shelfrdquo ILP techniques yielded 20 year path
MIPS R4000MIPS R4000 First 64-bit architecture Integrated caches
ndash On-chipndash Support for off-chip secondary
cache Integrated floating point Implemented in 1991
ndash Deep pipelinendash 14M transistorsndash Initially 100MHzndash gt 50 MIPS
Intel i860Intel i860
First multiple issue microprocessor ndash 2 instructionsclockndash Dual issue mode ndash Novel push pipelinendash Novel cache bypass
Implemented in 1991ndash 13M transistorsndash 50 mips
Used primarily as attached processor (eg graphics)
MIPS R10000MIPS R10000
First speculative processorndash Instruction scheduled and
executed out-of-orderndash Up to 4 instructions can
complete per clockndash Window of 32 instructions
(up to 32 in-flight)ndash Maintain precise state by
completing instructions in order
Implemented in 1996ndash 68M transistorsndash 200 MHz
Intel IA-64 and ItaniumIntel IA-64 and Itanium
EPIC architecturendash Use compiler centric approach
while avoiding disadvantagesndash Parallelism demarcated by the
compilerndash Many special instruction amp
features for exploiting ILP in the compiler
Itaniumndash First implementation (2001) ndash 25 M transistorsndash 800 MHzndash 130 Watts
Breakdown of tasks between Breakdown of tasks between compiler and runtime hardwarecompiler and runtime hardware
Todayrsquos Uniprocessor ILP Todayrsquos Uniprocessor ILP MenuMenu
Software Techniquesndash Static schedulingndash Static issue (ie VLIW)ndash Static branch predictionndash Aliaspointer analysisndash Static speculation
Hardware Techniquesndash Dynamic schedulingndash Dynamic issue (ie superscalar)ndash Dynamic branch predictionndash Dynamic disambiguationndash Dynamic speculation
Wide variety of approaches both hardware and compiler intensive
Lower hardware complexity
More longer range analysis
More machine dependence
Lower hardware complexity
More longer range analysis
More machine dependence
More stable performance
Higher complexity
Potential clock rate impact
More stable performance
Higher complexity
Potential clock rate impact
No clear cut winners at the present
Big Picture--ILP and Memory Big Picture--ILP and Memory SystemsSystems
Simplepipelining
Scheduledpipelines
Multipleissue
Dynamicschedulin
g
Speculation
My view
bullNo performance wall but steeper slopes ahead
bullEasier territory is behind us
bullIndustry-research gap vanished
bullEnergy efficiency may be key limit
ILPMountai
n
Multilevelcaches amp buffers
Critical word amp early restart
Compilerprefetchin
g
Multipathprefetchin
g
Simplecaches
CacheMountai
n
Per
form
ance
01
1
10
100
1965 1970 1975 1980 1985 1990 1995
Supercomputers
Minicomputers
Mainframes
Microprocessors
Microprocessors today where Microprocessors today where they are and what can dothey are and what can do
Tran
sist
ors
1000
10000
100000
1000000
10000000
100000000
1970 1975 1980 1985 1990 1995 2000 2005
Bit-level parallelism Instruction-level Thread-level ()
i4004
i8008i8080
i8086
i80286
i80386
R2000
Pentium
R10000
R3000
Microprocessors where they goMicroprocessors where they go
Intel more TransistorIntel more Transistor
Intel Faster DevicesIntel Faster Devices
Number of Transistors in Number of Transistors in Intelrsquos processorsIntelrsquos processors
Higher level parallelismHigher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The more prononuced are
a) simultaneons multithreaded (SMT) processor and
b) chip multiprocessors (CMT)
MultithreadingMultithreading
Microprocessor can execute multiple operations at a time4 or 6 operations per cycle
Hard to achieve this level of parallelism from single program
Can we run multiple programs (threads) on (single) processor without much effort
Simultaneous multithreading (SMT) or Hyperthreading is a solution
Parallel Thread Sequencing ModelParallel Thread Sequencing Model
Principles of SMTPrinciples of SMT
Multithreading in todayrsquos Multithreading in todayrsquos processorsprocessors
Today many high-end microprocessors are multithreaded (eg Intel Pentium 4)
Support for 2-4 threads but expect to get only 13X improvement in throughput
Chip MultiprocessorChip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model
CMP is a simple very powerful techiniques to obtain more performance in a power-effecient manner
The idea is to put several microprocessors on a single die This type of architecture is reffered also as Multiprocessor System-on-Chip (MPSoC)
The performance of small-scale CMP scales close to linear with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model - continue - continue
CMP is an atractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications we meet in network processors multimedia hubs signal processors etc
MPSoCs are usually implemented as heterogenous systems
CMT and SMT can coexist-a CMP die can integrate several SMT microprocessors
Generic circa 2010 Generic circa 2010 MicroprocessorMicroprocessor
4 ndash 8 general-purpose processing engines on chip used to execute independent programs
Explicitly parallel programs (when possible) Speculatively parallel threads
Special-purpose processing units (eg DSP functionality)
Elaborate memory hierarchyElaborate inter-chip communication facilities
Characteristics of superscalar simultaneous Characteristics of superscalar simultaneous multithreading and chip multiprocessor architecturesmultithreading and chip multiprocessor architectures
Characteristic SuperscalarSimultaneousmultithreading
Chipmultiprocessor
Number of CPUs 1 1 8CPU issue width 12 12 2 per CPUNumber of threads 1 8 1 per CPUArchitecture registers (for integer and floating point) 32 32 per thread 32 per CPUPhysical registers (for integer and floating point 32 + 256 256 + 256 32 + 32 per CPUInstruction window size 256 256 32 per CPUBranch predictor table size (entries) 32768 32768 8x4096Return stack size 64 entries 64 entries 8x8 entriesInstruction (I) and data (D) cache organization 1x8 banks 1x8 banks 1 bankI and D cache sizes 128 kbytes 128 kbytes 16 kbytes per CPUI and D cache associativities 4-way 4-way 4-wayI and 0 cache line sizes (bytes) 32 32 32I and P cache access times (cycles) 2 2 1Secondary cache organization (Mbytes) 1x8 banks 1x8 banks 1x8 banksSecondary cache size (bytes) 8 8 8Secondary cache associativity 4-way 4-way 4-waySecondary cache line size (bytes) 32 32 32Secondary cache access time (cycles) 5 5 7Secondary cache occupancy per access (cycles) 1 1 1Memory organization (no of banks) 4 4 4Memory access time (cycles) 50 50 50Memory occupancy per access (cycles) 13 13 13
The microprocessor tomorrowThe microprocessor tomorrow When we say ldquomicroprocessorrdquo tomorrow we generally mean the shaded area of the figure
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Challenges in EducationChallenges in Education
bullChanges in curriculaChanges in curriculabullFundamentalsFundamentalsbullA sort of the challenge we should acceptA sort of the challenge we should accept
ChalengeChalengess in Education in Education
It has often said that Where you stand depends on where you sit
In this context starting from our positions an experiences this is our view concerning the theme How shall we satisfy the long-term educational needs of engineers
How to organize a training of new How to organize a training of new engineers engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then
Is the whole system of engineering education ndash not just the undergraduate curriculum ndash organized to support todayrsquos graduate for the next 40 years
We think not on both counts
Our view amp our experienceOur view amp our experience
Motorola 68000
Major architectural step in microprocessors
– First 32-bit architecture (initial 16-bit implementation)
– First flat 32-bit address; support for paging
– General-purpose register architecture, loosely based on the PDP-11
First implementation in 1979
– 68,000 transistors
– < 1 MIPS
Used in
– Apple Mac
– Sun, Silicon Graphics & Apollo workstations
Third Generation: 1985-89
Challenging the "establishment"
– Microprocessors surpass minicomputers in performance and rival mainframes
– Implementation technology of choice: all new architectures are microprocessors
– RISC architecture techniques take hold
Transistors: < 500K
Performance: > 5 MIPS
Processors
– MIPS R2000, R3000
– Sun SPARC
– HP PA-RISC
MIPS R2000
Several firsts
– First RISC microprocessor
– First microprocessor to provide integrated support for instruction & data caches
– First pipelined microprocessor (sustains 1 instruction/clock)
Implemented in 1985
– 125,000 transistors
– 5-8 MIPS
Fourth Generation: 1990-
Architectural and performance leadership
– First 64-bit architecture
– First multiple-issue machine
– First multilevel caches
Transistors: > 1M
Clock rates: > 100 MHz
Performance: > 50 MIPS
Processors
– Intel i860, Pentium, MIPS R4000, MIPS R10000, DEC Alpha, Sun UltraSPARC, HP PA-RISC, PowerPC
Generation 4.5: same basic approach, but faster clock rates & wider issue
– Alpha 21264, Pentium III & 4, Intel Itanium
Key Architectural Trends
Increase performance at 1.6x per year
– True from 1985 to the present
Combination of technology and architectural enhancements
– Technology provides faster transistors and more of them
– Faster transistors lead to higher clock rates
– More transistors: architectural ideas turn transistors into performance
– Architecture is responsible for about half the yearly performance growth
Two key architectural directions
– Sophisticated memory hierarchies
– Exploiting instruction-level parallelism
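To give a feel for what a sustained yearly gain of 1.6x compounds to, here is a minimal Python sketch. The function names are hypothetical, and reading "about half" as an equal multiplicative split between technology and architecture (each contributing the square root of 1.6 per year) is an assumption made only for illustration.

```python
import math


def cumulative_gain(rate_per_year: float, years: int) -> float:
    """Total speedup after compounding a yearly growth factor."""
    return rate_per_year ** years


def even_split(rate_per_year: float) -> float:
    """Yearly factor of each of two equal multiplicative contributors
    (assumed reading of 'architecture is responsible for about half')."""
    return math.sqrt(rate_per_year)


if __name__ == "__main__":
    for years in (5, 10, 15):
        print(years, "years:", round(cumulative_gain(1.6, years), 1), "x")
    print("per-contributor yearly factor:", round(even_split(1.6), 3))
```

Under these assumptions a decade of 1.6x yearly growth is roughly a hundredfold gain, which is why even a modest change in the yearly factor matters over a processor generation.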
Memory Hierarchies
Caches hide the latency of DRAM and increase bandwidth
– The CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches add another level of caching
  First multilevel cache: 1986
  Secondary cache sizes today: 128 KB to 4-16 MB
Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
  Early restart after a cache miss (1992)
  Nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
  Prefetching: an instruction to bring data into the cache early
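The effect of adding a second cache level can be sketched with the usual average memory access time (AMAT) calculation. This is a minimal illustration rather than a model of any particular machine: the miss rates are assumed values, and the cycle counts (2, 5 and 50) are chosen only to be of a plausible magnitude.

```python
def amat_two_level(l1_hit: float, l1_miss_rate: float,
                   l2_hit: float, l2_miss_rate: float,
                   memory: float) -> float:
    """Average access time in cycles with a primary and a secondary cache."""
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * memory)


def amat_one_level(l1_hit: float, l1_miss_rate: float, memory: float) -> float:
    """Average access time in cycles when every primary miss goes to DRAM."""
    return l1_hit + l1_miss_rate * memory


if __name__ == "__main__":
    # Assumed: 5% primary miss rate, 20% local secondary miss rate.
    print(amat_one_level(2, 0.05, 50))
    print(amat_two_level(2, 0.05, 5, 0.20, 50))
```

With these assumed numbers the secondary cache brings the average access time down from 4.5 to 2.75 cycles, which is the sense in which multilevel caches hide DRAM latency.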
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
  Superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
– Determine parallelism dynamically
– Execute instructions out of order
– Speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded a 20-year path
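The gain from overlapping execution and from issuing more than one instruction per clock can be sketched with an idealized cycle count. The sketch assumes no stalls, hazards or branch penalties, and the instruction count and pipeline depth are illustrative values.

```python
import math


def cycles_unpipelined(n_instr: int, stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n_instr * stages


def cycles_pipelined(n_instr: int, stages: int, issue_width: int = 1) -> int:
    """Ideal pipeline: fill time plus one issue group per clock, no stalls."""
    if n_instr == 0:
        return 0
    groups = math.ceil(n_instr / issue_width)
    return stages + groups - 1


if __name__ == "__main__":
    n, k = 100, 5  # assumed instruction count and pipeline depth
    print(cycles_unpipelined(n, k))    # 500
    print(cycles_pipelined(n, k))      # 104
    print(cycles_pipelined(n, k, 2))   # 54
```

Real programs fall short of these ideal figures because of dependences and mispredicted branches, which is what the dynamic schemes listed above try to work around.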
MIPS R4000
First 64-bit architecture
Integrated caches
– On-chip
– Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
Implemented in 1991
– 1.3M transistors
– 50 MIPS
Used primarily as an attached processor (e.g. graphics)
MIPS R10000
First speculative processor
– Instructions scheduled and executed out of order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in flight)
– Maintains precise state by completing instructions in order
Implemented in 1996
– 6.8M transistors
– 200 MHz
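The idea of executing out of order while completing in order can be sketched with a small reorder buffer. This is a generic illustration, not a description of the R10000's actual circuitry; the class and method names are hypothetical, and only the window size of 32 and the completion width of 4 are taken from the slide.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: instructions may finish in any order,
    but they leave the buffer strictly in program order."""

    def __init__(self, capacity: int = 32, retire_width: int = 4):
        self.capacity = capacity
        self.retire_width = retire_width
        self.entries = deque()  # oldest instruction on the left

    def dispatch(self, tag: str) -> bool:
        """Add an instruction in program order; fail if the window is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append({"tag": tag, "done": False})
        return True

    def mark_done(self, tag: str) -> None:
        """Record that an instruction finished executing (any order)."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["done"] = True
                return
        raise KeyError(tag)

    def retire(self) -> list:
        """Complete up to retire_width finished instructions from the head."""
        retired = []
        while (self.entries and self.entries[0]["done"]
               and len(retired) < self.retire_width):
            retired.append(self.entries.popleft()["tag"])
        return retired


if __name__ == "__main__":
    rob = ReorderBuffer()
    for i in range(6):
        rob.dispatch(f"i{i}")
    rob.mark_done("i2")
    rob.mark_done("i1")
    print(rob.retire())  # [] because i0 has not finished yet
    rob.mark_done("i0")
    print(rob.retire())  # ['i0', 'i1', 'i2']
```

Because nothing younger than an unfinished instruction can leave the buffer, the architectural state seen after an exception or a mispredicted branch is always that of a sequential execution.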
Intel IA-64 and Itanium
EPIC architecture
– Uses a compiler-centric approach while avoiding its disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 watts
Breakdown of tasks between compiler and runtime hardware (figure)
Today's Uniprocessor ILP Menu
Software Techniques
– Static scheduling
– Static issue (i.e. VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
  Trade-offs: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e. superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
  Trade-offs: more stable performance, higher complexity, potential clock rate impact
Wide variety of approaches, both hardware and compiler intensive
No clear-cut winners at present
Big Picture: ILP and Memory Systems
(Figure: two "mountains" of techniques.
ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation.
Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.)
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be the key limit
(Figure: performance, log scale from 0.1 to 100, versus year, 1965-1995, for supercomputers, mainframes, minicomputers and microprocessors.)
Microprocessors today: where they are and what they can do
(Figure: transistors per chip, log scale from 1,000 to 100,000,000, versus year, 1970-2005, for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000. Eras marked along the axis: bit-level parallelism, instruction-level, thread-level (?).)
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyper-Threading, is a solution.
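The motivation for SMT, filling issue slots that a single thread leaves empty, can be sketched with a toy model. Everything here is assumed for illustration: the function names, the 4-wide issue width and the per-cycle counts of ready instructions. The model ignores contention for caches and other shared resources, so its ratio overstates what real hardware delivers.

```python
def issue_slots_filled(ready_per_thread, issue_width):
    """Fill up to issue_width slots in one cycle, taking ready
    instructions from each thread in turn."""
    slots_left = issue_width
    filled = 0
    for ready in ready_per_thread:
        take = min(ready, slots_left)
        filled += take
        slots_left -= take
    return filled


def total_issued(trace, issue_width):
    """trace holds, per cycle, the ready-instruction count of each thread."""
    return sum(issue_slots_filled(cycle, issue_width) for cycle in trace)


if __name__ == "__main__":
    width = 4
    thread_a = [2, 1, 3, 0]  # invented ready counts for four cycles
    thread_b = [1, 2, 0, 2]
    alone = total_issued([[a] for a in thread_a], width)
    shared = total_issued([list(pair) for pair in zip(thread_a, thread_b)], width)
    print(alone, shared)  # 6 11
```

In this invented trace the second thread uses slots the first one could not fill, so more of the 16 available issue slots do useful work.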
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4).
They support 2-4 threads, but expect to get only a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique to obtain more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa-2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute
– independent programs
– explicitly parallel programs (when possible)
– speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
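One way to read the table is to add up the per-CPU resources of the chip multiprocessor and compare the totals with the two single-CPU designs. The sketch below does that for issue width, primary cache size and instruction window. The class and field names are hypothetical; the numbers are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    name: str
    cpus: int
    issue_width_per_cpu: int
    l1_kbytes_per_cpu: int  # size of each primary cache (I or D)
    window_per_cpu: int     # instruction window entries

    def total_issue_width(self) -> int:
        return self.cpus * self.issue_width_per_cpu

    def total_l1_kbytes(self) -> int:
        return self.cpus * self.l1_kbytes_per_cpu

    def total_window(self) -> int:
        return self.cpus * self.window_per_cpu


DESIGNS = [
    Design("Superscalar", 1, 12, 128, 256),
    Design("Simultaneous multithreading", 1, 12, 128, 256),
    Design("Chip multiprocessor", 8, 2, 16, 32),
]

if __name__ == "__main__":
    for d in DESIGNS:
        print(d.name, d.total_issue_width(), d.total_l1_kbytes(), d.total_window())
```

The totals show that the three designs are given comparable hardware budgets: the chip multiprocessor has 16 issue slots against 12, and the same 128 Kbytes of each primary cache and 256 window entries, but split across eight simple CPUs.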
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view concerning the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global cultural and business contexts, and so must understand them. They too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that Web-based teaching, distance learning, electronic books and interactive learning environments will play increasingly significant roles in shaping what we teach, how we teach and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
MIPS R2000MIPS R2000 Several firsts
ndash First RISC microprocessor
ndash First microprocessor to provide integrated support for instruction amp data cache
ndash First pipelined microprocessor (sustains 1 instructionclock)
Implemented in 1985ndash 125000 transistorsndash 5-8 MIPS
Fourth Generation 1990-Fourth Generation 1990- Architectural and performance leadership
ndash First 64-bit architecturendash First multiple-issue machinendash First multilevel caches
Transistors gt1M Clock ratesgt 100MHz Performance gt 50 MIPS Processors
ndash Intel i860 Pentium MIPS R4000 MIPS R1000 DEC Alpha Sun UltraSPARC HP PA-RISC PowerPC
Generation 45 ndash same basic approach but faster clock rates amp wider issuendash Alpha 21264 Pentium III amp 4 Intel Itanium
Key Architectural TrendsKey Architectural Trends Increase performance at 16x per year
ndash True from 1985-present Combination of technology and architectural
enhancementsndash Technology provides faster transistors and more of themndash Faster transistors leads to high clock ratesndash More transistors
Architectural ideas turn transistors into performancendash Responsible for about half the yearly performance growth
Two key architectural directionsndash Sophisticated memory hierarchiesndash Exploiting instruction level parallelism
Memory Hierarchies
Caches hide latency of DRAM and increase BW
– CPU-DRAM access gap has grown by a factor of 30-50
Trend 1: Increasingly large caches
– On-chip: from 128 bytes (1984) to 100K+ bytes
– Multilevel caches: add another level of caching
  First multilevel cache: 1986
  Secondary cache sizes today: 128 KB to 4-16 MB
Trend 2: Advances in caching techniques
– Reduce or hide cache miss latencies
  Early restart after cache miss (1992)
  Nonblocking caches: continue during a cache miss (1994)
– Cache-aware combos: computers, compilers, code writers
  Prefetching: instruction to bring data into cache early
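To make the benefit of a second cache level concrete, the sketch below uses the standard textbook average memory access time (AMAT) model, which the slides themselves do not state. The latencies (2, 5 and 50 cycles) match the superscalar column of the characteristics table later in this deck; the miss rates are purely hypothetical values chosen for illustration.

```python
def amat_single_level(l1_hit: float, l1_miss_rate: float, mem_latency: float) -> float:
    """AMAT in cycles with one cache level in front of main memory."""
    return l1_hit + l1_miss_rate * mem_latency


def amat_two_level(l1_hit: float, l1_miss_rate: float,
                   l2_hit: float, l2_miss_rate: float,
                   mem_latency: float) -> float:
    """AMAT in cycles with a secondary cache between L1 and main memory.

    l2_miss_rate is the local miss rate: the fraction of L1 misses
    that also miss in the secondary cache.
    """
    l1_miss_penalty = l2_hit + l2_miss_rate * mem_latency
    return l1_hit + l1_miss_rate * l1_miss_penalty


if __name__ == "__main__":
    # Miss rates below are hypothetical, not measurements from the slides.
    one = amat_single_level(l1_hit=2, l1_miss_rate=0.05, mem_latency=50)
    two = amat_two_level(l1_hit=2, l1_miss_rate=0.05,
                         l2_hit=5, l2_miss_rate=0.2, mem_latency=50)
    print(f"single-level AMAT: {one:.2f} cycles")
    print(f"two-level AMAT:    {two:.2f} cycles")
```

Under these assumed miss rates the secondary cache cuts the average access time from 4.5 to 2.75 cycles, which is the effect the "multilevel caches" trend refers to.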
Exploiting ILP
ILP is the implicit parallelism among instructions
Exploited by
– Overlapping execution in a pipeline
– Issuing multiple instructions per clock
  Superscalar: uses dynamic issue decision (HW driven)
  VLIW: uses static issue decision (SW driven)
1985: simple microprocessor pipeline (1 instr/clock)
1990: first static multiple-issue microprocessors
1995: sophisticated dynamic schemes
– determine parallelism dynamically
– execute instructions out-of-order
– speculative execution depending on branch prediction
"Off-the-shelf" ILP techniques yielded 20 year path
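The idea of implicit parallelism among instructions can be sketched with a small dependence analysis. The code below is an idealized model, not a description of any real processor: it assumes unit latency, unlimited issue width, and that only true (read-after-write) dependences matter, as if register renaming removed all others. The instruction encoding and register names are hypothetical.

```python
from typing import Dict, List, Tuple

# An instruction is (destination register, list of source registers).
Instr = Tuple[str, List[str]]


def schedule_levels(program: List[Instr]) -> List[int]:
    """Earliest cycle in which each instruction could issue."""
    ready_at: Dict[str, int] = {}  # register -> cycle its latest value is ready
    levels: List[int] = []
    for dest, srcs in program:
        # An instruction waits for the slowest of its source operands.
        cycle = max((ready_at.get(s, 0) for s in srcs), default=0)
        levels.append(cycle)
        ready_at[dest] = cycle + 1  # unit latency
    return levels


def available_ilp(program: List[Instr]) -> float:
    """Instructions divided by the length of the longest dependence chain."""
    if not program:
        return 0.0
    return len(program) / (max(schedule_levels(program)) + 1)


if __name__ == "__main__":
    program: List[Instr] = [
        ("r1", ["a"]),         # r1 = load a    -> cycle 0
        ("r2", ["b"]),         # r2 = load b    -> cycle 0
        ("r3", ["r1", "r2"]),  # r3 = r1 + r2   -> cycle 1
        ("r4", ["c"]),         # r4 = load c    -> cycle 0
        ("r5", ["r3", "r4"]),  # r5 = r3 * r4   -> cycle 2
        ("r6", ["r1"]),        # r6 = r1 - 1    -> cycle 1
    ]
    print(schedule_levels(program))
    print(available_ilp(program))
```

Six instructions fit into three cycles here, so this fragment offers an ILP of 2. A superscalar machine would discover that grouping in hardware at run time, while a VLIW compiler would encode it statically.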
MIPS R4000
First 64-bit architecture
Integrated caches
– On-chip
– Support for off-chip secondary cache
Integrated floating point
Implemented in 1991
– Deep pipeline
– 1.4M transistors
– Initially 100 MHz
– > 50 MIPS
Intel i860
First multiple-issue microprocessor
– 2 instructions/clock
– Dual issue mode
– Novel push pipeline
– Novel cache bypass
Implemented in 1991
– 1.3M transistors
– 50 MIPS
Used primarily as an attached processor (e.g. graphics)
MIPS R10000
First speculative processor
– Instructions scheduled and executed out-of-order
– Up to 4 instructions can complete per clock
– Window of 32 instructions (up to 32 in-flight)
– Maintains precise state by completing instructions in order
Implemented in 1996
– 6.8M transistors
– 200 MHz
Intel IA-64 and Itanium
EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
Itanium
– First implementation (2001)
– 25M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software Techniques
– Static scheduling
– Static issue (i.e. VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e. superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
Wide variety of approaches, both hardware and compiler intensive
Software techniques: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware techniques: more stable performance, higher complexity, potential clock rate impact
No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970-2005, with eras labeled bit-level parallelism, instruction-level parallelism and thread-level parallelism. Processors plotted: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000.]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to achieve higher performance (throughput) at better energy efficiency
The most prominent are
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
It is hard to achieve this level of parallelism from a single program
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyper-Threading, is a solution
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
Support for 2-4 threads, but expect to get only 1.3X improvement in throughput
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique for obtaining more performance in a power-efficient manner
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC)
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
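The claim of near-linear scaling can be explored with Amdahl's law, a standard model that the slides do not mention and that is brought in here only as an illustration. The sketch assumes a workload in which a fixed fraction of the work parallelizes perfectly across cores; the fractions used are hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Speedup over one core when `parallel_fraction` of the work scales perfectly."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


if __name__ == "__main__":
    # Hypothetical parallel fractions, evaluated for small-scale CMPs.
    for fraction in (1.0, 0.95, 0.80):
        row = ", ".join(f"{n} cores: {amdahl_speedup(fraction, n):.2f}x"
                        for n in (2, 4, 8))
        print(f"parallel fraction {fraction:.2f} -> {row}")
```

With independent programs on each core the parallel fraction is effectively 1 and scaling is linear; as the serial fraction grows, the speedup on 8 cores falls away from 8x, which is why the near-linear claim is limited to small-scale CMPs.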
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute
– Independent programs
– Explicitly parallel programs (when possible)
– Speculatively parallel threads
Special-purpose processing units (e.g. DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32,768 | 32,768 | 8 x 4,096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 Kbytes | 128 Kbytes | 16 Kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
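Several entries in the table above are given per CPU, so a quick way to compare the three designs is to multiply them out to chip-wide totals. The sketch below does only that arithmetic, using figures taken from the table; the class and field names are hypothetical, and the totals say nothing about achieved performance.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Config:
    name: str
    cpus: int
    issue_width_per_cpu: int
    threads_per_cpu: int
    l1_kbytes_per_cpu: int  # size of each of the I and D caches
    window_per_cpu: int     # instruction window entries

    def totals(self) -> Dict[str, int]:
        """Chip-wide totals obtained by multiplying per-CPU figures by CPU count."""
        return {
            "issue_slots": self.cpus * self.issue_width_per_cpu,
            "threads": self.cpus * self.threads_per_cpu,
            "l1_kbytes": self.cpus * self.l1_kbytes_per_cpu,
            "window_entries": self.cpus * self.window_per_cpu,
        }


CONFIGS = [
    Config("superscalar", cpus=1, issue_width_per_cpu=12, threads_per_cpu=1,
           l1_kbytes_per_cpu=128, window_per_cpu=256),
    Config("SMT", cpus=1, issue_width_per_cpu=12, threads_per_cpu=8,
           l1_kbytes_per_cpu=128, window_per_cpu=256),
    Config("CMP", cpus=8, issue_width_per_cpu=2, threads_per_cpu=1,
           l1_kbytes_per_cpu=16, window_per_cpu=32),
]

if __name__ == "__main__":
    for config in CONFIGS:
        print(config.name, config.totals())
```

Multiplied out, the eight small CMP cores add up to the same total first-level cache capacity and instruction window size as the single wide core, with 16 aggregate issue slots against 12.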
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit"
In this context, starting from our positions and experiences, this is our view concerning the theme "How shall we satisfy the long-term educational needs of engineers?"
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then?
Is the whole system of engineering education – not just the undergraduate curriculum – organized to support today's graduates for the next 40 years?
We think not, on both counts
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields
Changes in curricula
It is almost a cliché to talk about change – so much so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics
But, as we said earlier, engineering is changing
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. They too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that Web-based teaching, distance learning, electronic books and interactive learning environments will play increasingly significant roles in shaping what we teach, how we teach and how students learn
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time
That is the sort of challenge we should accept for improving engineering education
Key Architectural TrendsKey Architectural Trends Increase performance at 16x per year
ndash True from 1985-present Combination of technology and architectural
enhancementsndash Technology provides faster transistors and more of themndash Faster transistors leads to high clock ratesndash More transistors
Architectural ideas turn transistors into performancendash Responsible for about half the yearly performance growth
Two key architectural directionsndash Sophisticated memory hierarchiesndash Exploiting instruction level parallelism
Memory HierarchiesMemory Hierarchies Caches hide latency of DRAM and increase BW
ndash CPU-DRAM access gap has grown by a factor of 30-50 Trend 1 Increasingly large caches
ndash On-chip from 128 bytes (1984) to 100K+ bytesndash Multilevel caches add another level of caching
First multilevel cache1986 Secondary cache sizes today 128KB to 4-16 MB
Trend 2 Advances in caching techniquesndash Reduce or hide cache miss latencies
early restart after cache miss (1992) nonblocking caches continue during a cache miss (1994)
ndash Cache aware combos computers compilers code writers prefetching instruction to bring data into cache early
Exploiting ILPExploiting ILP ILP is the implicit parallelism among instructions Exploited by
ndash Overlapping execution in a pipeline ndash Issuing multiple instruction per clock
superscalar uses dynamic issue decision (HW driven) VLIW uses static issue decision (SW driven)
1985 simple microprocessor pipeline (1 instrclock) 1990 first static multiple issue microprocessors 1995 sophisticated dynamic schemes
ndash determine parallelism dynamicallyndash execute instructions out-of-orderndash speculative execution depending on branch prediction
ldquoOff-the-shelfrdquo ILP techniques yielded 20 year path
MIPS R4000MIPS R4000 First 64-bit architecture Integrated caches
ndash On-chipndash Support for off-chip secondary
cache Integrated floating point Implemented in 1991
ndash Deep pipelinendash 14M transistorsndash Initially 100MHzndash gt 50 MIPS
Intel i860Intel i860
First multiple issue microprocessor ndash 2 instructionsclockndash Dual issue mode ndash Novel push pipelinendash Novel cache bypass
Implemented in 1991ndash 13M transistorsndash 50 mips
Used primarily as attached processor (eg graphics)
MIPS R10000MIPS R10000
First speculative processorndash Instruction scheduled and
executed out-of-orderndash Up to 4 instructions can
complete per clockndash Window of 32 instructions
(up to 32 in-flight)ndash Maintain precise state by
completing instructions in order
Implemented in 1996ndash 68M transistorsndash 200 MHz
Intel IA-64 and ItaniumIntel IA-64 and Itanium
EPIC architecturendash Use compiler centric approach
while avoiding disadvantagesndash Parallelism demarcated by the
compilerndash Many special instruction amp
features for exploiting ILP in the compiler
Itaniumndash First implementation (2001) ndash 25 M transistorsndash 800 MHzndash 130 Watts
Breakdown of tasks between Breakdown of tasks between compiler and runtime hardwarecompiler and runtime hardware
Todayrsquos Uniprocessor ILP Todayrsquos Uniprocessor ILP MenuMenu
Software Techniquesndash Static schedulingndash Static issue (ie VLIW)ndash Static branch predictionndash Aliaspointer analysisndash Static speculation
Hardware Techniquesndash Dynamic schedulingndash Dynamic issue (ie superscalar)ndash Dynamic branch predictionndash Dynamic disambiguationndash Dynamic speculation
Wide variety of approaches both hardware and compiler intensive
Lower hardware complexity
More longer range analysis
More machine dependence
Lower hardware complexity
More longer range analysis
More machine dependence
More stable performance
Higher complexity
Potential clock rate impact
More stable performance
Higher complexity
Potential clock rate impact
No clear cut winners at the present
Big Picture--ILP and Memory Big Picture--ILP and Memory SystemsSystems
Simplepipelining
Scheduledpipelines
Multipleissue
Dynamicschedulin
g
Speculation
My view
bullNo performance wall but steeper slopes ahead
bullEasier territory is behind us
bullIndustry-research gap vanished
bullEnergy efficiency may be key limit
ILPMountai
n
Multilevelcaches amp buffers
Critical word amp early restart
Compilerprefetchin
g
Multipathprefetchin
g
Simplecaches
CacheMountai
n
Per
form
ance
01
1
10
100
1965 1970 1975 1980 1985 1990 1995
Supercomputers
Minicomputers
Mainframes
Microprocessors
Microprocessors today where Microprocessors today where they are and what can dothey are and what can do
Tran
sist
ors
1000
10000
100000
1000000
10000000
100000000
1970 1975 1980 1985 1990 1995 2000 2005
Bit-level parallelism Instruction-level Thread-level ()
i4004
i8008i8080
i8086
i80286
i80386
R2000
Pentium
R10000
R3000
Microprocessors where they goMicroprocessors where they go
Intel more TransistorIntel more Transistor
Intel Faster DevicesIntel Faster Devices
Number of Transistors in Number of Transistors in Intelrsquos processorsIntelrsquos processors
Higher level parallelismHigher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The more prononuced are
a) simultaneons multithreaded (SMT) processor and
b) chip multiprocessors (CMT)
MultithreadingMultithreading
Microprocessor can execute multiple operations at a time4 or 6 operations per cycle
Hard to achieve this level of parallelism from single program
Can we run multiple programs (threads) on (single) processor without much effort
Simultaneous multithreading (SMT) or Hyperthreading is a solution
Parallel Thread Sequencing ModelParallel Thread Sequencing Model
Principles of SMTPrinciples of SMT
Multithreading in todayrsquos Multithreading in todayrsquos processorsprocessors
Today many high-end microprocessors are multithreaded (eg Intel Pentium 4)
Support for 2-4 threads but expect to get only 13X improvement in throughput
Chip MultiprocessorChip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option when moving to a new process technology, such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Microprocessor
4–8 general-purpose processing engines on chip, used to execute:
– independent programs
– explicitly parallel programs (when possible)
– speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
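One way to read the table above is that the chip multiprocessor splits roughly the same on-chip resources across eight small CPUs. The short Python check below multiplies out the per-CPU figures and compares them with the superscalar column. Treating the "per CPU" and "8x" entries as per-CPU figures is an interpretation of the table, and the dictionary keys are labels made up for this sketch.

```python
# Per-CPU figures for the 8-CPU chip multiprocessor, copied from the table.
cmp_cpus = 8
cmp_per_cpu = {
    "issue width": 2,
    "threads": 1,
    "instruction window": 32,
    "branch predictor entries": 4096,
    "return stack entries": 8,
    "L1 cache (kbytes)": 16,
}

# Whole-chip figures for the 12-issue superscalar, copied from the table.
superscalar = {
    "issue width": 12,
    "threads": 1,
    "instruction window": 256,
    "branch predictor entries": 32768,
    "return stack entries": 64,
    "L1 cache (kbytes)": 128,
}

# Chip-wide totals for the CMP: eight copies of each per-CPU resource.
cmp_total = {name: cmp_cpus * value for name, value in cmp_per_cpu.items()}

for name, total in cmp_total.items():
    print(f"{name}: CMP total {total}, superscalar {superscalar[name]}")
```

Under this reading, the window, predictor, return stack, and first-level cache totals match the superscalar exactly, while the aggregate issue width (16 versus 12) and the thread count (8 versus 1) differ.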
The microprocessor tomorrow
When we say “microprocessor” tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that “Where you stand depends on where you sit.”
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education – not just the undergraduate curriculum – organized to support today’s graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change – so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now – some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future – i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. These too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that Web-based teaching, distance learning, electronic books, and interactive learning environments will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Memory Hierarchies
• Caches hide latency of DRAM and increase BW
  – CPU-DRAM access gap has grown by a factor of 30-50
• Trend 1: Increasingly large caches
  – On-chip: from 128 bytes (1984) to 100K+ bytes
  – Multilevel caches: add another level of caching
    • First multilevel cache: 1986
    • Secondary cache sizes today: 128 KB to 4-16 MB
• Trend 2: Advances in caching techniques
  – Reduce or hide cache miss latencies
    • early restart after cache miss (1992)
    • nonblocking caches: continue during a cache miss (1994)
  – Cache-aware combos: computers, compilers, code writers
    • prefetching: instruction to bring data into cache early
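The benefit of adding a second cache level can be illustrated with the standard average memory access time formula (hit time plus miss rate times miss penalty). The sketch below is only illustrative: the access times of 2, 5, and 50 cycles are assumed round figures, and both miss rates are hypothetical values, not measurements from the slides.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles for one cache level."""
    return hit_time + miss_rate * miss_penalty


L1_HIT, L2_HIT, MEMORY = 2, 5, 50   # access times in cycles (assumed)
L1_MISS_RATE = 0.05                 # hypothetical
L2_MISS_RATE = 0.20                 # hypothetical, fraction of L2 accesses

# Single-level cache: every L1 miss goes all the way to DRAM.
one_level = amat(L1_HIT, L1_MISS_RATE, MEMORY)

# Two-level cache: an L1 miss pays the average L2 access time instead.
l2_time = amat(L2_HIT, L2_MISS_RATE, MEMORY)
two_level = amat(L1_HIT, L1_MISS_RATE, l2_time)

print(f"L1 only: {one_level:.2f} cycles")
print(f"L1 + L2: {two_level:.2f} cycles")
```

With these assumed numbers the average access time drops from 4.50 to 2.75 cycles, which is the sense in which a secondary cache hides DRAM latency.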
Exploiting ILP
• ILP is the implicit parallelism among instructions
• Exploited by
  – Overlapping execution in a pipeline
  – Issuing multiple instructions per clock
    • superscalar: uses dynamic issue decision (HW driven)
    • VLIW: uses static issue decision (SW driven)
• 1985: simple microprocessor pipeline (1 instr/clock)
• 1990: first static multiple-issue microprocessors
• 1995: sophisticated dynamic schemes
  – determine parallelism dynamically
  – execute instructions out-of-order
  – speculative execution depending on branch prediction
• “Off-the-shelf” ILP techniques yielded 20 year path
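How much parallelism a multiple-issue machine can find depends on the data dependences between instructions. The following Python sketch places each instruction in the earliest cycle permitted by its read-after-write dependences and by the issue width. It is a simplified model under stated assumptions: every operation takes one cycle, false dependences are ignored (as if registers were renamed), and the six-instruction program and the function name `schedule` are invented for illustration.

```python
def schedule(instructions, issue_width):
    """Assign each instruction the earliest cycle allowed by its
    read-after-write dependences and by the issue width.

    instructions: list of (dest, sources) in program order.
    Returns one cycle number per instruction.
    Assumes every operation takes exactly one cycle.
    """
    ready_at = {}   # register -> first cycle in which its value is available
    issued = {}     # cycle -> number of instructions already issued in it
    cycles = []
    for dest, sources in instructions:
        # Inputs never written by the program are ready from cycle 0.
        cycle = max((ready_at.get(src, 0) for src in sources), default=0)
        # Skip forward past cycles whose issue slots are all taken.
        while issued.get(cycle, 0) >= issue_width:
            cycle += 1
        issued[cycle] = issued.get(cycle, 0) + 1
        ready_at[dest] = cycle + 1
        cycles.append(cycle)
    return cycles


# Hypothetical program: (destination, [sources]); a..d are inputs.
program = [
    ("r1", ["a"]),
    ("r2", ["b"]),
    ("r3", ["r1", "r2"]),
    ("r4", ["c"]),
    ("r5", ["r3", "r4"]),
    ("r6", ["d"]),
]

for width in (1, 2, 4):
    cycles = schedule(program, width)
    print(f"issue width {width}: cycles {cycles}, total {max(cycles) + 1}")
```

In this example, going from one instruction per clock to two cuts the total from six cycles to three, but a width of four gives no further gain because the chain r1 to r3 to r5 limits the available parallelism.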
MIPS R4000
• First 64-bit architecture
• Integrated caches
  – On-chip
  – Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  – Deep pipeline
  – 1.4M transistors
  – Initially 100 MHz
  – > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  – 2 instructions/clock
  – Dual issue mode
  – Novel push pipeline
  – Novel cache bypass
• Implemented in 1991
  – 1.3M transistors
  – 50 MIPS
• Used primarily as attached processor (e.g., graphics)
MIPS R10000
• First speculative processor
  – Instructions scheduled and executed out-of-order
  – Up to 4 instructions can complete per clock
  – Window of 32 instructions (up to 32 in-flight)
  – Maintain precise state by completing instructions in order
• Implemented in 1996
  – 6.8M transistors
  – 200 MHz
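The combination of out-of-order execution and in-order completion can be sketched with a small reorder-buffer model. This is a generic textbook-style illustration, not the R10000's actual mechanism: the function name, the instruction names, and the finish order are hypothetical, and only the window size of 32 comes from the slide.

```python
from collections import deque


def retire_steps(program, finish_order, window=32):
    """Show which instructions retire after each execution finishes.

    program: instruction names in program order.
    finish_order: the same names in the order execution finishes.
    window: maximum number of instructions in flight.
    Returns one list of retired names per finished instruction.
    """
    in_flight = deque(program[:window])   # reorder buffer, oldest at the left
    waiting = deque(program[window:])     # not yet admitted to the window
    done = set()
    steps = []
    for name in finish_order:
        if name not in in_flight:
            raise ValueError(f"{name} finished before entering the window")
        done.add(name)
        retired_now = []
        # Retire only from the head, so state changes follow program order.
        while in_flight and in_flight[0] in done:
            retired_now.append(in_flight.popleft())
            if waiting:
                in_flight.append(waiting.popleft())
        steps.append(retired_now)
    return steps


program = ["i0", "i1", "i2", "i3"]
finish_order = ["i2", "i0", "i1", "i3"]   # hypothetical out-of-order finish

for finished, retired in zip(finish_order, retire_steps(program, finish_order)):
    print(f"{finished} finishes -> retires {retired}")
```

Here i2 finishes first but cannot retire until i0 and i1 are done, so an exception raised by i0 or i1 would still see a machine state in which i2 has not yet taken effect.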
Intel IA-64 and Itanium
• EPIC architecture
  – Use compiler-centric approach while avoiding disadvantages
  – Parallelism demarcated by the compiler
  – Many special instructions & features for exploiting ILP in the compiler
• Itanium
  – First implementation (2001)
  – 25M transistors
  – 800 MHz
  – 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today’s Uniprocessor ILP Menu
• Software Techniques
  – Static scheduling
  – Static issue (i.e., VLIW)
  – Static branch prediction
  – Alias/pointer analysis
  – Static speculation
• Hardware Techniques
  – Dynamic scheduling
  – Dynamic issue (i.e., superscalar)
  – Dynamic branch prediction
  – Dynamic disambiguation
  – Dynamic speculation
• Wide variety of approaches, both hardware and compiler intensive
  – Software techniques: lower hardware complexity, more longer-range analysis, more machine dependence
  – Hardware techniques: more stable performance, higher complexity, potential clock rate impact
• No clear-cut winners at present
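As one concrete instance of the "dynamic branch prediction" item in the hardware list, the sketch below models a single two-bit saturating counter, a common textbook scheme. It is not claimed to be the predictor used by any processor named in these slides, and the loop-branch trace is a made-up example.

```python
def two_bit_predictor(outcomes, state=0):
    """Predict each branch outcome with one 2-bit saturating counter.

    outcomes: sequence of booleans, True meaning the branch was taken.
    state: initial counter value, 0..3; values 2 and 3 predict taken.
    Returns the number of correct predictions.
    """
    correct = 0
    for taken in outcomes:
        prediction = state >= 2
        if prediction == taken:
            correct += 1
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct


# Hypothetical loop branch: taken 9 times, then not taken, executed twice.
trace = ([True] * 9 + [False]) * 2

hits = two_bit_predictor(trace)
print(f"{hits} of {len(trace)} predictions correct")
```

After a short warm-up the counter mispredicts only the loop exit, because a single not-taken outcome moves it from 3 to 2 and it still predicts taken when the loop is entered again.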
Big Picture--ILP and Memory Systems
[Figure: “ILP Mountain”, with labels: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
MIPS R4000
• First 64-bit architecture
• Integrated caches
  - On-chip
  - Support for off-chip secondary cache
• Integrated floating point
• Implemented in 1991
  - Deep pipeline
  - 1.4M transistors
  - Initially 100 MHz
  - > 50 MIPS
Intel i860
• First multiple-issue microprocessor
  - 2 instructions/clock
  - Dual issue mode
  - Novel push pipeline
  - Novel cache bypass
• Implemented in 1991
  - 1.3M transistors
  - 50 MIPS
• Used primarily as an attached processor (e.g. graphics)
MIPS R10000
• First speculative processor
  - Instructions scheduled and executed out of order
  - Up to 4 instructions can complete per clock
  - Window of 32 instructions (up to 32 in flight)
  - Maintains precise state by completing instructions in order
• Implemented in 1996
  - 6.8M transistors
  - 200 MHz
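The idea of executing out of order while completing in order can be illustrated with a toy reorder window. This is a minimal sketch, not a model of the actual R10000 hardware: the function name commit_order, its parameters, and the assumption that every instruction's finish cycle is known in advance are all hypothetical simplifications. Only the window size (32) and the completion width (4 per clock) are taken from the slide.

```python
from collections import deque


def commit_order(program, finish_cycle, window=32, commit_width=4):
    """Return {instruction: commit cycle} under in-order completion.

    program      -- instruction names in program order
    finish_cycle -- {name: cycle at which execution finishes}; instructions
                    may finish in any order (out-of-order execution)
    window       -- maximum number of in-flight instructions (assumed 32)
    commit_width -- maximum completions per clock (assumed 4)
    """
    pending = deque(program)   # not yet in the window
    in_flight = deque()        # the window, kept in program order
    committed = {}
    cycle = 0
    while pending or in_flight:
        cycle += 1
        # Fill the window in program order (toy model: no dispatch limit).
        while pending and len(in_flight) < window:
            in_flight.append(pending.popleft())
        # Only the oldest instruction may complete, so architectural state
        # is always updated in program order (precise state).
        done_this_cycle = 0
        while (in_flight and done_this_cycle < commit_width
               and finish_cycle[in_flight[0]] <= cycle):
            committed[in_flight.popleft()] = cycle
            done_this_cycle += 1
    return committed


program = ["load", "add", "mul", "store"]
finish = {"load": 5, "add": 2, "mul": 3, "store": 4}   # load is the slowest
print(commit_order(program, finish))
print(commit_order(program, finish, commit_width=2))
```

In this example the add, mul and store finish executing before the load, yet none of them completes until the load at the head of the window has finished in cycle 5. With a completion width of 2, the last two instructions slip to cycle 6.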
Intel IA-64 and Itanium
• EPIC architecture
  - Uses a compiler-centric approach while avoiding its disadvantages
  - Parallelism demarcated by the compiler
  - Many special instructions & features for exploiting ILP in the compiler
• Itanium
  - First implementation (2001)
  - 25M transistors
  - 800 MHz
  - 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
• Software Techniques
  - Static scheduling
  - Static issue (i.e. VLIW)
  - Static branch prediction
  - Alias/pointer analysis
  - Static speculation
• Hardware Techniques
  - Dynamic scheduling
  - Dynamic issue (i.e. superscalar)
  - Dynamic branch prediction
  - Dynamic disambiguation
  - Dynamic speculation
• Wide variety of approaches, both hardware and compiler intensive
• Software techniques: lower hardware complexity, more longer-range analysis, more machine dependence
• Hardware techniques: more stable performance, higher complexity, potential clock rate impact
• No clear-cut winners at present
Big Picture: ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970 to 2005, for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium and R10000, spanning the eras of bit-level, instruction-level and thread-level (?) parallelism.]
Microprocessors: where they go
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to achieve higher throughput at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
• A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle
• It is hard to achieve this level of parallelism from a single program
• Can we run multiple programs (threads) on a single processor without much effort?
• Simultaneous multithreading (SMT), or Hyper-Threading, is a solution
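Why sharing issue slots between threads helps can be seen with a small back-of-the-envelope model. The sketch below is an illustration under strong assumptions, not a description of any real processor: the function issue_utilization and the per-cycle "ready operation" counts are invented for the example, and contention for caches and other shared resources is ignored.

```python
def issue_utilization(ready_ops_per_thread, issue_width):
    """Fraction of issue slots filled over the simulated cycles.

    ready_ops_per_thread -- one list per thread; element c is the number of
                            independent operations that thread has ready in
                            cycle c (hypothetical input data)
    issue_width          -- operations the processor can issue per cycle
    """
    cycles = len(ready_ops_per_thread[0])
    issued = 0
    for c in range(cycles):
        # All threads compete for the same issue slots in every cycle.
        total_ready = sum(thread[c] for thread in ready_ops_per_thread)
        issued += min(total_ready, issue_width)
    return issued / (cycles * issue_width)


thread_a = [2, 1, 3, 0]   # ready operations per cycle, made-up numbers
thread_b = [1, 2, 0, 3]

print(issue_utilization([thread_a], issue_width=4))            # one thread
print(issue_utilization([thread_a, thread_b], issue_width=4))  # two threads
```

With these made-up numbers a single thread fills 6 of 16 issue slots, while two threads together fill 12 of 16. Real designs gain less than this idealized model suggests, because the threads also compete for caches, registers and other shared resources.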
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
• Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4)
• Support for 2-4 threads, but expect only about a 1.3X improvement in throughput
Chip Multiprocessor
• Several processor cores on one die
• Shared L2 caches
• Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
• CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner
• The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC)
• The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) platform model (continued)
• CMP is an attractive option when moving to a new process technology such as SoC
• Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
• MPSoCs are usually implemented as heterogeneous systems
• CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors
Generic circa 2010 Microprocessor
• 4 to 8 general-purpose processing engines on chip, used to execute
  - Independent programs
  - Explicitly parallel programs (when possible)
  - Speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic                                          | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs                                          | 1           | 1                           | 8
CPU issue width                                         | 12          | 12                          | 2 per CPU
Number of threads                                       | 1           | 8                           | 1 per CPU
Architecture registers (for integer and floating point) | 32          | 32 per thread               | 32 per CPU
Physical registers (for integer and floating point)     | 32 + 256    | 256 + 256                   | 32 + 32 per CPU
Instruction window size                                 | 256         | 256                         | 32 per CPU
Branch predictor table size (entries)                   | 32768       | 32768                       | 8x4096
Return stack size                                       | 64 entries  | 64 entries                  | 8x8 entries
Instruction (I) and data (D) cache organization         | 1x8 banks   | 1x8 banks                   | 1 bank
I and D cache sizes                                     | 128 kbytes  | 128 kbytes                  | 16 kbytes per CPU
I and D cache associativities                           | 4-way       | 4-way                       | 4-way
I and D cache line sizes (bytes)                        | 32          | 32                          | 32
I and D cache access times (cycles)                     | 2           | 2                           | 1
Secondary cache organization                            | 1x8 banks   | 1x8 banks                   | 1x8 banks
Secondary cache size (Mbytes)                           | 8           | 8                           | 8
Secondary cache associativity                           | 4-way       | 4-way                       | 4-way
Secondary cache line size (bytes)                       | 32          | 32                          | 32
Secondary cache access time (cycles)                    | 5           | 5                           | 7
Secondary cache occupancy per access (cycles)           | 1           | 1                           | 1
Memory organization (no. of banks)                      | 4           | 4                           | 4
Memory access time (cycles)                             | 50          | 50                          | 50
Memory occupancy per access (cycles)                    | 13          | 13                          | 13
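A quick way to read the table is to add up each design's resources over the whole chip. The sketch below does that arithmetic for a few rows. The dictionary keys and the helper chip_totals are hypothetical names chosen for this illustration; the numbers themselves are copied from the table.

```python
# Per-CPU values copied from the table; key names are illustrative only.
configs = {
    "superscalar": {"cpus": 1, "issue_width": 12, "threads_per_cpu": 1,
                    "l1_kbytes": 128, "predictor_entries": 32768},
    "smt":         {"cpus": 1, "issue_width": 12, "threads_per_cpu": 8,
                    "l1_kbytes": 128, "predictor_entries": 32768},
    "cmp":         {"cpus": 8, "issue_width": 2, "threads_per_cpu": 1,
                    "l1_kbytes": 16, "predictor_entries": 4096},
}


def chip_totals(cfg):
    """Multiply each per-CPU resource by the number of CPUs on the chip."""
    n = cfg["cpus"]
    return {
        "peak_issue_per_cycle": n * cfg["issue_width"],
        "hardware_threads": n * cfg["threads_per_cpu"],
        "l1_kbytes_each_of_i_and_d": n * cfg["l1_kbytes"],
        "predictor_entries": n * cfg["predictor_entries"],
    }


for name, cfg in configs.items():
    print(name, chip_totals(cfg))
```

Summed over the chip, the three designs have the same first-level cache capacity (128 kbytes each for instructions and data) and the same number of branch predictor entries (32768). They differ mainly in how the issue slots and threads are organized: one 12-wide CPU with one thread, one 12-wide CPU with eight threads, or eight 2-wide CPUs with one thread each.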
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now: some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. These two are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that Web-based teaching, distance learning, electronic books and interactive learning environments will play increasingly significant roles in shaping what we teach, how we teach and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
MIPS R10000MIPS R10000
First speculative processorndash Instruction scheduled and
executed out-of-orderndash Up to 4 instructions can
complete per clockndash Window of 32 instructions
(up to 32 in-flight)ndash Maintain precise state by
completing instructions in order
Implemented in 1996ndash 68M transistorsndash 200 MHz
Intel IA-64 and ItaniumIntel IA-64 and Itanium
EPIC architecturendash Use compiler centric approach
while avoiding disadvantagesndash Parallelism demarcated by the
compilerndash Many special instruction amp
features for exploiting ILP in the compiler
Itaniumndash First implementation (2001) ndash 25 M transistorsndash 800 MHzndash 130 Watts
Breakdown of tasks between Breakdown of tasks between compiler and runtime hardwarecompiler and runtime hardware
Todayrsquos Uniprocessor ILP Todayrsquos Uniprocessor ILP MenuMenu
Software Techniquesndash Static schedulingndash Static issue (ie VLIW)ndash Static branch predictionndash Aliaspointer analysisndash Static speculation
Hardware Techniquesndash Dynamic schedulingndash Dynamic issue (ie superscalar)ndash Dynamic branch predictionndash Dynamic disambiguationndash Dynamic speculation
Wide variety of approaches both hardware and compiler intensive
Lower hardware complexity
More longer range analysis
More machine dependence
Lower hardware complexity
More longer range analysis
More machine dependence
More stable performance
Higher complexity
Potential clock rate impact
More stable performance
Higher complexity
Potential clock rate impact
No clear cut winners at the present
Big Picture--ILP and Memory Big Picture--ILP and Memory SystemsSystems
Simplepipelining
Scheduledpipelines
Multipleissue
Dynamicschedulin
g
Speculation
My view
bullNo performance wall but steeper slopes ahead
bullEasier territory is behind us
bullIndustry-research gap vanished
bullEnergy efficiency may be key limit
ILPMountai
n
Multilevelcaches amp buffers
Critical word amp early restart
Compilerprefetchin
g
Multipathprefetchin
g
Simplecaches
CacheMountai
n
Per
form
ance
01
1
10
100
1965 1970 1975 1980 1985 1990 1995
Supercomputers
Minicomputers
Mainframes
Microprocessors
Microprocessors today where Microprocessors today where they are and what can dothey are and what can do
Tran
sist
ors
1000
10000
100000
1000000
10000000
100000000
1970 1975 1980 1985 1990 1995 2000 2005
Bit-level parallelism Instruction-level Thread-level ()
i4004
i8008i8080
i8086
i80286
i80386
R2000
Pentium
R10000
R3000
Microprocessors where they goMicroprocessors where they go
Intel more TransistorIntel more Transistor
Intel Faster DevicesIntel Faster Devices
Number of Transistors in Number of Transistors in Intelrsquos processorsIntelrsquos processors
Higher level parallelismHigher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The more prononuced are
a) simultaneons multithreaded (SMT) processor and
b) chip multiprocessors (CMT)
MultithreadingMultithreading
Microprocessor can execute multiple operations at a time4 or 6 operations per cycle
Hard to achieve this level of parallelism from single program
Can we run multiple programs (threads) on (single) processor without much effort
Simultaneous multithreading (SMT) or Hyperthreading is a solution
Parallel Thread Sequencing ModelParallel Thread Sequencing Model
Principles of SMTPrinciples of SMT
Multithreading in todayrsquos Multithreading in todayrsquos processorsprocessors
Today many high-end microprocessors are multithreaded (eg Intel Pentium 4)
Support for 2-4 threads but expect to get only 13X improvement in throughput
Chip MultiprocessorChip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model
CMP is a simple very powerful techiniques to obtain more performance in a power-effecient manner
The idea is to put several microprocessors on a single die This type of architecture is reffered also as Multiprocessor System-on-Chip (MPSoC)
The performance of small-scale CMP scales close to linear with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model - continue - continue
CMP is an atractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications we meet in network processors multimedia hubs signal processors etc
MPSoCs are usually implemented as heterogenous systems
CMT and SMT can coexist-a CMP die can integrate several SMT microprocessors
Generic circa 2010 Generic circa 2010 MicroprocessorMicroprocessor
4 ndash 8 general-purpose processing engines on chip used to execute independent programs
Explicitly parallel programs (when possible) Speculatively parallel threads
Special-purpose processing units (eg DSP functionality)
Elaborate memory hierarchyElaborate inter-chip communication facilities
Characteristics of superscalar simultaneous Characteristics of superscalar simultaneous multithreading and chip multiprocessor architecturesmultithreading and chip multiprocessor architectures
Characteristic SuperscalarSimultaneousmultithreading
Chipmultiprocessor
Number of CPUs 1 1 8CPU issue width 12 12 2 per CPUNumber of threads 1 8 1 per CPUArchitecture registers (for integer and floating point) 32 32 per thread 32 per CPUPhysical registers (for integer and floating point 32 + 256 256 + 256 32 + 32 per CPUInstruction window size 256 256 32 per CPUBranch predictor table size (entries) 32768 32768 8x4096Return stack size 64 entries 64 entries 8x8 entriesInstruction (I) and data (D) cache organization 1x8 banks 1x8 banks 1 bankI and D cache sizes 128 kbytes 128 kbytes 16 kbytes per CPUI and D cache associativities 4-way 4-way 4-wayI and 0 cache line sizes (bytes) 32 32 32I and P cache access times (cycles) 2 2 1Secondary cache organization (Mbytes) 1x8 banks 1x8 banks 1x8 banksSecondary cache size (bytes) 8 8 8Secondary cache associativity 4-way 4-way 4-waySecondary cache line size (bytes) 32 32 32Secondary cache access time (cycles) 5 5 7Secondary cache occupancy per access (cycles) 1 1 1Memory organization (no of banks) 4 4 4Memory access time (cycles) 50 50 50Memory occupancy per access (cycles) 13 13 13
The microprocessor tomorrowThe microprocessor tomorrow When we say ldquomicroprocessorrdquo tomorrow we generally mean the shaded area of the figure
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Challenges in EducationChallenges in Education
bullChanges in curriculaChanges in curriculabullFundamentalsFundamentalsbullA sort of the challenge we should acceptA sort of the challenge we should accept
ChalengeChalengess in Education in Education
It has often said that Where you stand depends on where you sit
In this context starting from our positions an experiences this is our view concerning the theme How shall we satisfy the long-term educational needs of engineers
How to organize a training of new How to organize a training of new engineers engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then
Is the whole system of engineering education ndash not just the undergraduate curriculum ndash organized to support todayrsquos graduate for the next 40 years
We think not on both counts
Our view amp our experienceOur view amp our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academy and industry) which admittedly has changed more rapidly than same other fields
Changes in curriculaChanges in curricula
It is almost a clicheacute to talk about change ndash so mach so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentalsWhat are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we What kinds of fundamentals we need now ndash some examplesneed now ndash some examples
Information technology (IT) will be embedded in virtually engineered product and process in the future ndash ie the design space for all engineers will include ITDiscrete mathematics not continuous math is the underpinning of ITIt is a new fundamental
Biological materials and process are a bit behind IT in their impact on engineering but they a closing fastThus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. Broader knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. They too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching
• distance learning
• electronic books and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Intel IA-64 and Itanium
EPIC architecture
– Use compiler-centric approach while avoiding disadvantages
– Parallelism demarcated by the compiler
– Many special instructions & features for exploiting ILP in the compiler
Itanium
– First implementation (2001)
– 25 M transistors
– 800 MHz
– 130 Watts
Breakdown of tasks between compiler and runtime hardware
Today's Uniprocessor ILP Menu
Software Techniques
– Static scheduling
– Static issue (i.e. VLIW)
– Static branch prediction
– Alias/pointer analysis
– Static speculation
Trade-offs: lower hardware complexity, more longer-range analysis, more machine dependence
Hardware Techniques
– Dynamic scheduling
– Dynamic issue (i.e. superscalar)
– Dynamic branch prediction
– Dynamic disambiguation
– Dynamic speculation
Trade-offs: more stable performance, higher complexity, potential clock rate impact
Wide variety of approaches, both hardware and compiler intensive
No clear-cut winners at present
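To make the static versus dynamic branch prediction entries on the menu concrete, here is a minimal sketch, not taken from the slides. It compares a static "always taken" guess with one 2-bit saturating counter, a common textbook form of dynamic prediction. The branch trace and the function names are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: the trace and both function names are assumptions,
# not something defined in the slides.

def static_always_taken(trace):
    """Count correct predictions when every branch is predicted taken."""
    return sum(1 for taken in trace if taken)


def dynamic_two_bit(trace, state=2):
    """Count correct predictions made by one 2-bit saturating counter.

    States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    """
    correct = 0
    for taken in trace:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            correct += 1
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct


# Hypothetical behaviour of one branch: three loop runs (taken 9 times, then
# one exit), after which the branch is never taken for 20 executions.
trace = ([True] * 9 + [False]) * 3 + [False] * 20

print("static :", static_always_taken(trace), "of", len(trace))
print("dynamic:", dynamic_two_bit(trace), "of", len(trace))
```

On this made-up trace the counter adapts once the branch behaviour changes, while the static guess keeps mispredicting. That is the "more stable performance" side of the trade-off, bought with extra hardware state.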
Big Picture: ILP and Memory Systems
[Figure: two "mountains" of techniques. ILP Mountain: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation. Cache Mountain: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching.]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be key limit
[Figure: performance (log scale, 0.1 to 100) versus year (1965 to 1995) for supercomputers, mainframes, minicomputers and microprocessors.]
Microprocessors today: where they are and what they can do
[Figure: transistor count (log scale, 1,000 to 100,000,000) versus year (1970 to 2005), divided into eras of bit-level, instruction-level and thread-level (?) parallelism. Labeled processors: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000.]
Microprocessors: where they are going
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors and
b) chip multiprocessors (CMP)
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyper-Threading, is a solution.
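The argument above can be illustrated with a toy issue-slot model. This is a sketch under stated assumptions, not a simulation of any real processor. Each thread is described only by how many operations it has ready in each cycle, and an SMT core is assumed to fill its issue slots from the combined pool. The per-cycle numbers, `ISSUE_WIDTH` and the function names are all hypothetical.

```python
# Toy model (an assumption for illustration): operations that do not issue in
# a cycle are simply dropped rather than carried over, and threads never
# interfere with each other through caches or other shared resources.

ISSUE_WIDTH = 4  # operations per cycle; the slide mentions 4 or 6


def issued_per_cycle(ready_counts, width=ISSUE_WIDTH):
    """Operations actually issued each cycle, capped by the issue width."""
    return [min(ready, width) for ready in ready_counts]


def utilization(ready_counts, width=ISSUE_WIDTH):
    """Fraction of all issue slots that were filled."""
    issued = issued_per_cycle(ready_counts, width)
    return sum(issued) / (width * len(issued))


# Hypothetical ready-operation counts for two independent threads.
thread_a = [1, 2, 0, 3, 1, 0, 2, 1]
thread_b = [2, 0, 1, 1, 3, 2, 0, 1]

# Under SMT both threads compete for the same issue slots in every cycle.
smt = [a + b for a, b in zip(thread_a, thread_b)]

print("thread A alone:", utilization(thread_a))
print("thread B alone:", utilization(thread_b))
print("A and B on SMT:", utilization(smt))
```

In this idealized model the second thread fills slots the first leaves empty. Real gains are smaller, because the threads also compete for caches and other shared resources.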
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g. Intel Pentium 4).
They support 2-4 threads, but expect only about a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build multichip modules with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa-2010 Microprocessor
• 4–8 general-purpose processing engines on chip, used to execute:
– independent programs
– explicitly parallel programs (when possible)
– speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
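A quick way to read the table is to multiply the per-CPU figures by the number of CPUs and compare chip-wide totals. The sketch below does only that arithmetic. The numbers are copied from the table, while the dictionary layout, the key names and the `chip_totals` helper are our own illustrative assumptions.

```python
# Per-CPU values copied from the comparison table; the data structure and the
# aggregation are illustrative assumptions, not part of the original study.
ARCHITECTURES = {
    "superscalar": {"cpus": 1, "issue_width": 12, "l1_kbytes": 128,
                    "predictor_entries": 32768},
    "smt":         {"cpus": 1, "issue_width": 12, "l1_kbytes": 128,
                    "predictor_entries": 32768},
    "cmp":         {"cpus": 8, "issue_width": 2, "l1_kbytes": 16,
                    "predictor_entries": 4096},
}


def chip_totals(config):
    """Scale each per-CPU resource by the number of CPUs on the chip."""
    cpus = config["cpus"]
    return {
        "issue_width": cpus * config["issue_width"],
        "l1_kbytes": cpus * config["l1_kbytes"],
        "predictor_entries": cpus * config["predictor_entries"],
    }


for name, config in ARCHITECTURES.items():
    print(name, chip_totals(config))
```

Seen this way, the three designs are given comparable chip-wide first-level cache and branch-predictor budgets. The CMP splits them across eight narrow cores, for a slightly higher aggregate issue width (16 versus 12).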
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges – Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that where you stand depends on where you sit.
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Big Picture: ILP and Memory Systems
[Figure: "ILP Mountain", with labels: simple pipelining, scheduled pipelines, multiple issue, dynamic scheduling, speculation]
[Figure: "Cache Mountain", with labels: simple caches, multilevel caches & buffers, critical word & early restart, compiler prefetching, multipath prefetching]
My view
• No performance wall, but steeper slopes ahead
• Easier territory is behind us
• Industry-research gap vanished
• Energy efficiency may be the key limit
[Figure: performance (log scale, 0.1 to 100) versus year, 1965-1995, for supercomputers, mainframes, minicomputers, and microprocessors]
Microprocessors today: where they are and what they can do
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970-2005, spanning the eras of bit-level, instruction-level, and thread-level parallelism; processors plotted: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]
Microprocessors: where they are going
Intel: more transistors
Intel: faster devices
Number of transistors in Intel's processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to achieve higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP).
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to extract this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyper-Threading, is one solution.
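The slot-filling idea behind SMT can be sketched with a toy model. This is only an illustration under stated assumptions: the issue width of 4 matches the slide, but the per-cycle counts of ready operations for the two threads, and the function names `issued_per_cycle` and `utilization`, are hypothetical and not taken from any real processor or from the original slides.

```python
def issued_per_cycle(ready_ops, issue_width):
    """Return the number of operations issued in each cycle.

    ready_ops is a list of threads; each thread is a list giving how many
    operations that thread has ready in each cycle (hypothetical data).
    Issue slots are filled thread by thread until the width is exhausted.
    """
    cycles = len(ready_ops[0])
    issued = []
    for cycle in range(cycles):
        free_slots = issue_width
        used = 0
        for thread in ready_ops:
            take = min(thread[cycle], free_slots)
            used += take
            free_slots -= take
        issued.append(used)
    return issued


def utilization(issued, issue_width):
    """Fraction of all issue slots that were actually used."""
    return sum(issued) / (issue_width * len(issued))


ISSUE_WIDTH = 4                      # "4 or 6 operations per cycle"
thread_a = [2, 1, 0, 3, 1, 2]        # assumed ready operations per cycle
thread_b = [1, 2, 3, 0, 2, 1]        # assumed ready operations per cycle

single = issued_per_cycle([thread_a], ISSUE_WIDTH)
smt = issued_per_cycle([thread_a, thread_b], ISSUE_WIDTH)

print("single thread:", single, utilization(single, ISSUE_WIDTH))
print("two threads:  ", smt, utilization(smt, ISSUE_WIDTH))
```

With these assumed inputs a single thread leaves most issue slots empty, while two threads sharing the same slots fill more of them. Real SMT gains depend on resource contention that this model ignores.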
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
They support 2-4 threads, but one should expect only about a 1.3x improvement in throughput.
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip-to-chip communication to build a multichip module with many CMPs plus memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa-2010 Microprocessor
4-8 general-purpose processing engines on chip, used to execute:
independent programs
explicitly parallel programs (when possible)
speculatively parallel threads
Special-purpose processing units (e.g., DSP functionality)
Elaborate memory hierarchy
Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8x4096
Return stack size | 64 entries | 64 entries | 8x8 entries
Instruction (I) and data (D) cache organization | 1x8 banks | 1x8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1x8 banks | 1x8 banks | 1x8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
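One way to read the table above is to multiply the per-CPU figures of the chip multiprocessor by its 8 CPUs and compare the totals with the single wide superscalar core. The short sketch below does that arithmetic. The numbers are copied from the table; the dictionary layout and the names `SUPERSCALAR`, `CMP_PER_CPU`, and `cmp_totals` are illustrative assumptions.

```python
# Values copied from the table; names and layout are illustrative only.
SUPERSCALAR = {
    "issue_width": 12,
    "instruction_window": 256,
    "branch_predictor_entries": 32768,
    "return_stack_entries": 64,
    "l1_cache_kbytes": 128,          # each of the I and D caches
}

CMP_CPUS = 8
CMP_PER_CPU = {
    "issue_width": 2,
    "instruction_window": 32,
    "branch_predictor_entries": 4096,
    "return_stack_entries": 8,
    "l1_cache_kbytes": 16,           # each of the I and D caches, per CPU
}


def cmp_totals(per_cpu, cpus):
    """Scale every per-CPU resource by the number of CPUs on the die."""
    return {name: value * cpus for name, value in per_cpu.items()}


totals = cmp_totals(CMP_PER_CPU, CMP_CPUS)
for name in SUPERSCALAR:
    print(f"{name}: superscalar={SUPERSCALAR[name]}, CMP total={totals[name]}")
```

On these figures the aggregate window, predictor, return stack, and first-level cache capacities of the chip multiprocessor equal those of the superscalar design, and only the aggregate issue width differs (16 versus 12). This suggests the table compares designs with broadly similar on-chip resources, partitioned differently.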
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we will generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges: Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline: Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view on the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and in industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals do we need now? Some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT.
Discrete mathematics, not continuous mathematics, is the underpinning of IT.
It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast.
Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields.
More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context.
The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them.
These, too, are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching,
• distance learning,
• electronic books, and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he had never seen a process that could not be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Intel more TransistorIntel more Transistor
Intel Faster DevicesIntel Faster Devices
Number of Transistors in Number of Transistors in Intelrsquos processorsIntelrsquos processors
Higher level parallelismHigher level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to exploit higher performance (throughput) at better energy efficiency
The more prononuced are
a) simultaneons multithreaded (SMT) processor and
b) chip multiprocessors (CMT)
MultithreadingMultithreading
Microprocessor can execute multiple operations at a time4 or 6 operations per cycle
Hard to achieve this level of parallelism from single program
Can we run multiple programs (threads) on (single) processor without much effort
Simultaneous multithreading (SMT) or Hyperthreading is a solution
Parallel Thread Sequencing ModelParallel Thread Sequencing Model
Principles of SMTPrinciples of SMT
Multithreading in todayrsquos Multithreading in todayrsquos processorsprocessors
Today many high-end microprocessors are multithreaded (eg Intel Pentium 4)
Support for 2-4 threads but expect to get only 13X improvement in throughput
Chip MultiprocessorChip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model
CMP is a simple very powerful techiniques to obtain more performance in a power-effecient manner
The idea is to put several microprocessors on a single die This type of architecture is reffered also as Multiprocessor System-on-Chip (MPSoC)
The performance of small-scale CMP scales close to linear with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model - continue - continue
CMP is an atractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications we meet in network processors multimedia hubs signal processors etc
MPSoCs are usually implemented as heterogenous systems
CMT and SMT can coexist-a CMP die can integrate several SMT microprocessors
Generic circa 2010 Generic circa 2010 MicroprocessorMicroprocessor
4 ndash 8 general-purpose processing engines on chip used to execute independent programs
Explicitly parallel programs (when possible) Speculatively parallel threads
Special-purpose processing units (eg DSP functionality)
Elaborate memory hierarchyElaborate inter-chip communication facilities
Characteristics of superscalar simultaneous Characteristics of superscalar simultaneous multithreading and chip multiprocessor architecturesmultithreading and chip multiprocessor architectures
Characteristic SuperscalarSimultaneousmultithreading
Chipmultiprocessor
Number of CPUs 1 1 8CPU issue width 12 12 2 per CPUNumber of threads 1 8 1 per CPUArchitecture registers (for integer and floating point) 32 32 per thread 32 per CPUPhysical registers (for integer and floating point 32 + 256 256 + 256 32 + 32 per CPUInstruction window size 256 256 32 per CPUBranch predictor table size (entries) 32768 32768 8x4096Return stack size 64 entries 64 entries 8x8 entriesInstruction (I) and data (D) cache organization 1x8 banks 1x8 banks 1 bankI and D cache sizes 128 kbytes 128 kbytes 16 kbytes per CPUI and D cache associativities 4-way 4-way 4-wayI and 0 cache line sizes (bytes) 32 32 32I and P cache access times (cycles) 2 2 1Secondary cache organization (Mbytes) 1x8 banks 1x8 banks 1x8 banksSecondary cache size (bytes) 8 8 8Secondary cache associativity 4-way 4-way 4-waySecondary cache line size (bytes) 32 32 32Secondary cache access time (cycles) 5 5 7Secondary cache occupancy per access (cycles) 1 1 1Memory organization (no of banks) 4 4 4Memory access time (cycles) 50 50 50Memory occupancy per access (cycles) 13 13 13
The microprocessor tomorrowThe microprocessor tomorrow When we say ldquomicroprocessorrdquo tomorrow we generally mean the shaded area of the figure
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors’ Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that “where you stand depends on where you sit.”
In this context, starting from our positions and experiences, this is our view on the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education, not just the undergraduate curriculum, organized to support today’s graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experience is primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT.
Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast.
Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields.
More knowledge of the full spectrum will be fundamental.
Engineering is global and is performed in a holistic business context.
The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them.
They too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current, cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching,
• distance learning,
• electronic books, and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he had never seen a process that could not be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Intel Faster Devices
Number of Transistors in Intel’s processors
Higher-level parallelism
Several approaches have been proposed to go beyond optimizing single-thread performance (latency) and to achieve higher performance (throughput) at better energy efficiency.
The most prominent are:
a) simultaneous multithreaded (SMT) processors, and
b) chip multiprocessors (CMP).
Multithreading
A microprocessor can execute multiple operations at a time: 4 or 6 operations per cycle.
It is hard to achieve this level of parallelism from a single program.
Can we run multiple programs (threads) on a single processor without much effort?
Simultaneous multithreading (SMT), or Hyperthreading, is a solution.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today’s processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
They support 2-4 threads, but expect to get only a 1.3X improvement in throughput.
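To see what such a modest throughput gain means for each thread, the sketch below divides the aggregate speedup by the number of threads sharing the core. It assumes the figure on the slide is a 1.3X aggregate improvement over one thread running alone, and that the gain is shared equally. The function name `per_thread_throughput` is hypothetical.

```python
def per_thread_throughput(aggregate_speedup, threads):
    """Average speed of each thread relative to running alone on the core.

    Assumes the aggregate speedup is shared equally among the threads,
    which is a simplification made for this illustration.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return aggregate_speedup / threads


if __name__ == "__main__":
    for threads in (2, 3, 4):
        share = per_thread_throughput(1.3, threads)
        print(f"{threads} threads: each runs at about {share:.2f}x of its solo speed")
```

Under these assumptions the core completes more total work, but every individual thread runs slower than it would alone, which is the latency-for-throughput trade that SMT makes.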
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
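One common way to reason about how close to linear such scaling can be is Amdahl's law, which is not part of the original slides and is brought in here only as an illustration. The sketch assumes a fixed fraction of the work can be spread evenly over the cores. The function name and the example fractions are hypothetical.

```python
def amdahl_speedup(cores, parallel_fraction):
    """Speedup over one core when only parallel_fraction of the work scales.

    This is Amdahl's law: 1 / (serial + parallel / cores).
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # The parallel fractions below are made-up values for illustration.
    for fraction in (0.80, 0.95, 1.00):
        row = ", ".join(
            f"{cores} cores: {amdahl_speedup(cores, fraction):.2f}x"
            for cores in (2, 4, 8)
        )
        print(f"parallel fraction {fraction:.2f} -> {row}")
```

In this model the speedup stays near the core count only while the serial part of the workload is small, which fits the slide's restriction of the near-linear claim to small-scale CMPs.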
Chip multiprocessor (CMP) platform model (continued)
CMP is an attractive option when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa 2010 Generic circa 2010 MicroprocessorMicroprocessor
4 ndash 8 general-purpose processing engines on chip used to execute independent programs
Explicitly parallel programs (when possible) Speculatively parallel threads
Special-purpose processing units (eg DSP functionality)
Elaborate memory hierarchyElaborate inter-chip communication facilities
Characteristics of superscalar simultaneous Characteristics of superscalar simultaneous multithreading and chip multiprocessor architecturesmultithreading and chip multiprocessor architectures
Characteristic SuperscalarSimultaneousmultithreading
Chipmultiprocessor
Number of CPUs 1 1 8CPU issue width 12 12 2 per CPUNumber of threads 1 8 1 per CPUArchitecture registers (for integer and floating point) 32 32 per thread 32 per CPUPhysical registers (for integer and floating point 32 + 256 256 + 256 32 + 32 per CPUInstruction window size 256 256 32 per CPUBranch predictor table size (entries) 32768 32768 8x4096Return stack size 64 entries 64 entries 8x8 entriesInstruction (I) and data (D) cache organization 1x8 banks 1x8 banks 1 bankI and D cache sizes 128 kbytes 128 kbytes 16 kbytes per CPUI and D cache associativities 4-way 4-way 4-wayI and 0 cache line sizes (bytes) 32 32 32I and P cache access times (cycles) 2 2 1Secondary cache organization (Mbytes) 1x8 banks 1x8 banks 1x8 banksSecondary cache size (bytes) 8 8 8Secondary cache associativity 4-way 4-way 4-waySecondary cache line size (bytes) 32 32 32Secondary cache access time (cycles) 5 5 7Secondary cache occupancy per access (cycles) 1 1 1Memory organization (no of banks) 4 4 4Memory access time (cycles) 50 50 50Memory occupancy per access (cycles) 13 13 13
The microprocessor tomorrow
When we say "microprocessor" tomorrow, we generally mean the shaded area of the figure.
Outline
• Technology Trends
• Process Technology Challenges - Low Power Design
• Microprocessors' Generations
• Challenges in Education
Outline - Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view on the theme: how shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education - not just the undergraduate curriculum - organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.
Changes in curricula
It is almost a cliché to talk about change, so much so that a passing reference to it becomes a substitute for serious thought about its implications.
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates.
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now - some examples
Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e., the design space for all engineers will include IT. Discrete mathematics, not continuous math, is the underpinning of IT. It is a new fundamental.
Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural, and business contexts, and so must understand them. They too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
Web-based teaching, distance learning, electronic books, and
interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Parallel Thread Sequencing Model
Principles of SMT
Multithreading in today's processors
Today many high-end microprocessors are multithreaded (e.g., Intel Pentium 4).
Support for 2-4 threads, but expect to get only a 1.3X improvement in throughput.
Chip Multiprocessor
Several processor cores on one die
Shared L2 caches
Chip communication to build a multichip module with many CMPs + memory
Chip multiprocessor (CMP) platform model
CMP is a simple, very powerful technique to obtain more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
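"Close to linear" scaling can be made concrete with the usual notion of parallel efficiency, the achieved speedup divided by the number of processors. The Python sketch below is only an illustration: the function names and the efficiency value of 0.9 are assumptions chosen for the example, not measurements from the original material.

```python
def scaled_throughput(cores: int, efficiency: float) -> float:
    """Throughput relative to one core, given an assumed parallel efficiency."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return cores * efficiency


def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal linear speedup that was actually achieved."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return speedup / cores


if __name__ == "__main__":
    # 0.9 is a hypothetical efficiency used only to illustrate the formula.
    for n in (2, 4, 8):
        print(n, "cores ->", scaled_throughput(n, 0.9), "x single-core throughput")
```

An efficiency of exactly 1.0 corresponds to perfectly linear scaling; values slightly below 1.0 correspond to the "close to linear" behavior described for small-scale CMPs.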
Chip MultiprocessorChip Multiprocessor
Several processor cores in one die
Shared L2 caches
Chip Communication to build multichip module with many CMPs + memory
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model
CMP is a simple very powerful techiniques to obtain more performance in a power-effecient manner
The idea is to put several microprocessors on a single die This type of architecture is reffered also as Multiprocessor System-on-Chip (MPSoC)
The performance of small-scale CMP scales close to linear with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system
Chip multiprocessor (CMP) Chip multiprocessor (CMP) platform modelplatform model - continue - continue
CMP is an atractive option to use when moving to a new process technology such as SoC
Typical MPSoC applications we meet in network processors multimedia hubs signal processors etc
MPSoCs are usually implemented as heterogenous systems
CMT and SMT can coexist-a CMP die can integrate several SMT microprocessors
Generic circa 2010 Generic circa 2010 MicroprocessorMicroprocessor
4 ndash 8 general-purpose processing engines on chip used to execute independent programs
Explicitly parallel programs (when possible) Speculatively parallel threads
Special-purpose processing units (eg DSP functionality)
Elaborate memory hierarchyElaborate inter-chip communication facilities
Characteristics of superscalar simultaneous Characteristics of superscalar simultaneous multithreading and chip multiprocessor architecturesmultithreading and chip multiprocessor architectures
Characteristic SuperscalarSimultaneousmultithreading
Chipmultiprocessor
Number of CPUs 1 1 8CPU issue width 12 12 2 per CPUNumber of threads 1 8 1 per CPUArchitecture registers (for integer and floating point) 32 32 per thread 32 per CPUPhysical registers (for integer and floating point 32 + 256 256 + 256 32 + 32 per CPUInstruction window size 256 256 32 per CPUBranch predictor table size (entries) 32768 32768 8x4096Return stack size 64 entries 64 entries 8x8 entriesInstruction (I) and data (D) cache organization 1x8 banks 1x8 banks 1 bankI and D cache sizes 128 kbytes 128 kbytes 16 kbytes per CPUI and D cache associativities 4-way 4-way 4-wayI and 0 cache line sizes (bytes) 32 32 32I and P cache access times (cycles) 2 2 1Secondary cache organization (Mbytes) 1x8 banks 1x8 banks 1x8 banksSecondary cache size (bytes) 8 8 8Secondary cache associativity 4-way 4-way 4-waySecondary cache line size (bytes) 32 32 32Secondary cache access time (cycles) 5 5 7Secondary cache occupancy per access (cycles) 1 1 1Memory organization (no of banks) 4 4 4Memory access time (cycles) 50 50 50Memory occupancy per access (cycles) 13 13 13
The microprocessor tomorrowThe microprocessor tomorrow When we say ldquomicroprocessorrdquo tomorrow we generally mean the shaded area of the figure
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Challenges in EducationChallenges in Education
bullChanges in curriculaChanges in curriculabullFundamentalsFundamentalsbullA sort of the challenge we should acceptA sort of the challenge we should accept
ChalengeChalengess in Education in Education
It has often said that Where you stand depends on where you sit
In this context starting from our positions an experiences this is our view concerning the theme How shall we satisfy the long-term educational needs of engineers
How to organize a training of new How to organize a training of new engineers engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then
Is the whole system of engineering education ndash not just the undergraduate curriculum ndash organized to support todayrsquos graduate for the next 40 years
We think not on both counts
Our view amp our experienceOur view amp our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academy and industry) which admittedly has changed more rapidly than same other fields
Changes in curriculaChanges in curricula
It is almost a clicheacute to talk about change ndash so mach so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentals?
The undergraduate curriculum should teach (only) fundamentals.
Everyone agrees with that.
But what are fundamentals?
Since the adoption of the engineering science model, the fundamentals have been largely continuous mathematics and physics.
But, as we said earlier, engineering is changing.
What kinds of fundamentals we need now – some examples
• Information technology (IT) will be embedded in virtually every engineered product and process in the future, i.e. the design space for all engineers will include IT. Discrete mathematics, not continuous mathematics, is the underpinning of IT. It is a new fundamental.
• Biological materials and processes are a bit behind IT in their impact on engineering, but they are closing fast. Thus the chemical and biological sciences are also becoming fundamental to engineering.
Kinds of Fundamentals
• Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fields. More knowledge of the full spectrum will be a fundamental.
• Engineering is global and is performed in a holistic business context. The engineer must design under constraints that include global, cultural and business contexts, and so must understand them. They too are new fundamentals.
How to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full.
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly.
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that
• Web-based teaching,
• distance learning,
• electronic books, and
• interactive learning environments
will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During a visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he has never seen a process that cannot be speeded up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Chip multiprocessor (CMP) platform model
CMP is a simple yet very powerful technique for obtaining more performance in a power-efficient manner.
The idea is to put several microprocessors on a single die. This type of architecture is also referred to as a Multiprocessor System-on-Chip (MPSoC).
The performance of a small-scale CMP scales close to linearly with the number of microprocessors and is likely to exceed the performance of an equivalent multiprocessor system.
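One way to see why scaling can stay close to linear for a small number of processors is Amdahl's law. The law is not part of the original slides and is used here only as an illustrative assumption. The sketch below computes the ideal speedup on n processors when a fraction p of the work can run in parallel; the function name and the value p = 0.95 are hypothetical.

```python
def amdahl_speedup(n_cpus: int, parallel_fraction: float) -> float:
    """Ideal speedup on n_cpus when parallel_fraction of the work is parallel."""
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is not accelerated; the parallel part is split n ways.
    return 1.0 / (serial_fraction + parallel_fraction / n_cpus)


if __name__ == "__main__":
    # Hypothetical workload: 95% of the work can be parallelised.
    for n in (1, 2, 4, 8):
        s = amdahl_speedup(n, 0.95)
        print(f"{n} CPUs: speedup {s:.2f}, efficiency {s / n:.0%}")
```

For fully independent programs (p = 1) the model gives a speedup of n; any serial fraction makes the curve fall below linear as n grows, which is consistent with the near-linear claim being limited to small-scale CMPs.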
Chip multiprocessor (CMP) platform model – continued
CMP is an attractive option to use when moving to a new process technology such as SoC.
Typical MPSoC applications are found in network processors, multimedia hubs, signal processors, etc.
MPSoCs are usually implemented as heterogeneous systems.
CMP and SMT can coexist: a CMP die can integrate several SMT microprocessors.
Generic circa-2010 microprocessor
• 4 – 8 general-purpose processing engines on chip, used to execute independent programs, explicitly parallel programs (when possible), and speculatively parallel threads
• Special-purpose processing units (e.g. DSP functionality)
• Elaborate memory hierarchy
• Elaborate inter-chip communication facilities
Characteristics of superscalar, simultaneous multithreading, and chip multiprocessor architectures
Characteristic | Superscalar | Simultaneous multithreading | Chip multiprocessor
Number of CPUs | 1 | 1 | 8
CPU issue width | 12 | 12 | 2 per CPU
Number of threads | 1 | 8 | 1 per CPU
Architecture registers (for integer and floating point) | 32 | 32 per thread | 32 per CPU
Physical registers (for integer and floating point) | 32 + 256 | 256 + 256 | 32 + 32 per CPU
Instruction window size | 256 | 256 | 32 per CPU
Branch predictor table size (entries) | 32768 | 32768 | 8 x 4096
Return stack size | 64 entries | 64 entries | 8 x 8 entries
Instruction (I) and data (D) cache organization | 1 x 8 banks | 1 x 8 banks | 1 bank
I and D cache sizes | 128 kbytes | 128 kbytes | 16 kbytes per CPU
I and D cache associativities | 4-way | 4-way | 4-way
I and D cache line sizes (bytes) | 32 | 32 | 32
I and D cache access times (cycles) | 2 | 2 | 1
Secondary cache organization | 1 x 8 banks | 1 x 8 banks | 1 x 8 banks
Secondary cache size (Mbytes) | 8 | 8 | 8
Secondary cache associativity | 4-way | 4-way | 4-way
Secondary cache line size (bytes) | 32 | 32 | 32
Secondary cache access time (cycles) | 5 | 5 | 7
Secondary cache occupancy per access (cycles) | 1 | 1 | 1
Memory organization (no. of banks) | 4 | 4 | 4
Memory access time (cycles) | 50 | 50 | 50
Memory occupancy per access (cycles) | 13 | 13 | 13
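As a quick check on the table, the per-CPU figures in the chip multiprocessor column can be multiplied by its eight CPUs and compared with the single-CPU superscalar design. The following is a minimal sketch with values copied from the table; the dictionary and function names are illustrative only and do not come from the original slides.

```python
# Number of CPUs in the chip multiprocessor (CMP) column of the table.
CMP_CPUS = 8

# Per-CPU resources of the CMP, copied from the table.
cmp_per_cpu = {
    "issue_width": 2,
    "instruction_window": 32,
    "branch_predictor_entries": 4096,
    "return_stack_entries": 8,
    "l1_cache_kbytes": 16,  # size of each of the I and D caches
}

# Corresponding whole-chip values of the superscalar column.
superscalar_total = {
    "issue_width": 12,
    "instruction_window": 256,
    "branch_predictor_entries": 32768,
    "return_stack_entries": 64,
    "l1_cache_kbytes": 128,
}


def cmp_aggregate(per_cpu, cpus=CMP_CPUS):
    """Scale every per-CPU resource by the number of CPUs on the die."""
    return {name: value * cpus for name, value in per_cpu.items()}


if __name__ == "__main__":
    totals = cmp_aggregate(cmp_per_cpu)
    for name, total in totals.items():
        print(f"{name}: CMP total {total}, superscalar {superscalar_total[name]}")
```

With these numbers the CMP matches the superscalar design in aggregate instruction window, branch predictor, return stack and primary cache capacity, while its total issue width (16) is somewhat larger than the superscalar's 12.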
The microprocessor tomorrowThe microprocessor tomorrow When we say ldquomicroprocessorrdquo tomorrow we generally mean the shaded area of the figure
OutlineOutline
bullTechnology TrendsTechnology TrendsbullProcess Technology Challenges ndash Low Power DesignProcess Technology Challenges ndash Low Power DesignbullMicroprocessorsrsquo GenerationsMicroprocessorsrsquo GenerationsbullChallenges in EducationChallenges in Education
Outline Outline Challenges in EducationChallenges in Education
bullChanges in curriculaChanges in curriculabullFundamentalsFundamentalsbullA sort of the challenge we should acceptA sort of the challenge we should accept
ChalengeChalengess in Education in Education
It has often said that Where you stand depends on where you sit
In this context starting from our positions an experiences this is our view concerning the theme How shall we satisfy the long-term educational needs of engineers
How to organize a training of new How to organize a training of new engineers engineers
The engineers we are training today will still be practicing 40 years from now
Are we preparing them for what they will be doing then
Is the whole system of engineering education ndash not just the undergraduate curriculum ndash organized to support todayrsquos graduate for the next 40 years
We think not on both counts
Our view amp our experienceOur view amp our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up
Our experiences are primarily in information technology (both in academy and industry) which admittedly has changed more rapidly than same other fields
Changes in curriculaChanges in curricula
It is almost a clicheacute to talk about change ndash so mach so that a passing reference to it becomes a substitute for serious thought about its implications
But the fact is that the practice of engineering is changing at about the same pace as the technology it creates
What are fundamentalsWhat are fundamentals
The undergraduate curriculum should teach (only) fundamentals
Everyone agrees with that
But what are fundamentals
Since the adoption of the engineering science model the fundamentals have been largely continuous mathematics and physics
But as we said earlier engineering is changing
What kinds of fundamentals we What kinds of fundamentals we need now ndash some examplesneed now ndash some examples
Information technology (IT) will be embedded in virtually engineered product and process in the future ndash ie the design space for all engineers will include ITDiscrete mathematics not continuous math is the underpinning of ITIt is a new fundamental
Biological materials and process are a bit behind IT in their impact on engineering but they a closing fastThus the chemical and biological sciences are also becoming fundamental to engineering
Kinds of FundamentalsKinds of Fundamentals
Engineering systems are increasingly complex and increasingly contain components from across the spectrum of traditional engineering fieldsMore knowledge of the full spectrum will be the fundamental
Engineering is global and is performed in a holistic business contextThe engineer must design under constraints that include global cultural and business contexts and so must understand themThey two are new fundamentals
How to add these new fundamentalsHow to add these new fundamentals
The challenge is that we cannot just add these new fundamentals to a curriculum that is already too full
We have to look critically at the current cherished fundamentals and either displace them or find ways to cover them much more rapidly
What will the character and essence of electrical and computer engineering education look like in the future?
It is difficult to predict the future with any accuracy, but it is safe to say that Web-based teaching, distance learning, electronic books, and interactive learning environments will play increasingly significant roles in shaping what we teach, how we teach, and how students learn.
A sort of challenge we should accept
During one visit to our faculty, Prof. Krishna Shenai of the University of Illinois at Chicago, director of the Micro Systems Research Center, told us that he had never seen a process that could not be sped up by a factor of two and improved in quality at the same time.
That is the sort of challenge we should accept for improving engineering education.
Outline – Challenges in Education
• Changes in curricula
• Fundamentals
• A sort of challenge we should accept
Challenges in Education
It has often been said that "Where you stand depends on where you sit."
In this context, starting from our positions and experiences, this is our view concerning the theme: How shall we satisfy the long-term educational needs of engineers?
How to organize the training of new engineers
The engineers we are training today will still be practicing 40 years from now.
Are we preparing them for what they will be doing then?
Is the whole system of engineering education – not just the undergraduate curriculum – organized to support today's graduates for the next 40 years?
We think not, on both counts.
Our view & our experience
Our view is that the practice of engineering is rapidly changing and that engineering education is not keeping up.
Our experiences are primarily in information technology (both in academia and industry), which admittedly has changed more rapidly than some other fields.