
VLSI Architecture: Past, Present, and Future

William J. Dally, Computer Systems Laboratory

Stanford University

March 23, 1999


Past, Present, and Future

• The last 20 years have seen a 1000-fold increase in grids per chip and a 20-fold reduction in gate delay

• We expect this trend to continue for the next 20 years

• For the past 20 years, these devices have been applied to implicit parallelism

• We will see a shift toward explicit parallelism over the next 20 years


Technology Evolution

[Figure: wire pitch (µm) and gate length (µm) vs. year, 1960-2010, on a log scale from 10^-1 to 10^3; gate delay (ns) vs. year, 1960-2010, on a log scale from 10^-2 to 10^2.]


Technology Evolution (2)

Parameter     1979        1999        2019         Units
Gate Length   5           0.2         0.008        µm
Gate Delay    3000        150         7.5          ps
Clock Cycle   200         2.5         0.08         ns
Gates/Clock   67          17          10
Wire Pitch    15          1           0.07         µm
Chip Edge     6           15          38           mm
Grids/Chip    1.6 x 10^5  2.3 x 10^8  3.0 x 10^11
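If a "grid" is read as one wire-pitch square of die area (my assumption, consistent with the numbers above), the Grids/Chip row follows directly from Chip Edge and Wire Pitch. A quick check, as an illustrative sketch:

```python
# Illustrative check (an assumption): a "grid" is one wire-pitch square of die area.
params = {          # year: (chip edge in mm, wire pitch in um)
    1979: (6, 15.0),
    1999: (15, 1.0),
    2019: (38, 0.07),
}
for year, (edge_mm, pitch_um) in params.items():
    grids = (edge_mm * 1000 / pitch_um) ** 2   # convert mm -> um, then square
    print(f"{year}: {grids:.1e} grids per chip")
# Prints roughly 1.6e+05, 2.2e+08, 2.9e+11 -- matching the table's
# 1.6x10^5, 2.3x10^8, and 3.0x10^11 to rounding.
```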


Architecture Evolution

Year   Microprocessor                        High-end Processor
1979   i8086: 0.5 MIPS, 0.001 MFLOPS         Cray-1: 70 MIPS, 250 MFLOPS
1999   Compaq 21264: 500 MIPS, 500 MFLOPS    (x 4?)
2019   "X": 10,000 MIPS, 10,000 MFLOPS       MP with 1000 X's


Incremental Returns

[Figure: Performance vs. Processor Cost (Die Area), showing diminishing incremental returns as designs move from pipelined RISC to dual-issue in-order to quad-issue out-of-order.]


Efficiency and Granularity

[Figure: Peak Performance vs. System Cost (Die Area) for P+M, 2P+M, and 2P+2M configurations.]


VLSI in 1979


VLSI Architecture in 1979

• 5 µm NMOS technology
• 6 mm die size
• 100,000 grids per chip, 10,000 transistors
• 8086 microprocessor
  – 0.5 MIPS


1979-1989: Attack of the Killer Micros

• 50% per year improvement in performance
• Transistors applied to implicit parallelism
  – pipelined processor (10 CPI --> 1 CPI)
  – shorter clock cycle (67 gates/clock --> 30 gates/clock)
• In 1989, a 32-bit processor with floating point and caches fit on one chip
  – e.g., i860: 40 MIPS, 40 MFLOPS
  – 5,000,000 grids, 1M transistors (much of it memory)


1989-1999: The Era of Diminishing Returns

• 50% per year increase in performance through 1996, but
  – projects delayed, performance below expectations
  – 50% increase in grids, 15% increase in frequency (72% total)
• Squeezing out the last implicit parallelism
  – 2-way to 6-way issue, out-of-order issue, branch prediction
  – 1 CPI --> 0.5 CPI, 30 gates/clock --> 20 gates/clock
• Convert data parallelism to ILP
• Examples
  – Intel Pentium II (3-way o-o-o)
  – Compaq 21264 (4-way o-o-o)


1979-1999: Why Implicit Parallelism?

• Opportunity
  – large gap between micros and the fastest processors
• Compatibility
  – software pool ready to run on implicitly parallel machines
• Technology
  – not available for fine-grain explicitly parallel machines


1999-2019: Explicit Parallelism Takes Over

• Opportunity
  – no more processor gap
• Technology
  – interconnection, interaction, and shared-memory technologies have been proven


Technology for Fine-Grain Parallel Machines

• A collection of workstations does not make a good parallel machine. (BLAGG)
  – Bandwidth - large fraction (0.1) of local memory BW
  – LAtency - small multiple (3) of local memory latency
  – Global mechanisms - sync, fetch-and-op
  – Granularity - of tasks (100 inst) and memory (8 MB)


Technology for Parallel Machines: Three Components

• Networks
  – 2 clocks/hop latency
  – 8 GB/s global bandwidth
• Interaction mechanisms
  – single-cycle communication and synchronization

• Software


k-ary n-cubes

• Link bandwidth, B, depends on radix, k, for both wire- and pin-limited networks.

• Select the radix to trade off diameter, D, against B.

Zero-load latency: T = Tc (D + L/B), where D is the hop count (roughly nk/4 for a torus) and L the message length.

Dally, "Performance Analysis of k-ary n-cube Interconnection Networks", IEEE TC, 1990

[Figure: Latency vs. Dimension for 4K nodes, L = 256, Bs = 16K.]
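To make the radix/dimension trade-off concrete, here is a rough Python sketch in the spirit of the cited paper. The constant-bisection normalization (channel width W = k/2) and the unidirectional-torus hop count are my assumptions, not values taken from the slide:

```python
# Rough sketch: latency vs. dimension for k-ary n-cubes under a fixed wire bisection.
# Assumptions (mine): W = k/2 at constant bisection, unidirectional-torus hop count,
# zero-load latency T = D + L/W in router cycles.
N = 4096                       # nodes (the slide's 4K-node example)
L = 256                        # message length
for n in (1, 2, 3, 4, 6, 12):  # dimensions for which k = N**(1/n) is an integer
    k = round(N ** (1.0 / n))  # radix per dimension
    D = n * (k - 1) / 2        # average hop count
    W = k / 2                  # channel width under constant bisection
    T = D + L / W              # zero-load latency
    print(f"n={n:2d}  k={k:4d}  hops={D:7.1f}  latency={T:7.1f}")
# Low dimension gives wide channels but long paths; high dimension gives short
# paths but narrow channels. The minimum falls at modest n (here n = 3), which
# is the slide's point about matching radix to packaging constraints.
```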


Delay of Express Channels


The Torus Routing Chip

• k-ary n-cube topology
  – 2D torus network
  – 8-bit x 20 MHz channels
• Hardware routing
• Wormhole routing
• Virtual channels
• Fully self-timed design
• Internal crossbar architecture

Dally and Seitz, “The Torus Routing Chip”, Distributed Computing, 1986


The Reliable Router

Dally, Dennison, Harris, Kan, and Xanthopoulos, "Architecture and Implementation of the Reliable Router", Hot Interconnects II, 1994
Dally, Dennison, and Xanthopoulos, "Low-Latency Plesiochronous Data Retiming", ARVLSI 1995
Dennison, Lee, and Dally, "High Performance Bidirectional Signalling in VLSI Systems", SIS 1993

• Fault-tolerant
  – Adaptive routing (adaptation of Duato's algorithm)
  – Link-level retry
  – Unique token protocol
• 32-bit x 200 MHz channels
  – Simultaneous bidirectional signalling
  – Low-latency plesiochronous synchronizers
• Optimistic routing


Equalized 4Gb/s Signaling


End-to-End Latency

• Software sees ~10 µs latency with a 500 ns network

• Heavy compute load associated with sending a message
  – system call
  – buffer allocation
  – synchronization
• Solution: treat the network like memory, not like an I/O device
  – hardware formatting, addressing, and buffer allocation

[Figure: message path from registers through Send, Network, Buffer, and Dispatch stages across the transmitting (Tx) and receiving (Rx) nodes.]
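A back-of-the-envelope calculation shows why the software path dominates; every number below except the 500 ns network figure is an assumption of mine, chosen only for illustration:

```python
# Illustrative only: all costs except the 500 ns network figure are assumed.
network_ns = 500          # hardware network traversal (from the slide)
syscall_ns = 3000         # assumed: trap into the OS and back
buffer_ns  = 4000         # assumed: buffer allocation and copying
sync_ns    = 2500         # assumed: synchronization on send/receive
io_style     = network_ns + syscall_ns + buffer_ns + sync_ns
memory_style = network_ns + 2 * 20   # assumed: ~20 ns per end for register-mapped send/dispatch
print(f"network treated as an I/O device: {io_style} ns  (~10 us, what software sees)")
print(f"network treated like memory:      {memory_style} ns")
```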


Network Summary

• We can build networks with 2-4 clocks/hop latency (12-24 clocks for a 512-node 3-cube)
  – networks faster than the main memory access of modern machines
  – need end-to-end hardware support to see this, no 'libraries'
• With high-speed signaling, bandwidth of 4 GB/s or more per channel (512 GB/s bisection) is easy to achieve
  – nearly flat memory bandwidth
• Topology is a matter of matching pin and bisection constraints to the packaging technology
  – it's hard to beat a 3-D mesh or torus
• This gives us B and LA (of BLAGG)


The Importance of Mechanisms

[Figure: serial execution of tasks A and B.]


The Importance of Mechanisms

[Figure: serial execution of A and B compared with parallel execution in which per-task overhead, communication, and synchronization are high (overhead 0.5).]


The Importance of Mechanisms

[Figure: serial execution of A and B compared with parallel execution at high overhead (0.5) and at low overhead (0.062); lower overhead lets far more of the work proceed in parallel.]
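A toy model of these pictures (my own formulation, not the talk's): if each task run in parallel pays communication and synchronization overhead equal to a fraction f of its useful work, the available speedup shrinks from 2x to 2/(1+f):

```python
# Toy model (mine): two equal tasks; each parallel task pays overhead f
# (communication + synchronization) as a fraction of its useful work.
def speedup(f, tasks=2):
    serial_time = tasks * 1.0     # run A then B
    parallel_time = 1.0 + f       # one task per processor, plus its overhead
    return serial_time / parallel_time

for f in (0.5, 0.062):
    print(f"overhead {f}: speedup {speedup(f):.2f}x of an ideal 2x")
# High overhead (0.5) forfeits a third of the potential speedup (~1.33x);
# low overhead (0.062) keeps nearly all of it (~1.88x). Fast mechanisms keep f small.
```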


Granularity and Cost Effectiveness

• Parallel computers are built for
  – Capability - run problems that are too big or take too long to solve any other way
    • absolute performance at any cost
  – Capacity - get throughput on lots of small problems
    • performance/cost
• A parallel computer built from workstation-size nodes will always have lower perf/cost than a workstation
  – sublinear speedup
  – economies of scale
• A parallel computer with less memory per node can have better perf/cost than a workstation

[Figure: a conventional node (one processor and cache attached to a large memory) contrasted with a fine-grain machine built from many small processor + cache + memory tiles.]
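As a sketch of the argument (all numbers are mine, chosen only for illustration): if memory dominates node cost, spreading the same total memory across many small processor+memory tiles improves performance per unit cost even with imperfect speedup:

```python
# Toy perf/cost comparison (illustrative numbers): assume a processor + cache
# costs 1 area unit and memory costs 1 unit per 32 MB.
def perf_per_cost(processors, mem_mb, efficiency=1.0):
    cost = processors * 1.0 + mem_mb / 32.0
    perf = processors * efficiency
    return perf / cost

workstation = perf_per_cost(1, 512)                   # one processor, 512 MB
fine_grain  = perf_per_cost(64, 512, efficiency=0.5)  # 64 tiles sharing the same 512 MB,
                                                      # even at 50% parallel efficiency
print(f"workstation perf/cost: {workstation:.2f}")
print(f"fine-grain  perf/cost: {fine_grain:.2f}")
# The many-small-node machine wins because processors are cheap next to memory:
# it is memory per node, not node count, that sets the cost.
```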


MIT J-Machine (1991)


Exploiting fine-grain threads

• Where will the parallelism come from to keep all of these processors busy?
  – ILP - limited to about 5
  – Outer-loop parallelism
    • e.g., domain decomposition
    • requires big problems to get lots of parallelism
• Fine threads
  – make communication and synchronization very fast (1 cycle)
  – break the problem into smaller pieces
  – more parallelism


Mechanism and Granularity Summary

• Fast communication and synchronization mechanisms enable fine-grain task decomposition
  – simplifies programming
  – exposes parallelism
  – facilitates load balance
• Have demonstrated
  – 1-cycle communication and synchronization locally
  – 10-cycle communication, synchronization, and task dispatch across a network
• Physically fine-grain machines have better performance/cost than sequential machines


A 2009 Multicomputer

[Figure: system of 16 chips; each chip holds 64 tiles; each tile is a processor plus 8 MB of memory.]
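That works out to 16 x 64 = 1,024 processor tiles and, at 8 MB each, roughly 8 GB of memory across the system.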


Challenges for the Explicitly Parallel Era

• Compatibility
• Managing locality
• Parallel software


Compatibility

• Almost no fine-grain parallel software exists

• Writing parallel software is easy
  – with good mechanisms
• Parallelizing sequential software is hard
  – it needs to be designed from the ground up
• An incremental migration path
  – run sequential codes with acceptable performance
  – parallelize selected applications for considerable speedup


Performance Depends on Locality

• Applications have data/time-dependent graph structure
  – Sparse-matrix solution
    • non-zero and fill-in structure
  – Logic simulation
    • circuit topology and activity
  – PIC codes
    • structure changes as particles move
  – 'Sort-middle' polygon rendering
    • structure changes as the viewpoint moves


Fine-Grain Data Migration: Drift and Diffusion

• Run-time relocation based on pointer use
  – move data at both ends of a pointer
  – move control and data
• Each 'relocation cycle' (see the sketch below)
  – compute a drift vector based on pointer use
  – compute a diffusion vector based on density potential (Taylor)
  – need to avoid oscillations
• Should data be replicated?
  – not just update vs. invalidate
  – need to duplicate computation to avoid communication
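Here is a minimal sketch of one relocation cycle, assuming a 1-D array of tiles and simple proportional gains; this is my illustration of the drift-and-diffusion idea, not the actual algorithm from the talk:

```python
# Hedged sketch (mine): one drift-and-diffusion relocation cycle on a 1-D array
# of tiles. Objects drift toward the tiles their pointers reference and diffuse
# away from crowded tiles.
import random

TILES = 16
objects  = {i: random.randrange(TILES) for i in range(64)}      # object id -> tile
pointers = {i: random.sample(range(64), 3) for i in range(64)}  # objects each object references

def relocation_cycle(objects, pointers, drift_gain=0.5, diffusion_gain=0.25):
    density = [0] * TILES
    for tile in objects.values():
        density[tile] += 1
    mean_density = len(objects) / TILES
    new_placement = {}
    for obj, tile in objects.items():
        # Drift: pull toward the average location of the objects it points to.
        targets = [objects[t] for t in pointers[obj]]
        drift = drift_gain * (sum(targets) / len(targets) - tile)
        # Diffusion: push from the denser neighbor toward the sparser one.
        left  = density[tile - 1] if tile > 0 else mean_density
        right = density[tile + 1] if tile < TILES - 1 else mean_density
        diffusion = diffusion_gain * (left - right) / max(mean_density, 1)
        # Damped step (gains < 1) to avoid the oscillations the slide warns about.
        new_placement[obj] = min(TILES - 1, max(0, round(tile + drift + diffusion)))
    return new_placement

objects = relocation_cycle(objects, pointers)
```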


Migration and Locality

[Figure: communication distance (in tiles, 0-6) vs. migration period (1-37) for four policies: NoMigration, OneStep, Hierarchy, and Mixed.]


Parallel Software: Focus on the Real Problems

• Almost all demanding problems have ample parallelism

• Need to focus on the fundamental problems
  – extracting parallelism
  – load balance
  – locality
    • load balance and locality can be covered by excess parallelism
• Avoid incidental issues
  – aggregating tasks to avoid overhead
  – manually managing data movement and replication
  – oversynchronization


Parallel Software: Design Strategy

• A program must be designed for parallelism from the ground up
  – no bottlenecks in the data structures
    • e.g., arrays instead of linked lists
• Data parallelism (see the sketch below)
  – many for loops (over data, not time) can be forall
  – break dependencies out of the loop
  – synchronize on natural units (no barriers)
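As a small illustration (mine, not from the slides) of the "for loops over data can be forall" point: when each iteration touches only its own element, the same loop can be dispatched as independent tasks instead of being run in program order. The function and data below are placeholders; any pure per-element function qualifies.

```python
# A data loop with no cross-iteration dependence, first as an ordinary for loop
# and then as a forall (independent tasks).
from concurrent.futures import ProcessPoolExecutor

def relax(cell):
    return 0.25 * sum(cell)          # purely local work on one element

if __name__ == "__main__":
    cells = [(i, i + 1, i + 2, i + 3) for i in range(10_000)]

    # for-loop version: written in program order, but the order carries no dependence
    updated_serial = [relax(c) for c in cells]

    # forall version: the same loop dispatched as independent parallel tasks
    with ProcessPoolExecutor() as pool:
        updated_parallel = list(pool.map(relax, cells, chunksize=512))

    assert updated_parallel == updated_serial
```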


Conclusion: We are on the threshold of the explicitly parallel era

• As in 1979, we expect a 1000-fold increase in ‘grids’ per chip in the next 20 years

• Unlike 1979, these 'grids' are best applied to explicitly parallel machines
  – Diminishing returns from sequential processors (ILP) - no alternative to explicit parallelism
  – Enabling technologies have been proven
    • interconnection networks, mechanisms, cache coherence
  – Fine-grain machines are more efficient than sequential machines

• Fine-grain machines will be constructed from multi-processor/DRAM chips

• Incremental migration to parallel software