Windows NT Scalability
Jim Gray, Microsoft Research
[email protected] (http/Gray/talks)
Bell & Gray 3 / 95
Network Platform
Jim Gray, 310 Filbert, SF CA 94133
Gordon Bell, 450 Old Oak Court, Los Altos, CA, [email protected]
MetaMessage: Technology Ratios Are Important
• If everything gets faster & cheaper at the same rate, THEN nothing really changes.
• Things getting MUCH BETTER (10^4x in 25 years):
– communication speed & cost
– processor speed & cost (PAP)
– storage size & cost
• Things getting a little better (10x in 25 years):
– storage latency & bandwidth
– real application performance (RAP)
• Things staying about the same:
– speed of light (more or less constant)
– people (10x more expensive)
Consequent Message
• Processing and Storage are WONDERFULLY cheaper
• Storage latencies not much improved
• Must get performance (RAP) via
– Pipeline parallelism (mask latency)
– Partition parallelism (bandwidth and mask latency)
• Scaleable Hardware/Software architecture
– Scaleable & Commodity Network / Interconnect
– Commodity Hardware (processors, disks, memory)
– Commodity Software (OS, PL, Apps)
– Scaleability thru automatic parallel programming
– Manage & program as a single system
– Mask faults
Outline
• Storage trends force pipeline & partition parallelism
– Lots of bytes & bandwidth per dollar
– Lots of latency
• Processor trends force pipeline & partition
– Lots of MIPS per dollar
– Lots of processors
– Putting it together
Moore's Law: Exponential Change Means Continual Rejuvenation
• XXX doubles every 18 months = 60% increase per year
– Micro Processor speeds
– CMOS chip density (memory chips)
– Magnetic disk density
– Communications bandwidth
• WAN bandwidth approaching LANs
• Exponential Growth:
– The past does not matter
– 10x here, 10x there, soon you're talking real change
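The 18-month doubling and the 60%-per-year figure are the same claim; a quick back-of-envelope check (not from the slides):

```python
# Growth implied by an 18-month doubling time.
doubling_months = 18
annual = 2 ** (12 / doubling_months)   # growth factor per year
decade = annual ** 10                  # growth factor per decade

assert round((annual - 1) * 100) == 59   # ~60% per year
assert 100 < decade < 110                # ~100x per decade
```

So each decade of Moore's Law delivers roughly two orders of magnitude, which is why "the past does not matter."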
Moore’s Law For Memory
[Chart: one-chip memory capacity vs year, 1970-2000, from 1 Kbit to 256 Mbit; typical one-chip memory size 2 MB to 32 MB; the 640K DOS limit marked]
Will Moore's Law continue to hold?
Moore's Law for Memory
[Chart: memory price at $50/chip vs year, 1970-2000, for chip sizes 1 Kbit to 256 Mbit; number of chips and price for capacities from 8 KB ($50, 1/8th chip at $6) up to 1 GB ($1.6M at early densities); 4 GB capacity with 64 Mb DRAMs; the 640K DOS limit marked]
Trends: Storage Got Cheaper
• $/byte got 10^4 better
• $/access got 10^3 better
• capacity grew 10^3
• Latency down 10x
• Bandwidth up 10x
[Charts: bytes per dollar (10^2 to 10^8) and accesses per second (10^0 to 10^6) vs year, 1970-2000, for RAM, disk, and tape (tape: accesses/hour; disk: accesses/minute; RAM: accesses/second)]
Partition Parallelism Gives Bandwidth
• Parallelism: use many little devices in parallel
• Solves the bandwidth problem
• Beware of the media myth
• Beware of the access time myth
• 1 Terabyte at 10 MB/s: 1.2 days to scan
• 1 Terabyte on 1,000 drives, 1,000x faster: 2 minutes to scan
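The scan arithmetic above checks out; a quick sketch (decimal units assumed):

```python
# One disk vs a 1,000-disk partitioned scan of a terabyte.
TB = 10 ** 12             # 1 terabyte (decimal bytes)
one_disk = 10 * 10 ** 6   # 10 MB/s

scan_s = TB / one_disk                    # sequential scan on one disk
assert round(scan_s / 86_400, 1) == 1.2   # ~1.2 days, as on the slide

many = scan_s / 1_000                     # 1,000 disks, one partition each
assert round(many / 60) == 2              # ~2 minutes
```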
Partitioned Data Has Natural Parallelism
• Split a SQL Table across many disks, memories, processors.
Partition and/or Replicate data to get parallel disk access
[Figure: a table range-partitioned into A...E, F...J, K...N, O...S, T...Z]
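A minimal sketch of the range partitioning the figure shows, with hypothetical row data:

```python
# Range-partition rows across five "disks" using the slide's key ranges.
RANGES = [("A", "E"), ("F", "J"), ("K", "N"), ("O", "S"), ("T", "Z")]

def partition(rows, key=lambda r: r[0]):
    parts = [[] for _ in RANGES]
    for row in rows:
        first = key(row)[0].upper()          # partition on the first letter
        for i, (lo, hi) in enumerate(RANGES):
            if lo <= first <= hi:
                parts[i].append(row)
                break
    return parts

parts = partition([("Adams",), ("Gray",), ("Smith",), ("Zhou",)])
assert [len(p) for p in parts] == [1, 1, 0, 1, 1]
```

Each partition can then be scanned by its own disk and processor in parallel.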
Today’s Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs
[Charts: size in bytes (10^3 to 10^15) and price in $/MB (10^-4 to 10^4) vs access time (10^-9 to 10^3 seconds) for cache, main memory, secondary disc, and online, nearline, and offline tape]
Trends: Application Storage Demand Grew
• The Old World:
– millions of objects
– 100-byte objects
• The New World:
– billions of objects
– big objects (1 MB)
[Figure: a "People" table; the old world holds Name and Address columns; the new world adds Papers, Picture, and Voice columns for the same rows (Mike, Won, David, ...)]
• Paperless office
• Library of Congress online
• All information online: entertainment, publishing, business
• Information Network, Knowledge Navigator, Information At Your Fingertips
Good News: Electronic Storage Ratios Beat Paper
• File Cabinet: cabinet (4 drawer) $250; paper (24,000 sheets) $250; space (2x3 ft @ $10/ft^2) $180; total $700 = 3¢/sheet
• Disk: disk (8 GB) $4,000; ASCII: 4 M pages = 0.1¢/sheet (30x cheaper)
• Image: 200 K pages = 2¢/sheet (similar to paper)
• Store everything on disk
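The cost-per-sheet figures reproduce from the listed prices (assuming ~2 KB per ASCII page and ~40 KB per image page, which is what 4 M and 200 K pages per 8 GB disk imply):

```python
# Reproduce the slide's cents-per-sheet arithmetic.
cabinet, sheets = 700, 24_000              # $ and sheets per file cabinet
disk = 4_000                               # $ per 8 GB disk
ascii_pages, image_pages = 4_000_000, 200_000

paper_cents = cabinet * 100 / sheets       # ~3 cents/sheet
ascii_cents = disk * 100 / ascii_pages     # 30x cheaper than paper
image_cents = disk * 100 / image_pages     # similar to paper

assert round(paper_cents, 1) == 2.9
assert ascii_cents == 0.1
assert image_cents == 2.0
```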
Summary (of storage)
• Capacity and cost are improving fast (100x per decade)
• Accesses are getting larger (MOX, GOX, SCANS)
• BUT latencies and bandwidth are not improving much (3x per decade)
• How to deal with this???
• Bandwidth:
– Use partitioned parallel access (disk & tape farms)
• Latency
– Pipeline data up storage hierarchy (next section)
Interesting Storage Ratios
• Disk is back to 100x cheaper than RAM
• Nearline tape is only 10x cheaper than disk, and the gap is closing!
[Chart: RAM $/MB : Disk $/MB ratio and Disk $/MB : Nearline Tape ratio vs year, 1960-2000, ranging between 100:1, 30:1, 10:1, and 1:1]
• ??? Why bother with tape? Disk & DRAM look good.
Outline
• Storage trends force pipeline & partition parallelism
– Lots of bytes & bandwidth per dollar
– Lots of latency
• Processor trends force pipeline & partition
– Lots of MIPS per dollar
– Lots of processors
– Putting it together
MicroProcessor Speeds Went Up Fast
• Clock rates went from 10 KHz to 300 MHz
• Processors now 4x issue
• SPECInt92 fits in Cache,
– it tracks cpu speed
• Peak Advertised Performance (PAP) is 1.2 BIPS
• Real Application Performance (RAP) is 60 MIPS
• Similar curves for
– DEC VAX & Alpha
– HP/PA
– IBM R6000/ PowerPC
– MIPS & SGI
– SUN
[Chart: Intel microprocessor speeds (MIPS) vs year, 1980-2000, from 0.1 to 1000: 8088, 286, 386, 486, Pentium, P6. Source: Intel]
System SPECint vs Price
[Chart: SPECint (0 to 700) vs price for 486@66 PCs, Compaq Pentium, NCR 3525, NCR 3555, NCR 3600 AP, Tricord ES 5K, HP 9000, SUN 1000, SUN 2000, SGI L, and SGI XL; some systems scale to 16 processors]
Micros Live Under the Super Curve
• Super GFLOPS went up
– uni-processor 20x in 20 years
– SMP 600x in 20 years
• Microprocessor SPECint went up
– CAG between 40% and 70%
• Microprocessors meet Supers
– same clock speeds soon
• FUTURE:
– modest UniProcessor Speedups
– Must use multiple processors
– (or maybe 1 chip is different?)
[Charts: Workstation SPECint vs time, 1985-1995, 1 to 1000 (MicroVax 45% CAG, Sun 70% CAG; Intel clock 1979-1995 = 42% CAG; 20x in 20 years); Cray GFLOPS vs time, 0.1 to 100 (uniprocessor 16% CAG, MP 38% CAG, 600x in 20 years)]
• PAP: Peak Advertised Performance:
– 300 MHz x 4-issue = 1.2 BIPS
• RAP: Real Application Performance on Memory-Intensive Applications (MIA = commercial):
– 2%-4% L2 cache miss rate: 40 MIPS to 80 MIPS
• MIA UP RAP improved 50x in 30 years:
– CDC 6600 @ 1.4 MIPS in 1964
– Alpha @ 70 MIPS in 1994
• Microprocessors have been growing up under the memory barrier
• Mainframes have been at the memory barrier
PAP vs RAP: Max Memory Performance 10x Better
[Chart: MIPS (0.1 to 1000) vs year, 1960-2000, for IBM 7094, CDC 6600, IBM 360/195, 3090, Amdahl, and Alpha, all pressed up against the memory barrier]
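A hypothetical back-of-envelope model of how a small L2 miss rate turns PAP into RAP; the 150-cycle miss penalty is an assumed number, not from the talk:

```python
# PAP vs RAP: peak issue rate vs effective rate with cache misses.
clock_hz = 300e6     # 300 MHz
issue_width = 4      # 4-issue superscalar

pap_mips = clock_hz * issue_width / 1e6
assert pap_mips == 1200                     # the slide's 1.2 BIPS

def rap_mips(miss_rate, penalty_cycles=150):
    # penalty_cycles is an assumed memory-stall cost per miss
    cpi = 1 / issue_width + miss_rate * penalty_cycles
    return clock_hz / cpi / 1e6

# A 2%-4% miss rate per instruction cuts PAP by an order of magnitude.
assert 40 < rap_mips(0.04) < rap_mips(0.02) < 100
```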
Growing Up Under the Super CurveGrowing Up Under the Super Curve
• Cray & IBM & Amdahl are Fastest Possible (at that time for N megabucks)
• Have GREAT! memory and IO
• Commodity systems growing up under the super memory cloud.
• Near the limit.
• Interesting times ahead
– use parallelism to get speedup
[Chart: Datamation Sort records/second (CPU time only) vs year, 1985-1995, from 100 to 1,000,000: M68000, Tandem, Sequent, Intel HyperCube, Amdahl, Cray YMP, Cray C90, hardware sorter, Alpha]
Thesis: Performance =Storage Accesses not Instructions Executed
• In the “old days” we counted instructions and IO’s
• Now we count memory references
• Processors wait most of the time
Where the time goes: clock ticks used by AlphaSort components
[Chart: breakdown into Sort, Disc Wait, OS, Memory Wait, D-Cache Miss, I-Cache Miss, and B-Cache Data Miss]
• 70 MIPS; "real" apps have worse I-cache misses, so run at 60 MIPS if well tuned, 20 MIPS if not
Storage Latency: How Far Away is the Data?
[Chart: latency in clock ticks, with distance and time analogies: registers 1 (my head, 1 min), on-chip cache 2 (this room), on-board cache 10 (this campus, 10 min), memory 100 (Sacramento, 1.5 hr), disk 10^6 (Pluto, 2 years), tape/optical robot 10^9 (Andromeda, 2,000 years)]
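The tick counts are just latency divided by cycle time; a sketch with an assumed 500 MHz clock and ballpark device latencies (not from the slide):

```python
# How far away is the data, measured in CPU clock ticks?
CLOCK_NS = 2.0                      # 500 MHz processor, hypothetical
latency_ns = {
    "register": 2,                  # ~1 tick
    "memory": 200,                  # ~100 ticks
    "disk": 10_000_000,             # 10 ms: millions of ticks
    "tape robot": 10_000_000_000,   # 10 s: billions of ticks
}
ticks = {device: ns / CLOCK_NS for device, ns in latency_ns.items()}

assert ticks["memory"] == 100
assert ticks["disk"] == 5e6         # "Pluto": the cpu idles for ages
assert ticks["tape robot"] == 5e9
```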
The Pico Processor
• 1 M SPECmarks, 1 TFLOP
• 10^6 clocks to bulk RAM
• Event-horizon on chip
• VM reincarnated
• Multi-program cache, on-chip SMP
• Terror Bytes!
[Figure: pico processor in 1 mm^3, with a storage hierarchy from 10 picosecond RAM through 10 nanosecond and 10 microsecond RAM, 10 millisecond disc, and a 10 second tape archive, capacities ranging from 1 MB up to 100 TB]
Masking Memory Latency
• MicroProcessors got 10,000x faster & cheaper
• Main memories got 10x faster
• So... how to get more work from memory?
– cache memory to hide latency (reuse data)
– wide memory for bandwidth
– pipeline memory access to hide latency
– SMP & threads for partitioned memory access
[Charts: "SMP commercial": tps vs CPUs (0 to 8 CPUs, up to ~1,200 tps); "Pipeline scientific": UK weather forecasting FLOPS vs year, 1950-2000, from 1 K to 1 T (Leo, Mercury, KDF9, 195, 205, YMP): 10^10 in 50 years = 1.58^50]
DataFlow Programming: Prefetch & Postwrite Hide Latency
• Can't wait for the data to arrive (2,000 years!)
• Need a memory that gets the data in advance (100 MB/s)
• Solution: pipeline data to/from the processor
• Pipe data from source (tape, disc, ram...) to cpu cache
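A toy sketch of the prefetch idea: a reader thread fills a bounded queue with the next blocks while the "cpu" consumes the current one, so transfer latency overlaps with computation (illustrative only; a real system pipes from tape, disc, or RAM into the cpu cache):

```python
import queue
import threading

def reader(blocks, pipe):
    # Prefetch each block into the pipe ahead of the consumer.
    for b in blocks:
        pipe.put(b)
    pipe.put(None)                  # end-of-stream marker

def consume(blocks, depth=2):
    pipe = queue.Queue(maxsize=depth)   # bounded: `depth` blocks in flight
    threading.Thread(target=reader, args=(blocks, pipe), daemon=True).start()
    total = 0
    while (b := pipe.get()) is not None:
        total += sum(b)             # the "computation" on each block
    return total

assert consume([[1, 2], [3, 4], [5]]) == 15
```

The bounded queue is the key design choice: it keeps a few blocks in flight (pipelining) without letting the reader run arbitrarily far ahead of the consumer.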
Parallel Execution Masks Latency
• Pipeline: mask latency
• Partition: increase bandwidth
• Overlap computation with latency
[Figure: pipeline and partition arrangements built from "Any Sequential Program" boxes]
• Processors are pushing on the Memory Barrier
• MIA RAP << PAP so learn from the FLOPS
Outline
• Storage trends force pipeline & partition parallelism
– Lots of bytes & bandwidth per dollar
– Lots of latency
• Processor trends force pipeline & partition
– Lots of MIPS per dollar
– Lots of processors
– Putting it together
Thesis: Many Little Beat Few Big
• How to connect the many little parts?
• How to program the many little parts?
• Fault tolerance?
[Figure: mainframe (1 M$), mini (100 K$), micro (10 K$), nano; disk form factors shrinking through 14", 9", 5.25", 3.5", 2.5", 1.8"]
Clusters: Connecting Many Little
• Future servers are CLUSTERS of processors, discs
• Distributed database techniques make clusters work
[Figure: a cluster node: CPU, 5 GB RAM, 50 GB disc]
Success Stories: OLTP
• Transaction Processing, Client/Server, File Server have natural parallelism:
– lots of clients
– lots of small independent requests
– near-linear scaleup
– support > 10 K clients
• Examples:
– Oracle/Rdb scales to 3.7 K tpsA on a 5x4 Alpha cluster
– Tandem scales to 21 K tpmC on a 1x110 Tandem cluster
– Shared nothing scales best
[Chart: throughput vs CPUs (2 to 110), reaching 21 K tpmC]
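The near-linear scaleup claim can be sketched with a simple throughput model; the per-cpu rate and the interference factor are invented numbers for illustration, not Tandem's:

```python
# Near-linear OLTP scaleup: each cpu adds throughput, degraded by a
# small interference cost per additional node.
def throughput(n, per_cpu=200, interference=0.001):
    return n * per_cpu * (1 - interference) ** (n - 1)

t1 = throughput(1)
t110 = throughput(110)

# Even at 110 cpus, this model keeps >85% of perfect linear scaleup,
# the shape the slide's throughput-vs-CPUs chart shows.
assert t1 == 200
assert t110 / (110 * t1) > 0.85
```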
Success Stories: Decision Support
• Relational databases are uniform streams of data
– allows pipelining (much like vector processing)
– allows partitioning (by range or hash)
• Relational operators are closed under composition
– output of one operator can be streamed to the next
• Get linear scaleup on SMP and Shared-Nothing (Teradata, Tandem, Oracle, Informix, ...)
[Figure: data partitioned A...E through T...Z; each partition sorted and joined in parallel, results merged]
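A minimal sketch of the partition-parallel pattern in the figure: range-partition the keys, sort each partition independently (on its own node, in a real system), and concatenate; because the ranges are disjoint and ordered, no final merge pass is needed:

```python
import string

def parallel_sort(keys, parts=5):
    # Range boundaries over A..Z, one range per partition.
    letters = string.ascii_uppercase
    bounds = [letters[i * len(letters) // parts] for i in range(1, parts)]
    buckets = [[] for _ in range(parts)]
    for k in keys:
        i = sum(k[0].upper() >= b for b in bounds)   # which range?
        buckets[i].append(k)
    # Sort each bucket (independently parallelizable), then concatenate.
    return [k for b in buckets for k in sorted(b)]

data = ["zeta", "alpha", "mike", "gray", "kay"]
assert parallel_sort(data) == sorted(data)
```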
Scaleables: Uneconomic So Far
• A Slice is a processor, memory, and a few disks.
• Slice price of Scaleables so far is a 5x to 10x markup:
– Teradata: 70 K$ for an Intel 486 + 32 MB + 4 disks
– Tandem: 100 K$ for a MipsCo R4000 + 64 MB + 4 disks
– Intel: 75 K$ for an i860 + 32 MB + 2 disks
– TMC: 75 K$ for a SPARC 3 + 32 MB + 2 disks
– IBM/SP2: 100 K$ for a R6000 + 64 MB + 8 disks
• Compaq Slice Price is less than 10k$
• What is the problem?
– Proprietary interconnect
– Proprietary packaging
– Proprietary software (vendorIX)
Network Trends & Challenge
• Bandwidth UP 10^4; price went DOWN
• Speed-of-light and Distance unchanged
• Software got worse
• Standard Fast Nets
– ATM
– PCI
– Myrinet
– Tnet
• HOPE:
– Commodity Net
– Good software
• Then clusters become a SNAP! commodity: 10k$/slice
[Chart: bandwidth vs year, 1965-2000, from 1 Kb (10^2) to 1 Gb (10^10), for POTS, WAN, LAN, CAN, and PC bus]
Great Debate: Shared What?
• Shared Memory (SMP): easy to program; difficult to build; difficult to scaleup (Sequent, SGI, Sun)
• Shared Disk: (VMScluster, Sysplex)
• Shared Nothing (network): hard to program; easy to build; easy to scaleup (Tandem, Teradata, SP2)
[Figure: clients connected to each architecture's processors, memory, and disks]
• Winner will be a synthesis of these ideas
• Distributed shared memory (DASH, Encore) blurs the network boundary
Architectural Issues
• Hardware will be parallel
• What is the programming model?
– can you hide locality? No, locality is critical
– if you build an SMP, you must program it as shared-nothing
• Will users learn to program in parallel?
– No, successful products give automatic parallelism
• With 100s of computers, what about management?
– Administration costs 2.5 K$/year/PC (lowest estimate)
– Cluster must be:
» as easy to manage as a single system (it is a single system)
» faults diagnosed & masked automatically
• Message-based computation model
• Transactions
• Checkpoint / Restart
SNAP Business Issues
• Use commodity components (software & hardware)
– Intel won: compatibility is important
– ATM will probably win LAN & WAN, not CAN
– NT will probably win (UNIX too fragmented)
– SQL is winning parallel data access
– What else?
• Automatic parallel programming
– Key to scaleability
– Desktop to glass house.
• Automatic management
– Key to economics
• Palmtops and mobile may be differentiated.
SNAP Systems circa 2000
• A space, time (bandwidth), & generation scalable environment
[Diagram: local & global data comm world: ATM & Ethernet linking PCs, workstations, & servers; a wide-area global ATM network; legacy mainframe & minicomputer servers & terminals; centralized & departmental servers built from PCs; scalable computers built from PCs + CAN; TC = TV + PC at home (via CATV or ATM or satellite); portables, person servers (PCs), and mobile nets]
The SNAP Software Challenge
[Figure: 100 nodes (1 TIPS), 1,000 discs (10 TerrorBytes), and 100 tape transports (1,000 tapes, 1 PetaByte) on a high-speed network (100 Gb/s)]
• Cluster & Network OS
• Automatic administration
• Automatic data placement
• Automatic parallel programming
• Parallel query optimization
• Parallel concepts, algorithms, tools
• Execution techniques: load balance, checkpoint/restart, ...
Outline
• Storage trends force pipeline & partition parallelism
– Lots of bytes & bandwidth per dollar
– Lots of latency
• Processor trends force pipeline & partition
– Lots of MIPS per dollar
– Lots of processors
• Putting it together (Scaleable Networks and Platforms)
– Build clusters of commodity processors & storage
– Commodity interconnect is key (the S of PMS)
» Traditional interconnects give 100 K$/slice.
– Commodity Cluster Operating System is key
– Fault isolation and tolerance is key
– Automatic Parallel Programming is key