Windows NT Scalability
Jim Gray, Microsoft Research
[email protected] (http/Gray/talks)
Bell & Gray 3 / 95
Network Platform
Jim Gray, 310 Filbert, SF CA 94133
Gordon Bell, 450 Old Oak Court, Los Altos, CA, [email protected]
MetaMessage: Technology Ratios Are Important
• If everything gets faster & cheaper at the same rate, THEN nothing really changes.
• Things getting MUCH BETTER (10^4x in 25 years):
– communication speed & cost
– processor speed & cost (PAP)
– storage size & cost
• Things getting a little better (10x in 25 years):
– storage latency & bandwidth
– real application performance (RAP)
• Things staying about the same:
– speed of light (more or less constant)
– people (10x more expensive)
Consequent Message
• Processing and Storage are WONDERFULLY cheaper
• Storage latencies not much improved
• Must get performance (RAP) via
– Pipeline parallelism (mask latency)
– Partition parallelism (bandwidth and mask latency)
• Scaleable Hardware/Software architecture
– Scaleable & Commodity Network / Interconnect
– Commodity Hardware (processors, disks, memory)
– Commodity Software (OS, PL, Apps)
– Scaleability thru automatic parallel programming
– Manage & program as a single system
– Mask faults
Outline
• Storage trends force pipeline & partition parallelism
– Lots of bytes & bandwidth per dollar
– Lots of latency
• Processor trends force pipeline & partition
– Lots of MIPS per dollar
– Lots of processors
– Putting it together
Moore's Law: Exponential Change Means Continual Rejuvenation
• XXX doubles every 18 months = 60% increase per year
– Micro Processor speeds
– CMOS chip density (memory chips)
– Magnetic disk density
– Communications bandwidth
• WAN bandwidth approaching LANs
• Exponential Growth:
– The past does not matter
– 10x here, 10x there, soon you're talking real change
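The 18-month doubling and the 60%-per-year figure are the same claim; a quick back-of-envelope check (not from the slides):

```python
# Growth implied by an 18-month doubling time.
doubling_months = 18
annual = 2 ** (12 / doubling_months)   # growth factor per year
decade = annual ** 10                  # growth factor per decade

assert round((annual - 1) * 100) == 59   # ~60% per year
assert 100 < decade < 110                # ~100x per decade
```

So each decade of Moore's Law delivers roughly two orders of magnitude, which is why "the past does not matter."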
Moore’s Law For Memory
[Chart: one-chip memory capacity vs year, 1970-2000, from 1 Kbit to 256 Mbit; typical one-chip memory size 2 MB to 32 MB; the 640K DOS limit marked]
Will Moore's Law continue to hold?
Moore's Law for Memory
[Chart: memory price at $50/chip vs year, 1970-2000, for chip sizes 1 Kbit to 256 Mbit; number of chips and price for capacities from 8 KB ($50, 1/8th chip at $6) up to 1 GB ($1.6M at early densities); 4 GB capacity with 64 Mb DRAMs; the 640K DOS limit marked]
Trends: Storage Got Cheaper
• $/byte got 10^4 better
• $/access got 10^3 better
• capacity grew 10^3
• Latency down 10x
• Bandwidth up 10x
[Charts: bytes per dollar (10^2 to 10^8) and accesses per second (10^0 to 10^6) vs year, 1970-2000, for RAM, disk, and tape (tape: accesses/hour; disk: accesses/minute; RAM: accesses/second)]
Partition Parallelism Gives Bandwidth
• Parallelism: use many little devices in parallel
• Solves the bandwidth problem
• Beware of the media myth
• Beware of the access time myth
• 1 Terabyte at 10 MB/s: 1.2 days to scan
• 1 Terabyte on 1,000 drives, 1,000x faster: 2 minutes to scan
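The scan arithmetic above checks out; a quick sketch (decimal units assumed):

```python
# One disk vs a 1,000-disk partitioned scan of a terabyte.
TB = 10 ** 12             # 1 terabyte (decimal bytes)
one_disk = 10 * 10 ** 6   # 10 MB/s

scan_s = TB / one_disk                    # sequential scan on one disk
assert round(scan_s / 86_400, 1) == 1.2   # ~1.2 days, as on the slide

many = scan_s / 1_000                     # 1,000 disks, one partition each
assert round(many / 60) == 2              # ~2 minutes
```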
Partitioned Data Has Natural Parallelism
• Split a SQL Table across many disks, memories, processors.
Partition and/or Replicate data to get parallel disk access
[Figure: a table range-partitioned into A...E, F...J, K...N, O...S, T...Z]
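A minimal sketch of the range partitioning the figure shows, with hypothetical row data:

```python
# Range-partition rows across five "disks" using the slide's key ranges.
RANGES = [("A", "E"), ("F", "J"), ("K", "N"), ("O", "S"), ("T", "Z")]

def partition(rows, key=lambda r: r[0]):
    parts = [[] for _ in RANGES]
    for row in rows:
        first = key(row)[0].upper()          # partition on the first letter
        for i, (lo, hi) in enumerate(RANGES):
            if lo <= first <= hi:
                parts[i].append(row)
                break
    return parts

parts = partition([("Adams",), ("Gray",), ("Smith",), ("Zhou",)])
assert [len(p) for p in parts] == [1, 1, 0, 1, 1]
```

Each partition can then be scanned by its own disk and processor in parallel.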
Today’s Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs
[Charts: size in bytes (10^3 to 10^15) and price in $/MB (10^-4 to 10^4) vs access time (10^-9 to 10^3 seconds) for cache, main memory, secondary disc, and online, nearline, and offline tape]
Trends: Application Storage Demand Grew
• The Old World:
– millions of objects
– 100-byte objects
• The New World:
– billions of objects
– big objects (1 MB)
[Figure: a "People" table; the old world holds Name and Address columns; the new world adds Papers, Picture, and Voice columns for the same rows (Mike, Won, David, ...)]
• Paperless office
• Library of Congress online
• All information online: entertainment, publishing, business
• Information Network, Knowledge Navigator, Information At Your Fingertips
Good News: Electronic Storage Ratios Beat Paper
• File Cabinet: cabinet (4 drawer) $250; paper (24,000 sheets) $250; space (2x3 ft @ $10/ft^2) $180; total $700 = 3¢/sheet
• Disk: disk (8 GB) $4,000; ASCII: 4 M pages = 0.1¢/sheet (30x cheaper)
• Image: 200 K pages = 2¢/sheet (similar to paper)
• Store everything on disk
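The cost-per-sheet figures reproduce from the listed prices (assuming ~2 KB per ASCII page and ~40 KB per image page, which is what 4 M and 200 K pages per 8 GB disk imply):

```python
# Reproduce the slide's cents-per-sheet arithmetic.
cabinet, sheets = 700, 24_000              # $ and sheets per file cabinet
disk = 4_000                               # $ per 8 GB disk
ascii_pages, image_pages = 4_000_000, 200_000

paper_cents = cabinet * 100 / sheets       # ~3 cents/sheet
ascii_cents = disk * 100 / ascii_pages     # 30x cheaper than paper
image_cents = disk * 100 / image_pages     # similar to paper

assert round(paper_cents, 1) == 2.9
assert ascii_cents == 0.1
assert image_cents == 2.0
```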
Summary (of storage)
• Capacity and cost are improving fast (100x per decade)
• Accesses are getting larger (MOX, GOX, SCANS)
• BUT latencies and bandwidth are not improving much (3x per decade)
• How to deal with this???
• Bandwidth:
– Use partitioned parallel access (disk & tape farms)
• Latency
– Pipeline data up storage hierarchy (next section)
Interesting Storage Ratios
• Disk is back to 100x cheaper than RAM
• Nearline tape is only 10x cheaper than disk, and the gap is closing!
[Chart: RAM $/MB : Disk $/MB ratio and Disk $/MB : Nearline Tape ratio vs year, 1960-2000, ranging between 100:1, 30:1, 10:1, and 1:1]
• ??? Why bother with tape? Disk & DRAM look good.
Outline
• Storage trends force pipeline & partition parallelism
– Lots of bytes & bandwidth per dollar
– Lots of latency
• Processor trends force pipeline & partition
– Lots of MIPS per dollar
– Lots of processors
– Putting it together
MicroProcessor Speeds Went Up Fast
• Clock rates went from 10 KHz to 300 MHz
• Processors now 4x issue
• SPECInt92 fits in Cache,
– it tracks cpu speed
• Peak Advertised Performance (PAP) is 1.2 BIPS
• Real Application Performance (RAP) is 60 MIPS
• Similar curves for
– DEC VAX & Alpha
– HP/PA
– IBM R6000/ PowerPC
– MIPS & SGI
– SUN
[Chart: Intel microprocessor speeds (MIPS) vs year, 1980-2000, from 0.1 to 1000: 8088, 286, 386, 486, Pentium, P6. Source: Intel]
System SPECint vs Price
[Chart: SPECint (0 to 700) vs price for 486@66 PCs, Compaq Pentium, NCR 3525, NCR 3555, NCR 3600 AP, Tricord ES 5K, HP 9000, SUN 1000, SUN 2000, SGI L, and SGI XL; some systems scale to 16 processors]
Micros Live Under the Super Curve
• Super GFLOPS went up
– uni-processor 20x in 20 years
– SMP 600x in 20 years
• Microprocessor SPECint went up
– CAG between 40% and 70%
• Microprocessors meet Supers
– same clock speeds soon
• FUTURE:
– modest UniProcessor Speedups
– Must use multiple processors
– (or maybe 1 chip is different?)
[Charts: Workstation SPECint vs time, 1985-1995, 1 to 1000 (MicroVax 45% CAG, Sun 70% CAG; Intel clock 1979-1995 = 42% CAG; 20x in 20 years); Cray GFLOPS vs time, 0.1 to 100 (uniprocessor 16% CAG, MP 38% CAG, 600x in 20 years)]
• PAP: Peak Advertised Performance:
– 300 MHz x 4-issue = 1.2 BIPS
• RAP: Real Application Performance on Memory-Intensive Applications (MIA = commercial):
– 2%-4% L2 cache miss rate: 40 MIPS to 80 MIPS
• MIA UP RAP improved 50x in 30 years:
– CDC 6600 @ 1.4 MIPS in 1964
– Alpha @ 70 MIPS in 1994
• Microprocessors have been growing up under the memory barrier
• Mainframes have been at the memory barrier
PAP vs RAP: Max Memory Performance 10x Better
[Chart: MIPS (0.1 to 1000) vs year, 1960-2000, for IBM 7094, CDC 6600, IBM 360/195, 3090, Amdahl, and Alpha, all pressed up against the memory barrier]
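A hypothetical back-of-envelope model of how a small L2 miss rate turns PAP into RAP; the 150-cycle miss penalty is an assumed number, not from the talk:

```python
# PAP vs RAP: peak issue rate vs effective rate with cache misses.
clock_hz = 300e6     # 300 MHz
issue_width = 4      # 4-issue superscalar

pap_mips = clock_hz * issue_width / 1e6
assert pap_mips == 1200                     # the slide's 1.2 BIPS

def rap_mips(miss_rate, penalty_cycles=150):
    # penalty_cycles is an assumed memory-stall cost per miss
    cpi = 1 / issue_width + miss_rate * penalty_cycles
    return clock_hz / cpi / 1e6

# A 2%-4% miss rate per instruction cuts PAP by an order of magnitude.
assert 40 < rap_mips(0.04) < rap_mips(0.02) < 100
```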
Growing Up Under the Super CurveGrowing Up Under the Super Curve
• Cray & IBM & Amdahl are Fastest Possible (at that time for N megabucks)
• Have GREAT! memory and IO
• Commodity systems growing up under the super memory cloud.
• Near the limit.
• Interesting times ahead
– use parallelism to get speedup
[Chart: Datamation Sort records/second (CPU time only) vs year, 1985-1995, from 100 to 1,000,000: M68000, Tandem, Sequent, Intel HyperCube, Amdahl, Cray YMP, Cray C90, hardware sorter, Alpha]
Thesis: Performance =Storage Accesses not Instructions Executed
• In the “old days” we counted instructions and IO’s
• Now we count memory references
• Processors wait most of the time
Where the time goes: clock ticks used by AlphaSort components
[Chart: breakdown into Sort, Disc Wait, OS, Memory Wait, D-Cache Miss, I-Cache Miss, and B-Cache Data Miss]
• 70 MIPS; "real" apps have worse I-cache misses, so run at 60 MIPS if well tuned, 20 MIPS if not
Storage Latency: How Far Away is the Data?
[Chart: latency in clock ticks, with distance and time analogies: registers 1 (my head, 1 min), on-chip cache 2 (this room), on-board cache 10 (this campus, 10 min), memory 100 (Sacramento, 1.5 hr), disk 10^6 (Pluto, 2 years), tape/optical robot 10^9 (Andromeda, 2,000 years)]
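The tick counts are just latency divided by cycle time; a sketch with an assumed 500 MHz clock and ballpark device latencies (not from the slide):

```python
# How far away is the data, measured in CPU clock ticks?
CLOCK_NS = 2.0                      # 500 MHz processor, hypothetical
latency_ns = {
    "register": 2,                  # ~1 tick
    "memory": 200,                  # ~100 ticks
    "disk": 10_000_000,             # 10 ms: millions of ticks
    "tape robot": 10_000_000_000,   # 10 s: billions of ticks
}
ticks = {device: ns / CLOCK_NS for device, ns in latency_ns.items()}

assert ticks["memory"] == 100
assert ticks["disk"] == 5e6         # "Pluto": the cpu idles for ages
assert ticks["tape robot"] == 5e9
```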
The Pico Processor
• 1 M SPECmarks, 1 TFLOP
• 10^6 clocks to bulk RAM
• Event-horizon on chip
• VM reincarnated
• Multi-program cache, on-chip SMP
• Terror Bytes!
[Figure: pico processor in 1 mm^3, with a storage hierarchy from 10 picosecond RAM through 10 nanosecond and 10 microsecond RAM, 10 millisecond disc, and a 10 second tape archive, capacities ranging from 1 MB up to 100 TB]
Masking Memory Latency
• MicroProcessors got 10,000x faster & cheaper
• Main memories got 10x faster
• So... how to get more work from memory?
– cache memory to hide latency (reuse data)
– wide memory for bandwidth
– pipeline memory access to hide latency
– SMP & threads for partitioned memory access
[Charts: "SMP commercial": tps vs CPUs (0 to 8 CPUs, up to ~1,200 tps); "Pipeline scientific": UK weather forecasting FLOPS vs year, 1950-2000, from 1 K to 1 T (Leo, Mercury, KDF9, 195, 205, YMP): 10^10 in 50 years = 1.58^50]
DataFlow Programming: Prefetch & Postwrite Hide Latency
• Can't wait for the data to arrive (2,000 years!)
• Need a memory that gets the data in advance (100 MB/s)
• Solution: pipeline data to/from the processor
• Pipe data from source (tape, disc, ram...) to cpu cache
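A toy sketch of the prefetch idea: a reader thread fills a bounded queue with the next blocks while the "cpu" consumes the current one, so transfer latency overlaps with computation (illustrative only; a real system pipes from tape, disc, or RAM into the cpu cache):

```python
import queue
import threading

def reader(blocks, pipe):
    # Prefetch each block into the pipe ahead of the consumer.
    for b in blocks:
        pipe.put(b)
    pipe.put(None)                  # end-of-stream marker

def consume(blocks, depth=2):
    pipe = queue.Queue(maxsize=depth)   # bounded: `depth` blocks in flight
    threading.Thread(target=reader, args=(blocks, pipe), daemon=True).start()
    total = 0
    while (b := pipe.get()) is not None:
        total += sum(b)             # the "computation" on each block
    return total

assert consume([[1, 2], [3, 4], [5]]) == 15
```

The bounded queue is the key design choice: it keeps a few blocks in flight (pipelining) without letting the reader run arbitrarily far ahead of the consumer.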
Parallel Execution Masks Latency
• Pipeline: mask latency
• Partition: increase bandwidth
• Overlap computation with latency
[Figure: pipeline and partition arrangements built from "Any Sequential Program" boxes]
• Processors are pushing on the Memory Barrier
• MIA RAP << PAP so learn from the FLOPS
Outline
• Storage trends force pipeline & partition parallelism
– Lots of bytes & bandwidth per dollar
– Lots of latency
• Processor trends force pipeline & partition
– Lots of MIPS per dollar
– Lots of processors
– Putting it together
Thesis: Many Little Beat Few Big
• How to connect the many little parts?
• How to program the many little parts?
• Fault tolerance?
[Figure: mainframe (1 M$), mini (100 K$), micro (10 K$), nano; disk form factors shrinking through 14", 9", 5.25", 3.5", 2.5", 1.8"]
Clusters: Connecting Many Little
• Future servers are CLUSTERS of processors, discs
• Distributed database techniques make clusters work
[Figure: a cluster node: CPU, 5 GB RAM, 50 GB disc]
Success Stories: OLTP
• Transaction Processing, Client/Server, File Server have natural parallelism:
– lots of clients
– lots of small independent requests
– near-linear scaleup
– support > 10 K clients
• Examples:
– Oracle/Rdb scales to 3.7 K tpsA on a 5x4 Alpha cluster
– Tandem scales to 21 K tpmC on a 1x110 Tandem cluster
– Shared nothing scales best
[Chart: throughput vs CPUs (2 to 110), reaching 21 K tpmC]
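The near-linear scaleup claim can be sketched with a simple throughput model; the per-cpu rate and the interference factor are invented numbers for illustration, not Tandem's:

```python
# Near-linear OLTP scaleup: each cpu adds throughput, degraded by a
# small interference cost per additional node.
def throughput(n, per_cpu=200, interference=0.001):
    return n * per_cpu * (1 - interference) ** (n - 1)

t1 = throughput(1)
t110 = throughput(110)

# Even at 110 cpus, this model keeps >85% of perfect linear scaleup,
# the shape the slide's throughput-vs-CPUs chart shows.
assert t1 == 200
assert t110 / (110 * t1) > 0.85
```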
Success Stories: Decision Support
• Relational databases are uniform streams of data
– allows pipelining (much like vector processing)
– allows partitioning (by range or hash)
• Relational operators are closed under composition
– output of one operator can be streamed to the next
• Get linear scaleup on SMP and Shared-Nothing (Teradata, Tandem, Oracle, Informix, ...)
[Figure: data partitioned A...E through T...Z; each partition sorted and joined in parallel, results merged]
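A minimal sketch of the partition-parallel pattern in the figure: range-partition the keys, sort each partition independently (on its own node, in a real system), and concatenate; because the ranges are disjoint and ordered, no final merge pass is needed:

```python
import string

def parallel_sort(keys, parts=5):
    # Range boundaries over A..Z, one range per partition.
    letters = string.ascii_uppercase
    bounds = [letters[i * len(letters) // parts] for i in range(1, parts)]
    buckets = [[] for _ in range(parts)]
    for k in keys:
        i = sum(k[0].upper() >= b for b in bounds)   # which range?
        buckets[i].append(k)
    # Sort each bucket (independently parallelizable), then concatenate.
    return [k for b in buckets for k in sorted(b)]

data = ["zeta", "alpha", "mike", "gray", "kay"]
assert parallel_sort(data) == sorted(data)
```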
Scaleables: Uneconomic So Far
• A Slice is a processor, memory, and a few disks.
• Slice price of Scaleables so far is a 5x to 10x markup:
– Teradata: 70 K$ for an Intel 486 + 32 MB + 4 disks
– Tandem: 100 K$ for a MipsCo R4000 + 64 MB + 4 disks
– Intel: 75 K$ for an i860 + 32 MB + 2 disks
– TMC: 75 K$ for a SPARC 3 + 32 MB + 2 disks
– IBM/SP2: 100 K$ for a R6000 + 64 MB + 8 disks
• Compaq Slice Price is less than 10k$
• What is the problem?
– Proprietary interconnect
– Proprietary packaging
– Proprietary software (vendorIX)
Network Trends & Challenge
• Bandwidth UP 10^4; price went DOWN
• Speed-of-light and Distance unchanged
• Software got worse
• Standard Fast Nets
– ATM
– PCI
– Myrinet
– Tnet
• HOPE:
– Commodity Net
– Good software
• Then clusters become a SNAP! commodity: 10k$/slice
[Chart: bandwidth vs year, 1965-2000, from 1 Kb (10^2) to 1 Gb (10^10), for POTS, WAN, LAN, CAN, and PC bus]
Great Debate: Shared What?
• Shared Memory (SMP): easy to program; difficult to build; difficult to scaleup (Sequent, SGI, Sun)
• Shared Disk: (VMScluster, Sysplex)
• Shared Nothing (network): hard to program; easy to build; easy to scaleup (Tandem, Teradata, SP2)
[Figure: clients connected to each architecture's processors, memory, and disks]
• Winner will be a synthesis of these ideas
• Distributed shared memory (DASH, Encore) blurs the network boundary
Architectural Issues
• Hardware will be parallel
• What is the programming model?
– can you hide locality? No, locality is critical
– if you build an SMP, you must program it as shared-nothing
• Will users learn to program in parallel?
– No, successful products give automatic parallelism
• With 100s of computers, what about management?
– Administration costs 2.5 K$/year/PC (lowest estimate)
– Cluster must be:
» as easy to manage as a single system (it is a single system)
» faults diagnosed & masked automatically
• Message-based computation model
• Transactions
• Checkpoint / Restart
SNAP Business Issues
• Use commodity components (software & hardware)
– Intel won: compatibility is important
– ATM will probably win LAN & WAN, not CAN
– NT will probably win (UNIX too fragmented)
– SQL is winning parallel data access
– What else?
• Automatic parallel programming
– Key to scaleability
– Desktop to glass house.
• Automatic management
– Key to economics
• Palmtops and mobile may be differentiated.
SNAP Systems circa 2000
• A space, time (bandwidth), & generation scalable environment
[Diagram: local & global data comm world: ATM & Ethernet linking PCs, workstations, & servers; a wide-area global ATM network; legacy mainframe & minicomputer servers & terminals; centralized & departmental servers built from PCs; scalable computers built from PCs + CAN; TC = TV + PC at home (via CATV or ATM or satellite); portables, person servers (PCs), and mobile nets]
The SNAP Software Challenge
[Figure: 100 nodes (1 TIPS), 1,000 discs (10 TerrorBytes), and 100 tape transports (1,000 tapes, 1 PetaByte) on a high-speed network (100 Gb/s)]
• Cluster & Network OS
• Automatic administration
• Automatic data placement
• Automatic parallel programming
• Parallel query optimization
• Parallel concepts, algorithms, tools
• Execution techniques: load balance, checkpoint/restart, ...
Outline
• Storage trends force pipeline & partition parallelism
– Lots of bytes & bandwidth per dollar
– Lots of latency
• Processor trends force pipeline & partition
– Lots of MIPS per dollar
– Lots of processors
• Putting it together (Scaleable Networks and Platforms)
– Build clusters of commodity processors & storage
– Commodity interconnect is key (the S of PMS)
» Traditional interconnects give 100 K$/slice.
– Commodity Cluster Operating System is key
– Fault isolation and tolerance is key
– Automatic Parallel Programming is key