Computers for the Post-PC Era
Slide 1
Computers for the Post-PC Era
David Patterson
University of California at Berkeley
UC Berkeley IRAM Group UC Berkeley ISTORE Group
May 2000
Slide 2
Perspective on Post-PC Era
• PostPC Era will be driven by 2 technologies:
1) “Gadgets”: Tiny Embedded or Mobile Devices
– ubiquitous: in everything
– e.g., successor to PDA, cell phone, wearable computers
2) Infrastructure to Support such Devices
– e.g., successor to Big Fat Web Servers, Database Servers
Slide 3
VIRAM-1 Block Diagram
Slide 4
VIRAM-1: System on a Chip
Prototype scheduled for tape-out mid 2000
•0.18 um EDL process
•16 MB DRAM, 8 banks
•MIPS Scalar core and caches @ 200 MHz
•4 64-bit vector unit pipelines @ 200 MHz
•4 100 MB parallel I/O lines
•17x17 mm, 2 Watts
•25.6 GB/s memory (6.4 GB/s per direction and per Xbar)
•1.6 Gflops (64-bit), 6.4 GOPs (16-bit)
[Floorplan figure: CPU+$, I/O, 4 Vector Pipes/Lanes, Xbar, and two memory blocks (64 Mbits / 8 MBytes each)]
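The peak figures on this slide can be reproduced with a short calculation. This is a sketch under stated assumptions, not the designers' own accounting: it assumes a multiply-add counts as 2 operations per element per cycle, that each 64-bit pipeline processes four 16-bit elements per cycle, and that the memory figure covers 2 directions on each of 2 crossbars.

```python
CLOCK_HZ = 200e6          # from the slide: vector pipelines @ 200 MHz
PIPELINES = 4             # from the slide: 4 64-bit vector unit pipelines
PIPELINE_WIDTH_BITS = 64
OPS_PER_MULTIPLY_ADD = 2  # assumption: one multiply-add counts as 2 operations


def peak_ops_per_second(element_bits):
    """Peak operations per second for a given element width."""
    elements_per_pipeline = PIPELINE_WIDTH_BITS // element_bits
    return PIPELINES * elements_per_pipeline * OPS_PER_MULTIPLY_ADD * CLOCK_HZ


def peak_memory_gb_per_second(per_direction_per_xbar=6.4, directions=2, xbars=2):
    """Aggregate bandwidth, assuming 2 directions on each of 2 crossbars."""
    return per_direction_per_xbar * directions * xbars


print(peak_ops_per_second(64) / 1e9)   # Gflops at 64-bit
print(peak_ops_per_second(16) / 1e9)   # GOPs at 16-bit
print(peak_memory_gb_per_second())     # GB/s
```

Under these assumptions the results match the slide's 1.6 Gflops, 6.4 GOPs, and 25.6 GB/s.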
Slide 5
Problem: General Element Permutation
• Hardware for a full vector permutation instruction (128 16b elements, 256b datapath)
• Datapath: 16 x 16 (x 16b) crossbar; scales by O(N^2)
• Control: 16 16-to-1 multiplexors; scales by O(N*logN)
• Other problems
– Consecutive result elements not written together; time/energy wasted on wide vector register file port
[Figure: crossbar with 16b ports numbered 0 to 15]
Slide 6
Simple Vector Permutations
• Simple steps of butterfly permutations
– A register provides the butterfly radix
– Separate instructions for moving elements to left/right
• Sufficient semantics for
– Fast reductions of vector registers (dot products)
– Fast FFT/DCT kernels
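To illustrate why butterfly steps are enough for reductions, here is a minimal software model. The slides do not give the exact instruction semantics, so `move_left` is a hypothetical stand-in: within each group of `2 * radix` elements it copies the upper half onto the lower half. Repeating that with a halving radix sums a power-of-two-length vector in log2(N) steps.

```python
def move_left(vec, radix):
    """Hypothetical butterfly step: in each group of 2*radix elements,
    copy the upper `radix` elements onto the lower `radix` positions."""
    out = list(vec)
    for base in range(0, len(vec), 2 * radix):
        for i in range(radix):
            out[base + i] = vec[base + radix + i]
    return out


def reduce_sum(vec):
    """Sum a vector whose length is a power of two using butterfly steps."""
    n = len(vec)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    acc = list(vec)
    radix = n // 2
    while radix >= 1:
        shifted = move_left(acc, radix)
        acc = [a + s for a, s in zip(acc, shifted)]  # element-wise vector add
        radix //= 2
    return acc[0]  # element 0 now holds the full sum


def dot_product(a, b):
    """Element-wise multiply followed by a butterfly reduction."""
    return reduce_sum([x * y for x, y in zip(a, b)])


print(dot_product([1, 2, 3, 4], [5, 6, 7, 8]))
```

Each step touches the whole vector with one permutation and one add, which is the property that lets the hardware stay simple.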
[Figure: butterfly permutation step, elements 0 to 15]
Slide 7
Hardware for Simple Permutations
• Hardware for 128 16b elements, 256b datapath
• Datapath: 2 buses, 8 tristate drivers, 4 multiplexors, 4 shifters (by 0, 16b, 32b only); scales by O(N)
• Control: 6 control cases; scales by O(N)
• Other benefits
– Consecutive result elements written together
– Buses used only for small radices
[Figure: datapath built from 64b segments 0 to 3, each with a shifter]
Slide 8
FFT: Straightforward
Problem: most time spent in short vectors in later stages of FFT
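The short-vector problem can be seen by listing the natural vector length at each stage. This sketch assumes a radix-2 FFT vectorized in the straightforward way, where the vector length at a stage equals the butterfly span, so it halves from stage to stage. The function names and the `threshold` parameter are illustrative, not taken from the slides.

```python
def natural_vector_lengths(n_points):
    """Vector length per stage of a radix-2 FFT, assuming the vector
    length equals the butterfly span and halves at every stage."""
    assert n_points > 1 and n_points & (n_points - 1) == 0
    stages = n_points.bit_length() - 1  # log2(n_points)
    return [n_points >> (stage + 1) for stage in range(stages)]


def short_stage_count(n_points, threshold):
    """Number of stages whose vectors are shorter than `threshold`
    (a hypothetical length needed to keep the vector unit busy)."""
    return sum(1 for length in natural_vector_lengths(n_points) if length < threshold)


print(natural_vector_lengths(128))
print(short_stage_count(128, threshold=16))
```

For 128 points the lengths run from 64 down to 1, so the later stages issue many instructions that each do very little work.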
Slide 9
FFT: Transpose inside Vector Regs
[Chart: 32b floating point; data labels 200, 223, 225, 238, 251, 264]
Slide 10
FFT: Straightforward
Slide 11
VIRAM-1 Design Status
• MIPS scalar core
– Synthesizable RTL code received from MIPS
– Cache RAMs to be compiled for IBM technology
– FPU RTL code almost complete
• Vector unit
– RTL models for sub-blocks developed; currently integrated and tested
– Control logic to be compiled for IBM technology
– Full-custom layout for multipliers/adders developed; layout for shifters to be developed
• Memory system
– Synthesizable model for DRAM controllers done
– To be integrated with IBM DRAM macros
– Full-custom layout for crossbar under development
• Testing infrastructure
– Environment developed for automatic test & validation
– Directed tests for single/multiple instruction groups developed
– Random instruction sequence generator developed
Slide 12
FPU Features
• Executes MIPS IV ISA single-precision FP instructions
• Thirty-two 32-bit Floating Point Registers
• Two 32-bit Control Registers
• One 3-cycle (division takes 10 cycles), fully pipelined, nearly full IEEE-754 compliant execution unit (from Albert Ma@MIT)
• 6-stage pipeline (R-X-X-X-CDB-WB)
• Support for partial out-of-order execution and precise exceptions
• Scalar Core dispatches FP instructions to FPU using an interface that splits instructions into 3 classes:
– Arithmetic instructions (ADD.S, SUB.S, MUL.S, DIV.S, ABS.S, NEG.S, C.cond.S, CVT.S.W, CVT.W.S, TRUNC.W.S, MOV.S, MOVZ.S, MOVN.S)
– From Coprocessor Data Transfer instructions (SWC1, MFC1, CFC1)
– To Coprocessor Data Transfer instructions (LWC1, MTC1, CTC1)
Slide 13
FPU Architecture
Slide 14
Multiplier Partitioning
•64-bit multiplier built from 16-bit multiplier subblocks
•Subblocks combined with adders to perform larger multiplies
•Performs 2 simultaneous 32-bit multiplies by grouping 4 subblocks
•Performs 4 simultaneous 16-bit multiplies by using individual subblocks
•Unused blocks turned off to conserve power
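The partitioning idea can be modeled in software. This is a sketch for unsigned operands only, and it is not a description of the actual VIRAM-1 circuit: each operand is split into 16-bit pieces, every pair of pieces goes through one 16x16 subblock, and the shifted partial products are added. The helper names are illustrative.

```python
MASK16 = 0xFFFF


def split16(value, pieces):
    """Split an unsigned integer into `pieces` 16-bit chunks, low chunk first."""
    return [(value >> (16 * i)) & MASK16 for i in range(pieces)]


def mul16(a, b):
    """Model of one 16x16 multiplier subblock (32-bit unsigned result)."""
    assert 0 <= a <= MASK16 and 0 <= b <= MASK16
    return a * b


def partitioned_multiply(a, b, width_bits):
    """Multiply two unsigned `width_bits` values using only 16x16 subblocks.
    Returns (product, number_of_subblocks_used)."""
    pieces = width_bits // 16
    product = 0
    subblocks_used = 0
    for i, a_piece in enumerate(split16(a, pieces)):
        for j, b_piece in enumerate(split16(b, pieces)):
            product += mul16(a_piece, b_piece) << (16 * (i + j))
            subblocks_used += 1
    return product, subblocks_used


print(partitioned_multiply(0xFFFFFFFFFFFFFFFF, 2, 64))
print(partitioned_multiply(0xDEADBEEF, 0x12345678, 32)[1])
```

In this model a 64-bit multiply uses 16 subblocks, a 32-bit multiply uses 4, and a 16-bit multiply uses 1, which is consistent with the slide: two 32-bit multiplies occupy 8 subblocks and four 16-bit multiplies occupy 4, leaving the rest free to be turned off.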
Slide 15
FPU Current Status
• Current Functionality
– Able to execute most instructions (all except C.cond.S, CFC1 and CTC1).
– Supports precise exception semantics.
– Functionality verification.
» Used a random test generator that generates/kills instructions at random and compares the results from the RTL Verilog simulator against the results from an ISA Perl simulator.
• What remains to be done
– Instructions that use the Control Registers (C.cond.S, CFC1 and CTC1).
– Exception generation.
– Integrate execution pipeline with the rest of the design.
– Synthesize, place and route.
– Final assembly and verification of multiplier
• Performance
– Sustainable Throughput: 1 instruction/cycle (assuming no data hazards)
– Instruction Latency: 6 cycles
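The verification approach described above, running random instruction streams through two models and comparing results, can be sketched as follows. Everything here is a hypothetical stand-in: `reference_step` plays the role of the ISA simulator, `buggy_step` is a deliberately broken model standing in for RTL under test, and instruction killing is not modeled.

```python
import math
import random
import struct

OPS = ("ADD.S", "SUB.S", "MUL.S")


def to_f32(x):
    """Round a Python float to single precision; overflow becomes infinity."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def reference_step(regs, instr):
    """Stand-in ISA model: execute one (op, rd, rs, rt) instruction in place."""
    op, rd, rs, rt = instr
    a, b = regs[rs], regs[rt]
    result = {"ADD.S": a + b, "SUB.S": a - b, "MUL.S": a * b}[op]
    regs[rd] = to_f32(result)


def buggy_step(regs, instr):
    """Deliberately broken model: executes SUB.S as if it were ADD.S."""
    op, rd, rs, rt = instr
    reference_step(regs, ("ADD.S" if op == "SUB.S" else op, rd, rs, rt))


def random_program(rng, length):
    """Generate `length` random instructions over 32 registers."""
    return [(rng.choice(OPS), rng.randrange(32), rng.randrange(32), rng.randrange(32))
            for _ in range(length)]


def first_mismatch(program, step_a, step_b, initial_regs):
    """Index of the first instruction after which the register files differ,
    or None. Bit patterns are compared so that NaN values compare equal."""
    regs_a, regs_b = list(initial_regs), list(initial_regs)
    for pc, instr in enumerate(program):
        step_a(regs_a, instr)
        step_b(regs_b, instr)
        if struct.pack("<32f", *regs_a) != struct.pack("<32f", *regs_b):
            return pc
    return None


initial = [float(i) for i in range(32)]
print(first_mismatch(random_program(random.Random(0), 50),
                     reference_step, reference_step, initial))
```

Comparing a model against itself reports no mismatch; swapping in `buggy_step` reports the first instruction where the two register files diverge.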
Slide 16
UC-IBM Agreement
• Biggest IRAM Obstacle: Intellectual Property Agreement between University of California and IBM
• Can university accept free fab costs ($2.0M to $2.5M) in return for capped non-exclusive patent licensing fees for IBM if UC files for IRAM patents?
• Process started with IBM March 1999
• IBM won’t give full process info until contract
• UC started negotiating seriously Jan 2000
• Agreement June 1, 2000!
Slide 17
Other examples: IBM “Blue Gene”
• 1 PetaFLOPS in 2005 for $100M?
• Application: Protein Folding
• Blue Gene Chip
– 32 Multithreaded RISC processors + ??MB Embedded DRAM + high speed Network Interface on single 20 x 20 mm chip
– 1 GFLOPS / processor
• 2’ x 2’ Board = 64 chips (2K CPUs)
• Rack = 8 Boards (512 chips, 16K CPUs)
• System = 64 Racks (512 boards, 32K chips, 1M CPUs)
• Total 1 million processors in just 2000 sq. ft.
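The packaging numbers above multiply out as stated. The short check below uses only figures from the slide; the function and constant names are illustrative.

```python
CPUS_PER_CHIP = 32
CHIPS_PER_BOARD = 64
BOARDS_PER_RACK = 8
RACKS = 64
GFLOPS_PER_CPU = 1


def blue_gene_totals():
    """Totals for the full system from the per-level counts on the slide."""
    boards = BOARDS_PER_RACK * RACKS
    chips = boards * CHIPS_PER_BOARD
    cpus = chips * CPUS_PER_CHIP
    petaflops = cpus * GFLOPS_PER_CPU / 1e6  # 1 PFLOPS = 1e6 GFLOPS
    return {"boards": boards, "chips": chips, "cpus": cpus, "petaflops": petaflops}


print(blue_gene_totals())
```

The result is 512 boards, 32,768 chips, and 1,048,576 processors, or roughly 1.05 PetaFLOPS peak at 1 GFLOPS per processor.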
Slide 18
Other examples: Sony Playstation 2
• Emotion Engine: 6.2 GFLOPS, 75 million polygons per second (Microprocessor Report, 13:5)
– Superscalar MIPS core + vector coprocessor + graphics/DRAM
– Claim: “Toy Story” realism brought to games
Slide 19
Outline
1) Example microprocessor for PostPC gadgets
2) Motivation and the ISTORE project vision
– AME: Availability, Maintainability, Evolutionary growth
– ISTORE’s research principles
– Benchmarks for AME
• Conclusions and future work
Slide 20
Lampson: Systems Challenges
• Systems that work
– Meeting their specs
– Always available
– Adapting to changing environment
– Evolving while they run
– Made from unreliable components
– Growing without practical limit
• Credible simulations or analysis
• Writing good specs
• Testing
• Performance
– Understanding when it doesn’t matter
“Computer Systems Research-Past and Future”, Keynote address, 17th SOSP, Dec. 1999, Butler Lampson, Microsoft
Slide 21
Hennessy: What Should the “New World” Focus Be?
• Availability
– Both appliance & service
• Maintainability
– Two functions:
» Enhancing availability by preventing failure
» Ease of SW and HW upgrades
• Scalability
– Especially of service
• Cost
– per device and per service transaction
• Performance
– Remains important, but it’s not SPECint
“Back to the Future: Time to Return to Longstanding Problems in Computer Systems?” Keynote address, FCRC, May 1999, John Hennessy, Stanford
Slide 22
The real scalability problems: AME
• Availability
– systems should continue to meet quality of service goals despite hardware and software failures
• Maintainability
– systems should require only minimal ongoing human administration, regardless of scale or complexity
• Evolutionary Growth
– systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
• These are problems at today’s scales, and will only get worse as systems grow
Slide 23
Principles for achieving AME (1)
• No single points of failure
• Redundancy everywhere
• Performance robustness is more important than peak performance
– “performance robustness” implies that real-world performance is comparable to best-case performance
• Performance can be sacrificed for improvements in AME
– resources should be dedicated to AME
» compare: biological systems spend > 50% of resources on maintenance
– can make up performance by scaling system
Slide 24
Principles for achieving AME (2)
• Introspection
– reactive techniques to detect and adapt to failures, workload variations, and system evolution
– proactive techniques to anticipate and avert problems before they happen
Slide 25
ISTORE-1 hardware platform
• 80-node x86-based cluster, 1.4TB storage
– cluster nodes are plug-and-play, intelligent, network-attached storage “bricks”
» a single field-replaceable unit to simplify maintenance
– each node is a full x86 PC w/256MB DRAM, 18GB disk
– more CPU than NAS; fewer disks/node than cluster
ISTORE Chassis
• 80 nodes, 8 per tray
• 2 levels of switches
– 20 100 Mbit/s
– 2 1 Gbit/s
• Environment Monitoring: UPS, redundant PS, fans, heat and vibration sensors...
Intelligent Disk “Brick”
• Portable PC CPU: Pentium II/266 + DRAM
• Redundant NICs (4 100 Mb/s links)
• Diagnostic Processor
• Disk
• Half-height canister
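As a quick consistency check on the platform figures, the per-node numbers can be scaled to the full cluster. This sketch assumes decimal terabytes (1 TB = 1000 GB) for disk and binary gigabytes (1 GB = 1024 MB) for DRAM; the names are illustrative.

```python
NODES = 80
DISK_GB_PER_NODE = 18
DRAM_MB_PER_NODE = 256
NODES_PER_TRAY = 8


def istore_totals():
    """Cluster-wide totals derived from the per-node figures on the slide."""
    return {
        "storage_tb": NODES * DISK_GB_PER_NODE / 1000,  # assumes 1 TB = 1000 GB
        "dram_gb": NODES * DRAM_MB_PER_NODE / 1024,     # assumes 1 GB = 1024 MB
        "trays": NODES // NODES_PER_TRAY,
    }


print(istore_totals())
```

This gives 1.44 TB of disk, in line with the quoted 1.4TB, along with 20 GB of aggregate DRAM across 10 trays.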
Slide 26
ISTORE-1 Status
• 10 Nodes manufactured
• Boots OS
• Diagnostic Processor Interface SW complete
• PCB backplane: not yet designed
• Finish 80 node system: Summer 2000
Slide 27
Hardware techniques
• Fully shared-nothing cluster organization
– truly scalable architecture
– architecture that tolerates partial failure
– automatic hardware redundancy
Slide 28
Hardware techniques (2)
• No Central Processor Unit: distribute processing with storage
– Serial lines, switches also growing with Moore’s Law; less need today to centralize vs. bus oriented systems
– Most storage servers limited by speed of CPUs; why does this make sense?
– Why not amortize sheet metal, power, cooling infrastructure for disk to add processor, memory, and network?
– If AME is important, must provide resources to be used to help AME: local processors responsible for health and maintenance of their storage
Slide 29
Hardware techniques (3)
• Heavily instrumented hardware
– sensors for temp, vibration, humidity, power, intrusion
– helps detect environmental problems before they can affect system integrity
• Independent diagnostic processor on each node
– provides remote control of power, remote console access to the node, selection of node boot code
– collects, stores, processes environmental data for abnormalities
– non-volatile “flight recorder” functionality
– all diagnostic processors connected via independent diagnostic network
Slide 30
Hardware techniques (4)
• On-demand network partitioning/isolation
– Internet applications must remain available despite failures of components, therefore can isolate a subset for preventative maintenance
– Allows testing, repair of online system
– Managed by diagnostic processor and network switches via diagnostic network
Slide 31
Hardware techniques (5)
• Built-in fault injection capabilities
– Power control to individual node components
– Injectable glitches into I/O and memory busses
– Managed by diagnostic processor
– Used for proactive hardware introspection
» automated detection of flaky components
» controlled testing of error-recovery mechanisms
– Important for AME benchmarking (see next slide)
Slide 32
“Hardware” techniques (6)
• Benchmarking
– One reason for 1000X processor performance was ability to measure (vs. debate) which is better
» e.g., Which is most important to improve: clock rate, clocks per instruction, or instructions executed?
– Need AME benchmarks
“what gets measured gets done”
“benchmarks shape a field”
“quantification brings rigor”
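The three quantities in the example combine in the standard CPU performance equation: execution time = instructions executed x clocks per instruction / clock rate. The sketch below applies it to a made-up workload; the baseline numbers and the three candidate improvements are hypothetical and chosen only to show how measurement settles the question.

```python
def cpu_time_seconds(instructions, cpi, clock_hz):
    """Execution time from instruction count, clocks per instruction, and clock rate."""
    return instructions * cpi / clock_hz


def compare_improvements():
    """Execution time for a hypothetical baseline and three candidate changes."""
    base = {"instructions": 1e9, "cpi": 2.0, "clock_hz": 200e6}  # hypothetical workload
    candidates = {
        "baseline": base,
        "faster clock (250 MHz)": {**base, "clock_hz": 250e6},
        "lower CPI (1.5)": {**base, "cpi": 1.5},
        "20% fewer instructions": {**base, "instructions": 0.8e9},
    }
    return {name: cpu_time_seconds(**params) for name, params in candidates.items()}


for name, seconds in compare_improvements().items():
    print(f"{name}: {seconds:.2f} s")
```

With a benchmark that reports all three factors, the choice among designs becomes a comparison of measured times instead of a debate.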
Slide 33
Availability benchmark methodology
• Goal: quantify variation in QoS metrics as events occur that affect system availability
• Leverage existing performance benchmarks
– to generate fair workloads
– to measure & trace quality of service metrics
• Use fault injection to compromise system
– hardware faults (disk, memory, network, power)
– software faults (corrupt input, driver error returns)
– maintenance events (repairs, SW/HW upgrades)
• Examine single-fault and multi-fault workloads
– the availability analogues of performance micro- and macro-benchmarks
Slide 34
Benchmark Availability? Methodology for reporting results
• Results are most accessible graphically
– plot change in QoS metrics over time
– compare to “normal” behavior?
» 99% confidence intervals calculated from no-fault runs
[Figure: Performance (160 to 210) vs. Time (2-minute intervals, 0 to 60); annotations: normal behavior (99% conf), injected disk failure, reconstruction]
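The comparison against "normal" behavior can be sketched in a few lines. The slides do not say how the 99% interval was computed, so this example takes one plausible reading: a band of the no-fault mean plus or minus 2.576 sample standard deviations (the two-sided 99% point of a normal distribution). All sample values are made up for illustration.

```python
import statistics

Z_99 = 2.576  # two-sided 99% quantile of the standard normal distribution


def normal_band(no_fault_samples):
    """(low, high) band for normal behavior, assuming roughly normal samples."""
    mean = statistics.mean(no_fault_samples)
    half_width = Z_99 * statistics.stdev(no_fault_samples)
    return mean - half_width, mean + half_width


def degraded_intervals(trace, band):
    """Indices of measurement intervals that fall outside the normal band."""
    low, high = band
    return [i for i, value in enumerate(trace) if value < low or value > high]


# Hypothetical hits-per-second samples from no-fault runs and one faulted run.
no_fault = [200, 202, 198, 201, 199]
faulted_run = [200, 199, 170, 175, 190, 201]

band = normal_band(no_fault)
print(band)
print(degraded_intervals(faulted_run, band))
```

The flagged intervals correspond to the stretch of a plot where the QoS metric leaves the normal band after a fault is injected.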
Slide 35
Example results: multiple-faults
[Figure, first panel: Hits per second (140 to 210) vs. Time (2-minute intervals, 0 to 110); annotations: data disk faulted, reconstruction (manual), spare faulted, disks replaced, normal behavior (99% conf)]
[Figure, second panel: Hits per second (140 to 220) vs. Time (2-minute intervals, 0 to 110); annotations: data disk faulted, reconstruction (automatic), spare faulted, reconstruction (automatic), disks replaced, normal behavior (99% conf)]
[Panel labels: Windows2000/IIS, Linux/Apache]
• Windows reconstructs ~3x faster than Linux
• Windows reconstruction noticeably affects application performance, while Linux reconstruction does not
Slide 36
Conclusions (1): ISTORE
• Availability, Maintainability, and Evolutionary growth are key challenges for server systems
– more important even than performance
• ISTORE is investigating ways to bring AME to large-scale, storage-intensive servers
– via clusters of network-attached, computationally-enhanced storage nodes running distributed code
– via hardware and software introspection
– we are currently performing application studies to investigate and compare techniques
• Availability benchmarks a powerful tool?
– revealed undocumented design decisions affecting SW RAID availability on Linux and Windows 2000
Slide 37
Conclusions (2)
• IRAM attractive for two Post-PC applications because of low power, small size, high memory bandwidth
– Gadgets: Embedded/Mobile devices
– Infrastructure: Intelligent Storage and Networks
• PostPC infrastructure requires
– New Goals: Availability, Maintainability, Evolution
– New Principles: Introspection, Performance Robustness
– New Techniques: Isolation/fault insertion, Software scrubbing
– New Benchmarks: measure, compare AME metrics