Index ■ I-3
AMD Barcelona microprocessor, Google WSC server, 467
AMD Fusion, L-52
AMD K-5, L-30
AMD Opteron
  address translation, B-38
  Amazon Web Services, 457
  architecture, 15
  cache coherence, 361
  data cache example, B-12 to B-15, B-13
  Google WSC servers, 468–469
  inclusion, 398
  manufacturing cost, 62
  misses per instruction, B-15
  MOESI protocol, 362
  multicore processor performance, 400–401
  multilevel exclusion, B-35
  NetApp FAS6000 filer, D-42
  paged virtual memory example, B-54 to B-57
  vs. Pentium protection, B-57
  real-world server considerations, 52–55
  server energy savings, 25
  snooping limitations, 363–364
  SPEC benchmarks, 43
  TLB during address translation, B-47
AMD processors
  architecture flaws vs. success, A-45
  GPU computing history, L-52
  power consumption, F-85
  recent advances, L-33
  RISC history, L-22
  shared-memory multiprogramming workload, 378
  terminology, 313–315
  tournament predictors, 164
  Virtual Machines, 110
  VMMs, 129
Amortization of overhead, sorting case study, D-64 to D-67
AMPS, see Advanced mobile phone service (AMPS)
Andreessen, Marc, F-98
Android OS, 324
Annulling delayed branch, instructions, K-25
Antenna, radio receiver, E-23
Antialiasing, address translation, B-38
Antidependences
  compiler history, L-30 to L-31
  definition, 152
  finding, H-7 to H-8
  loop-level parallelism calculations, 320
  MIPS scoreboarding, C-72, C-79
Apogee Software, A-44
Apollo DN 10000, L-30
Apple iPad
  ARM Cortex-A8, 114
  memory hierarchy basics, 78
Application binary interface (ABI), control flow instructions, A-20
Application layer, definition, F-82
Applied Minds, L-74
Arbitration algorithm
  collision detection, F-23
  commercial interconnection networks, F-56
  examples, F-49
  Intel SCCC, F-70
  interconnection networks, F-21 to F-22, F-27, F-49 to F-50
  network impact, F-52 to F-55
  SAN characteristics, F-76
  switched-media networks, F-24
  switch microarchitecture, F-57 to F-58
  switch microarchitecture pipelining, F-60
  system area network history, F-100
Architect-compiler writer relationship, A-29 to A-30
Architecturally visible registers, register renaming vs. ROB, 208–209
Architectural Support for Compilers and Operating Systems (ASPLOS), L-11
Architecture, see also Computer architecture; CUDA (Compute Unified Device Architecture); Instruction set architecture (ISA); Vector architectures
  compiler writer-architect relationship, A-29 to A-30
  definition, 15
  heterogeneous, 262
  microarchitecture, 15–16, 247–254
  stack, A-3, A-27, A-44 to A-45
Areal density, disk storage, D-2
Argument pointer, VAX, K-71
Arithmetic intensity
  as FP operation, 286, 286–288
  Roofline model, 326, 326–327
Arithmetic/logical instructions
  desktop RISCs, K-11, K-22
  embedded RISCs, K-15, K-24
  Intel 80x86, K-49, K-53
  SPARC, K-31
  VAX, B-73
Arithmetic-logical units (ALUs)
  ARM Cortex-A8, 234, 236
  basic MIPS pipeline, C-36
  branch condition evaluation, A-19
  data forwarding, C-40 to C-41
  data hazards requiring stalls, C-19 to C-20
  data hazard stall minimization, C-17 to C-19
  DSP media extensions, E-10
  effective address cycle, C-6
  hardware-based execution, 185
  hardware-based speculation, 200–201, 201
  IA-64 instructions, H-35
  immediate operands, A-12
  integer division, J-54
  integer multiplication, J-48
  integer shifting over zeros, J-45 to J-46
  Intel Core i7, 238
  ISA operands, A-4 to A-5
  ISA performance and efficiency prediction, 241
  load interlocks, C-39
  microarchitectural techniques case study, 253
  MIPS operations, A-35, A-37
  MIPS pipeline control, C-38 to C-39
  MIPS pipeline FP operations, C-52 to C-53
  MIPS R4000, C-65
  operand forwarding, C-19
  operands per instruction example, A-6
  parallelism, 45
Arithmetic-logical units (continued)
  pipeline branch issues, C-39 to C-41
  pipeline execution rate, C-10 to C-11
  power/DLP issues, 322
  RISC architectures, K-5
  RISC classic pipeline, C-7
  RISC instruction set, C-4
  simple MIPS implementation, C-31 to C-33
  TX-2, L-49
ARM (Advanced RISC Machine)
  addressing modes, K-5, K-6
  arithmetic/logical instructions, K-15, K-24
  characteristics, K-4
  condition codes, K-12 to K-13
  constant extension, K-9
  control flow instructions, 14
  data transfer instructions, K-23
  embedded instruction format, K-8
  GPU computing history, L-52
  ISA class, 11
  memory addressing, 11
  multiply-accumulate, K-20
  operands, 12
  RISC instruction set lineage, K-43
  unique instructions, K-36 to K-37
ARM AMBA, OCNs, F-3
ARM Cortex-A8
  dynamic scheduling, 170
  ILP concepts, 148
  instruction decode, 234
  ISA performance and efficiency prediction, 241–243
  memory access penalty, 117
  memory hierarchy design, 78, 114–117, 115
  memory performance, 115–117
  multibanked caches, 86
  overview, 233
  pipeline performance, 233–236, 235
  pipeline structure, 232
  processor comparison, 242
  way prediction, 81
ARM Cortex-A9
  vs. A8 performance, 236
  Tegra 2, mobile vs. server GPUs, 323–324, 324
ARM Thumb
  addressing modes, K-6
  arithmetic/logical instructions, K-24
  characteristics, K-4
  condition codes, K-14
  constant extension, K-9
  data transfer instructions, K-23
  embedded instruction format, K-8
  ISAs, 14
  multiply-accumulate, K-20
  RISC code size, A-23
  unique instructions, K-37 to K-38
ARPA (Advanced Research Project Agency)
  LAN history, F-99 to F-100
  WAN history, F-97
ARPANET, WAN history, F-97 to F-98
Array multiplier
  example, J-50
  integers, J-50
  multipass system, J-51
Arrays
  access age, 91
  blocking, 89–90
  bubble sort procedure, K-76
  cluster server outage/anomaly statistics, 435
  examples, 90
  FFT kernel, I-7
  Google WSC servers, 469
  Layer 3 network linkage, 445
  loop interchange, 88–89
  loop-level parallelism dependences, 318–319
  ocean application, I-9 to I-10
  recurrences, H-12
  WSC memory hierarchy, 445
  WSCs, 443
Array switch, WSCs, 443–444
ASC, see Advanced Simulation and Computing (ASC) program
ASCI, see Accelerated Strategic Computing Initiative (ASCI)
ASCII character format, 12, A-14
ASC Purple, F-67, F-100
ASI, see Advanced Switching Interconnect (ASI)
ASPLOS, see Architectural Support for Compilers and Operating Systems (ASPLOS)
Assembly language, 2
Association of Computing Machinery (ACM), L-3
Associativity, see also Set associativity
  cache block, B-9 to B-10, B-10
  cache optimization, B-22 to B-24, B-26, B-28 to B-30
  cloud computing, 460–461
  loop-level parallelism, 322
  multilevel inclusion, 398
  Opteron data cache, B-14
  shared-memory multiprocessors, 368
Astronautics ZS-1, L-29
Asynchronous events, exception requirements, C-44 to C-45
Asynchronous I/O, storage systems, D-35
Asynchronous Transfer Mode (ATM)
  interconnection networks, F-89
  LAN history, F-99
  packet format, F-75
  total time statistics, F-90
  VOQs, F-60
  as WAN, F-79
  WAN history, F-98
  WANs, F-4
ATA (Advanced Technology Attachment) disks
  Berkeley’s Tertiary Disk project, D-12
  disk storage, D-4
  historical background, L-81
  power, D-5
  RAID 6, D-9
  server energy savings, 25
Atanasoff, John, L-5
Atanasoff Berry Computer (ABC), L-5
ATI Radeon 9700, L-51
Atlas computer, L-9
ATM, see Asynchronous Transfer Mode (ATM)
ATM systems
  server benchmarks, 41
  TP benchmarks, D-18
Atomic exchange
  lock implementation, 389–390
  synchronization, 387–388
Atomic instructions
  barrier synchronization, I-14
  Core i7, 329
  Fermi GPU, 308
  T1 multithreading unicore performance, 229
Atomicity-consistency-isolation-durability (ACID), vs. WSC storage, 439
Atomic operations
  cache coherence, 360–361
  snooping cache coherence implementation, 365
“Atomic swap,” definition, K-20
Attributes field, IA-32 descriptor table, B-52
Autoincrement deferred addressing, VAX, K-67
Autonet, F-48
Availability
  commercial interconnection networks, F-66
  computer architecture, 11, 15
  computer systems, D-43 to D-44, D-44
  data on Internet, 344
  fault detection, 57–58
  I/O system design/evaluation, D-36
  loop-level parallelism, 217–218
  mainstream computing classes, 5
  modules, 34
  open-source software, 457
  RAID systems, 60
  as server characteristic, 7
  servers, 16
  source operands, C-74
  WSCs, 8, 433–435, 438–439
Average instruction execution time, L-6
Average Memory Access Time (AMAT)
  block size calculations, B-26 to B-28
  cache optimizations, B-22, B-26 to B-32, B-36
  cache performance, B-16 to B-21
  calculation, B-16 to B-17
  centralized shared-memory architectures, 351–352
  definition, B-30 to B-31
  memory hierarchy basics, 75–76
  miss penalty reduction, B-32
  via miss rates, B-29, B-29 to B-30
  as processor performance predictor, B-17 to B-20
Average reception factor
  centralized switched networks, F-32
  multi-device interconnection networks, F-26
AVX, see Advanced Vector Extensions (AVX)
AWS, see Amazon Web Services (AWS)
B
Back-off time, shared-media networks, F-23
Backpressure, congestion management, F-65
Backside bus, centralized shared-memory multiprocessors, 351
Balanced systems, sorting case study, D-64 to D-67
Balanced tree, MINs with nonblocking, F-34
Bandwidth, see also Throughput
  arbitration, F-49
  and cache miss, B-2 to B-3
  centralized shared-memory multiprocessors, 351–352
  communication mechanism, I-3
  congestion management, F-64 to F-65
  Cray Research T3D, F-87
  DDR DRAMS and DIMMS, 101
  definition, F-13
  DSM architecture, 379
  Ethernet and bridges, F-78
  FP arithmetic, J-62
  GDRAM, 322–323
  GPU computation, 327–328
  GPU Memory, 327
  ILP instruction fetch
    basic considerations, 202–203
    branch-target buffers, 203–206
    integrated units, 207–208
    return address predictors, 206–207
  interconnection networks, F-28
    multi-device networks, F-25 to F-29
    performance considerations, F-89
    two-device networks, F-12 to F-20
  vs. latency, 18–19, 19
  memory, and vector performance, 332
  memory hierarchy, 126
  network performance and topology, F-41
  OCN history, F-103
  performance milestones, 20
  point-to-point links and switches, D-34
  routing, F-50 to F-52
  routing/arbitration/switching impact, F-52
  shared- vs. switched-media networks, F-22
  SMP limitations, 363
  switched-media networks, F-24
  system area network history, F-101
  vs. TCP/IP reliance, F-95
  and topology, F-39
  vector load/store units, 276–277
  WSC memory hierarchy, 443–444, 444
Bandwidth gap, disk storage, D-3
Banerjee, Utpal, L-30 to L-31
Bank busy time, vector memory systems, G-9
Banked memory, see also Memory banks
  and graphics memory, 322–323
  vector architectures, G-10
Banks, Fermi GPUs, 297
Barcelona Supercomputer Center, F-76
Barnes
  characteristics, I-8 to I-9
  distributed-memory multiprocessor, I-32
  symmetric shared-memory multiprocessors, I-22, I-23, I-25
Barnes-Hut n-body algorithm, basic concept, I-8 to I-9
Barriers
  commercial workloads, 370
  Cray X1, G-23
  fetch-and-increment, I-20 to I-21
  hardware primitives, 387
  large-scale multiprocessor synchronization, I-13 to I-16, I-14, I-16, I-19, I-20
  synchronization, 298, 313, 329
BARRNet, see Bay Area Research Network (BARRNet)
Based indexed addressing mode, Intel 80x86, K-49, K-58
Base field, IA-32 descriptor table, B-52 to B-53
Base station
  cell phones, E-23
  wireless networks, E-22
Basic block, ILP, 149
Batch processing workloads
  WSC goals/requirements, 433
  WSC MapReduce and Hadoop, 437–438
Bay Area Research Network (BARRNet), F-80
BBN Butterfly, L-60
BBN Monarch, L-60
Before rounding rule, J-36
Benchmarking, see also specific benchmark suites
  desktop, 38–40
  EEMBC, E-12
  embedded applications
    basic considerations, E-12
    power consumption and efficiency, E-13
  fallacies, 56
  instruction set operations, A-15
  as performance measurement, 37–41
  real-world server considerations, 52–55
  response time restrictions, D-18
  server performance, 40–41
  sorting case study, D-64 to D-67
Benes topology
  centralized switched networks, F-33
  example, F-33
BER, see Bit error rate (BER)
Berkeley’s Tertiary Disk project
  failure statistics, D-13
  overview, D-12
  system log, D-43
Berners-Lee, Tim, F-98
Bertram, Jack, L-28
Best-case lower bounds, multi-device interconnection networks, F-25
Best-case upper bounds
  multi-device interconnection networks, F-26
  network performance and topology, F-41
Between instruction exceptions, definition, C-45
Biased exponent, J-15
Bidirectional multistage interconnection networks
  Benes topology, F-33
  characteristics, F-33 to F-34
  SAN characteristics, F-76
Bidirectional rings, topology, F-35 to F-36
Big Endian
  interconnection networks, F-12
  memory address interpretation, A-7
  MIPS core extensions, K-20 to K-21
  MIPS data transfers, A-34
Bigtable (Google), 438, 441
BINAC, L-5
Binary code compatibility
  embedded systems, E-15
  VLIW processors, 196
Binary-coded decimal, definition, A-14
Binary-to-decimal conversion, FP precisions, J-34
Bing search
  delays and user behavior, 451
  latency effects, 450–452
  WSC processor cost-performance, 473
Bisection bandwidth
  as network cost constraint, F-89
  network performance and topology, F-41
  NEWS communication, F-42
  topology, F-39
Bisection bandwidth, WSC array switch, 443
Bisection traffic fraction, network performance and topology, F-41
Bit error rate (BER), wireless networks, E-21
Bit rot, case study, D-61 to D-64
Bit selection, block placement, B-7
Black box network
  basic concept, F-5 to F-6
  effective bandwidth, F-17
  performance, F-12
  switched-media networks, F-24
  switched network topologies, F-40
Block addressing
  block identification, B-7 to B-8
  interleaved cache banks, 86
  memory hierarchy basics, 74
Blocked floating point arithmetic, DSP, E-6
Block identification
  memory hierarchy considerations, B-7 to B-9
  virtual memory, B-44 to B-45
Blocking
  benchmark fallacies, 56
  centralized switched networks, F-32
  direct networks, F-38
  HOL, see Head-of-line (HOL) blocking
  network performance and topology, F-41
Blocking calls, shared-memory multiprocessor workload, 369
Blocking factor, definition, 90
Block multithreading, definition, L-34
Block offset
  block identification, B-7 to B-8
  cache optimization, B-38
  definition, B-7 to B-8
  direct-mapped cache, B-9
  example, B-9
  main memory, B-44
  Opteron data cache, B-13, B-13 to B-14
Block placement
  memory hierarchy considerations, B-7
  virtual memory, B-44
Block replacement
  memory hierarchy considerations, B-9 to B-10
  virtual memory, B-45
Blocks, see also Cache block; Thread Block
  ARM Cortex-A8, 115
  vs. bytes per reference, 378
  compiler optimizations, 89–90
  definition, B-2
  disk array deconstruction, D-51, D-55
  disk deconstruction case study, D-48 to D-51
  global code scheduling, H-15 to H-16
  L3 cache size, misses per instruction, 371
  LU kernel, I-8
  memory hierarchy basics, 74
  memory in cache, B-61
  placement in main memory, B-44
  RAID performance prediction, D-57 to D-58
  TI TMS320C55 DSP, E-8
  uncached state, 384
Block servers, vs. filers, D-34 to D-35
Block size
  vs. access time, B-28
  memory hierarchy basics, 76
  vs. miss rate, B-27
Block transfer engine (BLT)
  Cray Research T3D, F-87
  interconnection network protection, F-87
BLT, see Block transfer engine (BLT)
Body of Vectorized Loop
  definition, 292, 313
  GPU hardware, 295–296, 311
  GPU Memory structure, 304
  NVIDIA GPU, 296
  SIMD Lane Registers, 314
  Thread Block Scheduler, 314
Boggs, David, F-99
BOMB, L-4
Booth recoding, J-8 to J-9, J-9, J-10 to J-11
  chip comparison, J-60 to J-61
  integer multiplication, J-49
Bose-Einstein formula, definition, 30
Bounds checking, segmented virtual memory, B-52
Branch byte, VAX, K-71
Branch delay slot
  characteristics, C-23 to C-25
  control hazards, C-41
  MIPS R4000, C-64
  scheduling, C-24
Branches
  canceling, C-24 to C-25
  conditional branches, 300–303, A-17, A-19 to A-20, A-21
  control flow instructions, A-16, A-18
  delayed, C-23
  delay slot, C-65
  IBM 360, K-86 to K-87
  instructions, K-25
  MIPS control flow instructions, A-38
  MIPS operations, A-35
  nullifying, C-24 to C-25
  RISC instruction set, C-5
  VAX, K-71 to K-72
  WCET, E-4
Branch folding, definition, 206
Branch hazards
  basic considerations, C-21
  penalty reduction, C-22 to C-25
  pipeline issues, C-39 to C-42
  scheme performance, C-25 to C-26
  stall reduction, C-42
Branch history table, basic scheme, C-27 to C-30
Branch offsets, control flow instructions, A-18
Branch penalty
  examples, 205
  instruction fetch bandwidth, 203–206
  reduction, C-22 to C-25
  simple scheme examples, C-25
Branch prediction
  accuracy, C-30
  branch cost reduction, 162–167
  correlation, 162–164
  cost reduction, C-26
  dynamic, C-27 to C-30
  early schemes, L-27 to L-28
  ideal processor, 214
  ILP exploitation, 201
  instruction fetch bandwidth, 205
  integrated instruction fetch units, 207
  Intel Core i7, 166–167, 239–241
  misprediction rates on SPEC89, 166
  static, C-26 to C-27
  trace scheduling, H-19
  two-bit predictor comparison, 165
Branch-prediction buffers, basic considerations, C-27 to C-30, C-29
Branch registers
  IA-64, H-34
  PowerPC instructions, K-32 to K-33
Branch stalls, MIPS R4000 pipeline, C-67
Branch-target address
  branch hazards, C-42
  MIPS control flow instructions, A-38
  MIPS pipeline, C-36, C-37
  MIPS R4000, C-25
  pipeline branches, C-39
  RISC instruction set, C-5
Branch-target buffers
  ARM Cortex-A8, 233
  branch hazard stalls, C-42
  example, 203
  instruction fetch bandwidth, 203–206
  instruction handling, 204
  MIPS control flow instructions, A-38
Branch-target cache, see Branch-target buffers
Brewer, Eric, L-73
Bridges
  and bandwidth, F-78
  definition, F-78
Bubbles
  and deadlock, F-47
  routing comparison, F-54
  stall as, C-13
Bubble sort, code example, K-76
Buckets, D-26
Buffered crossbar switch, switch microarchitecture, F-62
Buffered wormhole switching, F-51
Buffers
  branch-prediction, C-27 to C-30, C-29
  branch-target, 203–206, 204, 233, A-38, C-42
  DSM multiprocessor cache coherence, I-38 to I-40
  Intel SCCC, F-70
  interconnection networks, F-10 to F-11
  memory, 208
  MIPS scoreboarding, C-74
  network interface functions, F-7
  ROB, 184–192, 188–189, 199, 208–210, 238
  switch microarchitecture, F-58 to F-60
  TLB, see Translation lookaside buffer (TLB)
  translation buffer, B-45 to B-46
  write buffer, B-11, B-14, B-32, B-35 to B-36
Bundles
  IA-64, H-34 to H-35, H-37
  Itanium 2, H-41
Burks, Arthur, L-3
Burroughs B5000, L-16
Bus-based coherent multiprocessors, L-59 to L-60
Buses
  barrier synchronization, I-16
  cache coherence, 391
  centralized shared-memory multiprocessors, 351
  definition, 351
  dynamic scheduling with Tomasulo’s algorithm, 172, 175
  Google WSC servers, 469
  I/O bus replacements, D-34, D-34
  large-scale multiprocessor synchronization, I-12 to I-13
  NEWS communication, F-42
  scientific workloads on symmetric shared-memory multiprocessors, I-25
  Sony PlayStation 2 Emotion Engine, E-18
  vs. switched networks, F-2
  switch microarchitecture, F-55 to F-56
  Tomasulo’s algorithm, 180, 182
Bypassing, see also Forwarding
  data hazards requiring stalls, C-19 to C-20
  dynamically scheduled pipelines, C-70 to C-71
  MIPS R4000, C-65
  SAN example, F-74
Byte displacement addressing, VAX, K-67
Byte offset
  misaligned addresses, A-8
  PTX instructions, 300
Bytes
  aligned/misaligned addresses, A-8
  arithmetic intensity example, 286
  Intel 80x86 integer operations, K-51
  memory address interpretation, A-7 to A-8
  MIPS data transfers, A-34
  MIPS data types, A-34
  operand types/sizes, A-14
  per reference, vs. block size, 378
Byte/word/long displacement deferred addressing, VAX, K-67
C
CAD, see Computer aided design (CAD) tools
Cache bandwidth
  caches, 78
  multibanked caches, 85–86
  nonblocking caches, 83–85
  pipelined cache access, 82
Cache block
  AMD Opteron data cache, B-13, B-13 to B-14
  cache coherence protocol, 357–358
  compiler optimizations, 89–90
  critical word first, 86–87
  definition, B-2
  directory-based cache coherence protocol, 382–386, 383
  false sharing, 366
  GPU comparisons, 329
  inclusion, 397–398
  memory block, B-61
  miss categories, B-26
  miss rate reduction, B-26 to B-28
  scientific workloads on symmetric shared-memory multiprocessors, I-22, I-25, I-25
  shared-memory multiprogramming workload, 375–377, 376
  way prediction, 81
  write invalidate protocol implementation, 356–357
  write strategy, B-10
Cache coherence
  advanced directory protocol case study, 420–426
  basic considerations, 112–113
  Cray X1, G-22
  directory-based, see Directory-based cache coherence
  enforcement, 354–355
  extensions, 362–363
  hardware primitives, 388
  Intel SCCC, F-70
  large-scale multiprocessor history, L-61
  large-scale multiprocessors
    deadlock and buffering, I-38 to I-40
    directory controller, I-40 to I-41
    DSM implementation, I-36 to I-37
    overview, I-34 to I-36
  latency hiding with speculation, 396
  lock implementation, 389–391
  mechanism, 358
  memory hierarchy basics, 75
  multiprocessor-optimized software, 409
  multiprocessors, 352–353
  protocol definitions, 354–355
  single-chip multicore processor case study, 412–418
  single memory location example, 352
  snooping, see Snooping cache coherence
  state diagram, 361
  steps and bus traffic examples, 391
  write-back cache, 360
Cache definition, B-2
Cache hit
  AMD Opteron example, B-14
  definition, B-2
  example calculation, B-5
Cache latency, nonblocking cache, 83–84
Cache miss
  and average memory access time, B-17 to B-20
  block replacement, B-10
  definition, B-2
  distributed-memory multiprocessors, I-32
  example calculations, 83–84
  Intel Core i7, 122
  interconnection network, F-87
  large-scale multiprocessors, I-34 to I-35
  nonblocking cache, 84
  single vs. multiple thread executions, 228
  WCET, E-4
Cache-only memory architecture (COMA), L-61
Cache optimizations
  basic categories, B-22
  basic optimizations, B-40
  case studies, 131–133
  compiler-controlled prefetching, 92–95
  compiler optimizations, 87–90
  critical word first, 86–87
  energy consumption, 81
  hardware instruction prefetching, 91–92, 92
  hit time reduction, B-36 to B-40
  miss categories, B-23 to B-26
  miss penalty reduction
    via multilevel caches, B-30 to B-35
    read misses vs. writes, B-35 to B-36
  miss rate reduction
    via associativity, B-28 to B-30
    via block size, B-26 to B-28
    via cache size, B-28
  multibanked caches, 85–86, 86
  nonblocking caches, 83–85, 84
  overview, 78–79
  pipelined cache access, 82
  simple first-level caches, 79–80
  techniques overview, 96
  way prediction, 81–82
  write buffer merging, 87, 88
Cache organization
  blocks, B-7, B-8
  Opteron data cache, B-12 to B-13, B-13
  optimization, B-19
  performance impact, B-19
Cache performance
  average memory access time, B-16 to B-20
  basic considerations, B-3 to B-6, B-16
  basic equations, B-22
  basic optimizations, B-40
  cache optimization, 96
  case study, 131–133
  example calculation, B-16 to B-17
  out-of-order processors, B-20 to B-22
  prediction, 125–126
Cache prefetch, cache optimization, 92
Caches, see also Memory hierarchy
  access time vs. block size, B-28
  AMD Opteron example, B-12 to B-15, B-13, B-15
  basic considerations, B-48 to B-49
  coining of term, L-11
  definition, B-2
  early work, L-10
  embedded systems, E-4 to E-5
  Fermi GPU architecture, 306
  ideal processor, 214
  ILP for realizable processors, 216–218
  Itanium 2, H-42
  multichip multicore multiprocessor, 419
  parameter ranges, B-42
  Sony PlayStation 2 Emotion Engine, E-18
  vector processors, G-25
  vs. virtual memory, B-42 to B-43
Cache size
  and access time, 77
  AMD Opteron example, B-13 to B-14
  energy consumption, 81
  highly parallel memory systems, 133
  memory hierarchy basics, 76
  misses per instruction, 126, 371
  miss rate, B-24 to B-25
  vs. miss rate, B-27
  miss rate reduction, B-28
  multilevel caches, B-33
  and relative execution time, B-34
  scientific workloads
    distributed-memory multiprocessors, I-29 to I-31
    symmetric shared-memory multiprocessors, I-22 to I-23, I-24
  shared-memory multiprogramming workload, 376
  virtually addressed, B-37
CACTI
  cache optimization, 79–80, 81
  memory access times, 77
Caller saving, control flow instructions, A-19 to A-20
Call gate
  IA-32 segment descriptors, B-53
  segmented virtual memory, B-54
Calls
  compiler structure, A-25 to A-26
  control flow instructions, A-17, A-19 to A-21
  CUDA Thread, 297
  dependence analysis, 321
  high-level instruction set, A-42 to A-43
  Intel 80x86 integer operations, K-51
  invocation options, A-19
  ISAs, 14
  MIPS control flow instructions, A-38
  MIPS registers, 12
  multiprogrammed workload, 378
  NVIDIA GPU Memory structures, 304–305
  return address predictors, 206
  shared-memory multiprocessor workload, 369
  user-to-OS gates, B-54
  VAX, K-71 to K-72
Canceling branch, branch delay slots, C-24 to C-25
Canonical form, AMD64 paged virtual memory, B-55
Capabilities, protection schemes, L-9 to L-10
Capacity misses
  blocking, 89–90
  and cache size, B-24
  definition, B-23
  memory hierarchy basics, 75
  scientific workloads on symmetric shared-memory multiprocessors, I-22, I-23, I-24
  shared-memory workload, 373
CAPEX, see Capital expenditures (CAPEX)
Capital expenditures (CAPEX)
  WSC costs, 452–455, 453
  WSC Flash memory, 475
  WSC TCO case study, 476–478
Carrier sensing, shared-media networks, F-23
Carrier signal, wireless networks, E-21
Carry condition code, MIPS core, K-9 to K-16
Carry-in, carry-skip adder, J-42
Carry-lookahead adder (CLA)
  chip comparison, J-60
  early computer arithmetic, J-63
  example, J-38
  integer addition speedup, J-37 to J-41
  with ripple-carry adder, J-42
  tree, J-40 to J-41
Carry-out
  carry-lookahead circuit, J-38
  floating-point addition speedup, J-25
Carry-propagate adder (CPA)
  integer multiplication, J-48, J-51
  multipass array multiplier, J-51
Carry-save adder (CSA)
  integer division, J-54 to J-55
  integer multiplication, J-47 to J-48, J-48
Carry-select adder
  characteristics, J-43 to J-44
  chip comparison, J-60
  example, J-43
Carry-skip adder (CSA)
  characteristics, J-41 to J-43
  example, J-42, J-44
CAS, see Column access strobe (CAS)
Case statements
  control flow instruction addressing modes, A-18
  return address predictors, 206
Case studies
  advanced directory protocol, 420–426
  cache optimization, 131–133
  cell phones
    block diagram, E-23
    Nokia circuit board, E-24
    overview, E-20
    radio receiver, E-23
    standards and evolution, E-25
    wireless communication challenges, E-21
    wireless networks, E-21 to E-22
  chip fabrication cost, 61–62
  computer system power consumption, 63–64
  directory-based coherence, 418–420
  dirty bits, D-61 to D-64
  disk array deconstruction, D-51 to D-55, D-52 to D-55
  disk deconstruction, D-48 to D-51, D-50
  highly parallel memory systems, 133–136
  instruction set principles, A-47 to A-54
  I/O subsystem design, D-59 to D-61
  memory hierarchy, B-60 to B-67
  microarchitectural techniques, 247–254
  pipelining example, C-82 to C-88
  RAID performance prediction, D-57 to D-59
  RAID reconstruction, D-55 to D-57
  Sanyo VPC-SX500 digital camera, E-19
  single-chip multicore processor, 412–418
  Sony PlayStation 2 Emotion Engine, E-15 to E-18
  sorting, D-64 to D-67
  vector kernel on vector processor and GPU, 334–336
  WSC resource allocation, 478–479
  WSC TCO, 476–478
CCD, see Charge-coupled device (CCD)
C/C++ language
  dependence analysis, H-6
  GPU computing history, L-52
  hardware impact on software development, 4
  integer division/remainder, J-12
  loop-level parallelism dependences, 318, 320–321
  NVIDIA GPU programming, 289
  return address predictors, 206
CDB, see Common data bus (CDB)
CDC, see Control Data Corporation (CDC)
CDF, datacenter, 487
CDMA, see Code division multiple access (CDMA)
Cedar project, L-60
Cell, Barnes-Hut n-body algorithm, I-9
Cell phones
  block diagram, E-23
  embedded system case study
    characteristics, E-22 to E-24
    overview, E-20
    radio receiver, E-23
    standards and evolution, E-25
    wireless network overview, E-21 to E-22
  Flash memory, D-3
  GPU features, 324
  Nokia circuit board, E-24
  wireless communication challenges, E-21
  wireless networks, E-22
Centralized shared-memory multiprocessors
  basic considerations, 351–352
  basic structure, 346–347, 347
  cache coherence, 352–353
  cache coherence enforcement, 354–355
  cache coherence example, 357–362
  cache coherence extensions, 362–363
  invalidate protocol implementation, 356–357
  SMP and snooping limitations, 363–364
  snooping coherence implementation, 365–366
  snooping coherence protocols, 355–356
Centralized switched networks
  example, F-31
  routing algorithms, F-48
  topology, F-30 to F-34, F-31
Centrally buffered switch, microarchitecture, F-57
Central processing unit (CPU)
  Amdahl’s law, 48
  average memory access time, B-17
  cache performance, B-4
  coarse-grained multithreading, 224
  early pipelined versions, L-26 to L-27
  exception stopping/restarting, C-47
  extensive pipelining, C-81
  Google server usage, 440
  GPU computing history, L-52
  vs. GPUs, 288
  instruction set complications, C-50
  MIPS implementation, C-33 to C-34
  MIPS precise exceptions, C-59 to C-60
  MIPS scoreboarding, C-77
  performance measurement history, L-6
  pipeline branch issues, C-41
  pipelining exceptions, C-43 to C-46
  pipelining performance, C-10
  Sony PlayStation 2 Emotion Engine, E-17
  SPEC server benchmarks, 40
  TI TMS320C55 DSP, E-8
  vector memory systems, G-10
Central processing unit (CPU) time
  execution time, 36
  modeling, B-18
  processor performance calculations, B-19 to B-21
  processor performance equation, 49–51
  processor performance time, 49
Cerf, Vint, F-97
CERN, see European Center for Particle Research (CERN)
CFM, see Current frame pointer (CFM)
Chaining
  convoys, DAXPY code, G-16
  vector processor performance, G-11 to G-12, G-12
  VMIPS, 268–269
Channel adapter, see Network interface
Channels, cell phones, E-24
Character
  floating-point performance, A-2
  as operand type, A-13 to A-14
  operand types/sizes, 12
Charge-coupled device (CCD), Sanyo VPC-SX500 digital camera, E-19
Checksum
  dirty bits, D-61 to D-64
  packet format, F-7
Chillers
  Google WSC, 466, 468
  WSC containers, 464
  WSC cooling systems, 448–449
Chime
  definition, 309
  GPUs vs. vector architectures, 308
  multiple lanes, 272
  NVIDIA GPU computational structures, 296
  vector chaining, G-12
  vector execution time, 269, G-4
  vector performance, G-2
  vector sequence calculations, 270
Chip-crossing wire delay, F-70
  OCN history, F-103
Chipkill
  memory dependability, 104–105
  WSCs, 473
Choke packets, congestion management, F-65
Chunk
  disk array deconstruction, D-51
  Shear algorithm, D-53
CIFS, see Common Internet File System (CIFS)
Circuit switching
  congestion management, F-64 to F-65
  interconnected networks, F-50
Circulating water system (CWS)
  cooling system design, 448
  WSCs, 448
CISC, see Complex Instruction Set Computer (CISC)
CLA, see Carry-lookahead adder (CLA)
Clean block, definition, B-11
Climate Savers Computing Initiative, power supply efficiencies, 462
Clock cycles
  basic MIPS pipeline, C-34 to C-35
  and branch penalties, 205
  cache performance, B-4
  FP pipeline, C-66
  and full associativity, B-23
  GPU conditional branching, 303
  ILP exploitation, 197, 200
  ILP exposure, 157
  instruction fetch bandwidth, 202–203
  instruction steps, 173–175
  Intel Core i7 branch predictor, 166
  MIPS exceptions, C-48
  MIPS pipeline, C-52
  MIPS pipeline FP operations, C-52 to C-53
  MIPS scoreboarding, C-77
  miss rate calculations, B-31 to B-32
  multithreading approaches, 225–226
  pipelining performance, C-10
  processor performance equation, 49
  RISC classic pipeline, C-7
  Sun T1 multithreading, 226–227
  switch microarchitecture pipelining, F-61
  vector architectures, G-4
  vector execution time, 269
  vector multiple lanes, 271–273
  VLIW processors, 195
Clock cycles per instruction (CPI)
  addressing modes, A-10
  ARM Cortex-A8, 235
  branch schemes, C-25 to C-26, C-26
  cache behavior impact, B-18 to B-19
  cache hit calculation, B-5
  data hazards requiring stalls, C-20
Clock cycles per instruction (continued)
  extensive pipelining, C-81
  floating-point calculations, 50–52
  ILP concepts, 148–149, 149
  ILP exploitation, 192
  Intel Core i7, 124, 240, 240–241
  microprocessor advances, L-33
  MIPS R4000 performance, C-69
  miss penalty reduction, B-32
  multiprocessing/multithreading-based performance, 398–400
  multiprocessor communication calculations, 350
  pipeline branch issues, C-41
  pipeline with stalls, C-12 to C-13
  pipeline structural hazards, C-15 to C-16
  pipelining concept, C-3
  processor performance calculations, 218–219
  processor performance time, 49–51
  and processor speed, 244
  RISC history, L-21
  shared-memory workloads, 369–370
  simple MIPS implementation, C-33 to C-34
  structural hazards, C-13
  Sun T1 multithreading unicore performance, 229
  Sun T1 processor, 399
  Tomasulo’s algorithm, 181
  VAX 8700 vs. MIPS M2000, K-82
Clock cycle time
  and associativity, B-29
  average memory access time, B-21 to B-22
  cache optimization, B-19 to B-20, B-30
  cache performance, B-4
  CPU time equation, 49–50, B-18
  MIPS implementation, C-34
  miss penalties, 219
  pipeline performance, C-12, C-14 to C-15
  pipelining, C-3
  shared- vs. switched-media networks, F-25
Clock periods, processor performance equation, 48–49
Clock rate
  DDR DRAMS and DIMMS, 101
  ILP for realizable processors, 218
  Intel Core i7, 236–237
  microprocessor advances, L-33
  microprocessors, 24
  MIPS pipeline FP operations, C-53
  multicore processor performance, 400
  and processor speed, 244
Clocks, processor performance equation, 48–49
Clock skew, pipelining performance, C-10
Clock ticks
  cache coherence, 391
  processor performance equation, 48–49
Clos network
  Benes topology, F-33
  as nonblocking, F-33
Cloud computing
  basic considerations, 455–461
  clusters, 345
  provider issues, 471–472
  utility computing history, L-73 to L-74
Clusters
  characteristics, 8, I-45
  cloud computing, 345
  as computer class, 5
  containers, L-74 to L-75
  Cray X1, G-22
  Google WSC servers, 469
  historical background, L-62 to L-64
  IBM Blue Gene/L, I-41 to I-44, I-43 to I-44
  interconnection network domains, F-3 to F-4
  Internet Archive Cluster, see Internet Archive Cluster
  large-scale multiprocessors, I-6
  large-scale multiprocessor trends, L-62 to L-63
  outage/anomaly statistics, 435
  power consumption, F-85
  utility computing, L-73 to L-74
  as WSC forerunners, 435–436, L-72 to L-73
  WSC storage, 442–443
Cm*, L-56C.mmp, L-56
CMOSDRAM, 99first vector computers, L-46, L-48ripple-carry adder, J-3vector processors, G-25 to G-27
Coarse-grained multithreading, definition, 224–226
Cocke, John, L-19, L-28
Code division multiple access (CDMA), cell phones, E-25
Code generation
  compiler structure, A-25 to A-26, A-30
  dependences, 220
  general-purpose register computers, A-6
  ILP limitation studies, 220
  loop unrolling/scheduling, 162
Code scheduling
  example, H-16
  parallelism, H-15 to H-23
  superblock scheduling, H-21 to H-23, H-22
  trace scheduling, H-19 to H-21, H-20
Code size
  architect-compiler considerations, A-30
  benchmark information, A-2
  comparisons, A-44
  flawless architecture design, A-45
  instruction set encoding, A-22 to A-23
  ISA and compiler technology, A-43 to A-44
  loop unrolling, 160–161
  multiprogramming, 375–376
  PMDs, 6
  RISCs, A-23 to A-24
  VAX design, A-45
  VLIW model, 195–196
Coefficient of variance, D-27
Coerced exceptions
  definition, C-45
  exception types, C-46
Coherence, see Cache coherence
Coherence misses
  definition, 366
  multiprogramming, 376–377
  role, 367
  scientific workloads on symmetric shared-memory multiprocessors, I-22
  snooping protocols, 355–356
Cold-start misses, definition, B-23
Collision, shared-media networks, F-23
Collision detection, shared-media networks, F-23
Collision misses, definition, B-23
Collocation sites, interconnection networks, F-85
COLOSSUS, L-4
Column access strobe (CAS), DRAM, 98–99
Column major order
  blocking, 89
  stride, 278
COMA, see Cache-only memory architecture (COMA)
Combining tree, large-scale multiprocessor synchronization, I-18
Command queue depth, vs. disk throughput, D-4
Commercial interconnection networkscongestion management, F-64 to
F-66connectivity, F-62 to F-63cross-company interoperability,
F-63 to F-64DECstation 5000 reboots, F-69fault tolerance, F-66 to F-69
Commercial workloadsexecution time distribution, 369symmetric shared-memory
multiprocessors, 367–374
Commit stage, ROB instruction, 186–187, 188
CommoditiesAmazon Web Services, 456–457array switch, 443cloud computing, 455cost vs. price, 32–33cost trends, 27–28, 32Ethernet rack switch, 442HPC hardware, 436shared-memory multiprocessor,
441WSCs, 441
Commodity cluster, characteristics, I-45
Common data bus (CDB)dynamic scheduling with
Tomasulo’s algorithm, 172, 175
FP unit with Tomasulo’s algorithm, 185
reservation stations/register tags, 177
Tomasulo’s algorithm, 180, 182Common Internet File System (CIFS),
D-35NetApp FAS6000 filer, D-41 to
D-42Communication bandwidth, basic
considerations, I-3Communication latency, basic
considerations, I-3 to I-4Communication latency hiding, basic
considerations, I-4Communication mechanism
adaptive routing, F-93 to F-94internetworking, F-81 to F-82large-scale multiprocessors
advantages, I-4 to I-6metrics, I-3 to I-4
multiprocessor communication calculations, 350
network interfaces, F-7 to F-8NEWS communication, F-42 to
F-43SMP limitations, 363
Communication protocol, definition, F-8
Communication subnets, see Interconnection networks
Communication subsystems, see Interconnection networks
Compare instruction, VAX, K-71Compares, MIPS core, K-9 to K-16Compare-select-store unit (CSSU), TI
TMS320C55 DSP, E-8Compiler-controlled prefetching, miss
penalty/rate reduction, 92–95
Compiler optimizationsblocking, 89–90cache optimization, 131–133compiler assumptions, A-25 to
A-26and consistency model, 396loop interchange, 88–89miss rate reduction, 87–90passes, A-25performance impact, A-27
types and classes, A-28Compiler scheduling
data dependences, 151definition, C-71hardware support, L-30 to L-31IBM 360 architecture, 171
Compiler speculation, hardware supportmemory references, H-32overview, H-27preserving exception behavior,
H-28 to H-32Compiler techniques
dependence analysis, H-7global code scheduling, H-17 to
H-18ILP exposure, 156–162vectorization, G-14vector sparse matrices, G-12
Compiler technologyand architecture decisions, A-27 to
A-29Cray X1, G-21 to G-22ISA and code size, A-43 to A-44multimedia instruction support,
A-31 to A-32register allocation, A-26 to A-27structure, A-24 to A-26, A-25
Compiler writer-architect relationship, A-29 to A-30
Complex Instruction Set Computer (CISC)
RISC history, L-22VAX as, K-65
Compulsory missesand cache size, B-24definition, B-23memory hierarchy basics, 75shared-memory workload, 373
Computation-to-communication ratiosparallel programs, I-10 to I-12scaling, I-11
Compute-optimized processors, interconnection networks, F-88
Computer aided design (CAD) tools, cache optimization, 79–80
Computer architecture, see also Architecture
coining of term, K-83 to K-84computer design innovations, 4defining, 11
  definition, L-17 to L-18
  exceptions, C-44
  factors in improvement, 2
  flawless design, K-81
  flaws and success, K-81
  floating-point addition, rules, J-24
  goals/functions requirements, 15, 15–16, 16
  high-level language, L-18 to L-19
  instruction execution issues, K-81
  ISA, 11–15
  multiprocessor software development, 407–409
  parallel, 9–10
  WSC basics, 432, 441–442
    array switch, 443
    memory hierarchy, 443–446
    storage, 442–443
Computer arithmetic
  chip comparison, J-58, J-58 to J-61, J-59 to J-60
  floating point
    exceptions, J-34 to J-35
    fused multiply-add, J-32 to J-33
    IEEE 754, J-16
    iterative division, J-27 to J-31
    and memory bandwidth, J-62
    overview, J-13 to J-14
    precisions, J-33 to J-34
    remainder, J-31 to J-32
    special values, J-16
    special values and denormals, J-14 to J-15
    underflow, J-36 to J-37, J-62
  floating-point addition
    denormals, J-26 to J-27
    overview, J-21 to J-25
    speedup, J-25 to J-26
  floating-point multiplication
    denormals, J-20 to J-21
    examples, J-19
    overview, J-17 to J-20
    rounding, J-18
  integer addition speedup
    carry-lookahead, J-37 to J-41
    carry-lookahead circuit, J-38
    carry-lookahead tree, J-40
    carry-lookahead tree adder, J-41
    carry-select adder, J-43, J-43 to J-44, J-44
    carry-skip adder, J-41 to J-43, J-42
    overview, J-37
  integer arithmetic
    language comparison, J-12
    overflow, J-11
    Radix-2 multiplication/division, J-4, J-4 to J-7
    restoring/nonrestoring division, J-6
    ripple-carry addition, J-2 to J-3, J-3
    signed numbers, J-7 to J-10
    systems issues, J-10 to J-13
  integer division
    radix-2 division, J-55
    radix-4 division, J-56
    radix-4 SRT division, J-57
    with single adder, J-54 to J-58
    SRT division, J-45 to J-47, J-46
  integer-FP conversions, J-62
  integer multiplication
    array multiplier, J-50
    Booth recoding, J-49
    even/odd array, J-52
    with many adders, J-50 to J-54
    multipass array multiplier, J-51
    signed-digit addition table, J-54
    with single adder, J-47 to J-49, J-48
    Wallace tree, J-53
  integer multiplication/division, shifting over zeros, J-45 to J-47
  overview, J-2
  rounding modes, J-20
Computer chip fabricationcost case study, 61–62Cray X1E, G-24
Computer classesdesktops, 6embedded computers, 8–9example, 5overview, 5parallelism and parallel
architectures, 9–10PMDs, 6servers, 7and system characteristics, E-4warehouse-scale computers, 8
Computer design principlesAmdahl’s law, 46–48common case, 45–46parallelism, 44–45principle of locality, 45processor performance equation,
48–52Computer history, technology and
architecture, 2–5Computer room air-conditioning
(CRAC), WSC infrastructure, 448–449
Compute tiles, OCNs, F-3Compute Unified Device Architecture,
see CUDA (Compute Unified Device Architecture)
Conditional branchesbranch folding, 206compare frequencies, A-20compiler performance, C-24 to
C-25control flow instructions, 14, A-16,
A-17, A-19, A-21desktop RISCs, K-17embedded RISCs, K-17evaluation, A-19global code scheduling, H-16, H-16GPUs, 300–303ideal processor, 214ISAs, A-46MIPS control flow instructions,
A-38, A-40MIPS core, K-9 to K-16PA-RISC instructions, K-34, K-34predictor misprediction rates, 166PTX instruction set, 298–299static branch prediction, C-26types, A-20vector-GPU comparison, 311
Conditional instructionsexposing parallelism, H-23 to H-27limitations, H-26 to H-27
Condition codesbranch conditions, A-19control flow instructions, 14definition, C-5high-level instruction set, A-43instruction set complications, C-50MIPS core, K-9 to K-16pipeline branch penalties, C-23VAX, K-71
Conflict missesand block size, B-28cache coherence mechanism, 358and cache size, B-24, B-26definition, B-23as kernel miss, 376L3 caches, 371memory hierarchy basics, 75OLTP workload, 370PIDs, B-37shared-memory workload, 373
Congestion controlcommercial interconnection
networks, F-64system area network history, F-101
Congestion management, commercial interconnection networks, F-64 to F-66
Connectednessdimension-order routing, F-47 to
F-48interconnection network topology,
F-29Connection delay, multi-device
interconnection networks, F-25
Connection Machine CM-5, F-91, F-100
Connection Multiprocessor 2, L-44, L-57
Consistency, see Memory consistencyConstant extension
desktop RISCs, K-9embedded RISCs, K-9
Constellation, characteristics, I-45Containers
airflow, 466cluster history, L-74 to L-75Google WSCs, 464–465, 465
Context switching
  definition, 106, B-49
  Fermi GPU, 307
Control bits, messages, F-6Control Data Corporation (CDC), first
vector computers, L-44 to L-45
Control Data Corporation (CDC) 6600computer architecture definition,
L-18dynamically scheduling with
scoreboard, C-71 to C-72
early computer arithmetic, J-64first dynamic scheduling, L-27MIPS scoreboarding, C-75, C-77multiple-issue processor
development, L-28multithreading history, L-34RISC history, L-19
Control Data Corporation (CDC) STAR-100
first vector computers, L-44peak performance vs. start-up
overhead, 331Control Data Corporation (CDC)
STAR processor, G-26Control dependences
conditional instructions, H-24as data dependence, 150global code scheduling, H-16hardware-based speculation,
183ILP, 154–156ILP hardware model, 214and Tomasulo’s algorithm, 170vector mask registers, 275–276
Control flow instructionsaddressing modes, A-17 to A-18basic considerations, A-16 to
A-17, A-20 to A-21classes, A-17conditional branch options, A-19conditional instructions, H-27hardware vs. software speculation,
221Intel 80x86 integer operations, K-51ISAs, 14MIPS, A-37 to A-38, A-38procedure invocation options,
A-19 to A-20Control hazards
ARM Cortex-A8, 235definition, C-11
Control instructionsIntel 80x86, K-53RISCs
desktop systems, K-12, K-22embedded systems, K-16
VAX, B-73Controllers, historical background,
L-80 to L-81Controller transitions
directory-based, 422snooping cache, 421
Control Processordefinition, 309GPUs, 333SIMD, 10Thread Block Scheduler, 294vector processor, 310, 310–311vector unit structure, 273
Conventional datacenters, vs. WSCs, 436
Convex Exemplar, L-61Convex processors, vector processor
history, G-26Convolution, DSP, E-5Convoy
chained, DAXPY code, G-16DAXPY on VMIPS, G-20strip-mined loop, G-5vector execution time, 269–270vector starting times, G-4
Conway, Lynn, L-28Cooling systems
Google WSC, 465–468mechanical design, 448WSC infrastructure, 448–449
Copper wiringEthernet, F-78interconnection networks, F-9
“Coprocessor operations,” MIPS core extensions, K-21
Copy propagation, definition, H-10 to H-11
Core definition, 15Core plus ASIC, embedded systems,
E-3Correlating branch predictors, branch
costs, 162–163Cosmic Cube, F-100, L-60Cost
Amazon EC2, 458Amazon Web Services, 457bisection bandwidth, F-89branch predictors, 162–167, C-26chip fabrication case study, 61–62cloud computing providers,
471–472disk storage, D-2DRAM/magnetic disk, D-3interconnecting node calculations,
F-31 to F-32, F-35Internet Archive Cluster, D-38 to
D-40internetworking, F-80
  I/O system design/evaluation, D-36
  magnetic storage history, L-78
  MapReduce calculations, 458–459, 459
  memory hierarchy design, 72
  MINs vs. direct networks, F-92
  multiprocessor cost relationship, 409
  multiprocessor linear speedup, 407
  network topology, F-40
  PMDs, 6
  server calculations, 454, 454–455
  server usage, 7
  SIMD supercomputer development, L-43
  speculation, 210
  torus topology interconnections, F-36 to F-38
  tournament predictors, 164–166
  WSC array switch, 443
  WSC vs. datacenters, 455–456
  WSC efficiency, 450–452
  WSC facilities, 472
  WSC network bottleneck, 461
  WSCs, 446–450, 452–455, 453
  WSCs vs. servers, 434
  WSC TCO case study, 476–478
Cost associativity, cloud computing, 460–461
Cost-performancecommercial interconnection
networks, F-63computer trends, 3extensive pipelining, C-80 to C-81IBM eServer p5 processor, 409sorting case study, D-64 to D-67WSC Flash memory, 474–475WSC goals/requirements, 433WSC hardware inactivity, 474WSC processors, 472–473
Cost trendsintegrated circuits, 28–32manufacturing vs. operation, 33overview, 27vs. price, 32–33time, volume, commoditization,
27–28Count register, PowerPC instructions,
K-32 to K-33CP-67 program, L-10
CPA, see Carry-propagate adder (CPA)
CPI, see Clock cycles per instruction (CPI)
CPU, see Central processing unit (CPU)
CRAC, see Computer room air-conditioning (CRAC)
Cray, Seymour, G-25, G-27, L-44, L-47
Cray-1first vector computers, L-44 to L-45peak performance vs. start-up
overhead, 331pipeline depths, G-4RISC history, L-19vector performance, 332vector performance measures, G-16as VMIPS basis, 264, 270–271,
276–277Cray-2
DRAM, G-25first vector computers, L-47tailgating, G-20
Cray-3, G-27Cray-4, G-27Cray C90
first vector computers, L-46, L-48vector performance calculations,
G-8Cray J90, L-48Cray Research T3D, F-86 to F-87,
F-87Cray supercomputers, early computer
arithmetic, J-63 to J-64Cray T3D, F-100, L-60Cray T3E, F-67, F-94, F-100, L-48,
L-60Cray T90, memory bank calculations,
276Cray X1
cluster history, L-63first vector computers, L-46, L-48MSP module, G-22, G-23 to G-24overview, G-21 to G-23peak performance, 58
Cray X1E, F-86, F-91characteristics, G-24
Cray X2, L-46 to L-47first vector computers, L-48 to
L-49
Cray X-MP, L-45first vector computers, L-47
Cray XT3, L-58, L-63Cray XT3 SeaStar, F-63Cray Y-MP
first vector computers, L-45 to L-47
parallel processing debates, L-57vector architecture programming,
281, 281–282CRC, see Cyclic redundancy check
(CRC)Create vector index instruction (CVI),
sparse matrices, G-13Credit-based control flow
InfiniBand, F-74interconnection networks, F-10,
F-17CRISP, L-27Critical path
global code scheduling, H-16trace scheduling, H-19 to H-21, H-20
Critical word first, cache optimization, 86–87
Crossbarscentralized switched networks,
F-30, F-31characteristics, F-73Convex Exemplar, L-61HOL blocking, F-59OCN history, F-104switch microarchitecture, F-62switch microarchitecture
pipelining, F-60 to F-61, F-61
VMIPS, 265Crossbar switch
centralized switched networks, F-30interconnecting node calculations,
F-31 to F-32Cross-company interoperability,
commercial interconnection networks, F-63 to F-64
Crusoe, L-31Cryptanalysis, L-4CSA, see Carry-save adder (CSA);
Carry-skip adder (CSA)C# language, hardware impact on
software development, 4CSSU, see Compare-select-store unit
(CSSU)
CUDA (Compute Unified Device Architecture)
GPU computing history, L-52GPU conditional branching, 303GPUs vs. vector architectures,
310NVIDIA GPU programming,
289PTX, 298, 300sample program, 289–290SIMD instructions, 297terminology, 313–315
CUDA ThreadCUDA programming model, 300,
315definition, 292, 313definitions and terms, 314GPU data addresses, 310GPU Memory structures, 304NVIDIA parallelism, 289–290vs. POSIX Threads, 297PTX Instructions, 298SIMD Instructions, 303Thread Block, 313
Current frame pointer (CFM), IA-64 register model, H-33 to H-34
Custom clustercharacteristics, I-45IBM Blue Gene/L, I-41 to I-44,
I-43 to I-44Cut-through packet switching, F-51
routing comparison, F-54CVI, see Create vector index
instruction (CVI)CWS, see Circulating water system
(CWS)CYBER 180/990, precise exceptions,
C-59CYBER 205
peak performance vs. start-up overhead, 331
vector processor history, G-26 to G-27
CYBER 250, L-45Cycles, processor performance
equation, 49Cycle time, see also Clock cycle time
CPI calculations, 350pipelining, C-81scoreboarding, C-79vector processors, 277
Cyclic redundancy check (CRC)IBM Blue Gene/L 3D torus
network, F-73network interface, F-8
Cydrome Cydra 6, L-30, L-32
D

DaCapo benchmarks
  ISA, 242
  SMT, 230–231, 231
DAMQs, see Dynamically allocatable multi-queues (DAMQs)
DASH multiprocessor, L-61Database program speculation, via
multiple branches, 211Data cache
ARM Cortex-A8, 236cache optimization, B-33, B-38cache performance, B-16GPU Memory, 306ISA, 241locality principle, B-60MIPS R4000 pipeline, C-62 to
C-63multiprogramming, 374page level write-through, B-56RISC processor, C-7structural hazards, C-15TLB, B-46
Data cache missapplications vs. OS, B-59cache optimization, B-25Intel Core i7, 240Opteron, B-12 to B-15sizes and associativities, B-10writes, B-10
Data cache size, multiprogramming, 376–377
DatacentersCDF, 487containers, L-74cooling systems, 449layer 3 network example, 445PUE statistics, 451tier classifications, 491vs. WSC costs, 455–456WSC efficiency measurement,
450–452vs. WSCs, 436
Data dependencesconditional instructions, H-24data hazards, 167–168
dynamically scheduling with scoreboard, C-71
example calculations, H-3 to H-4hazards, 153–154ILP, 150–152ILP hardware model, 214–215ILP limitation studies, 220vector execution time, 269
Data fetchingARM Cortex-A8, 234directory-based cache coherence
protocol example, 382–383
dynamically scheduled pipelines, C-70 to C-71
ILP, instruction bandwidthbasic considerations, 202–203branch-target buffers, 203–206return address predictors,
206–207MIPS R4000, C-63snooping coherence protocols,
355–356Data flow
control dependence, 154–156dynamic scheduling, 168global code scheduling, H-17ILP limitation studies, 220limit, L-33
Data flow execution, hardware-based speculation, 184
Datagrams, see PacketsData hazards
ARM Cortex-A8, 235basic considerations, C-16definition, C-11dependences, 152–154dynamic scheduling, 167–176
basic concept, 168–170examples, 176–178Tomasulo’s algorithm,
170–176, 178–179Tomasulo’s algorithm
loop-based example, 179–181
ILP limitation studies, 220instruction set complications, C-50
to C-51microarchitectural techniques case
study, 247–254MIPS pipeline, C-71RAW, C-57 to C-58
  stall minimization by forwarding, C-16 to C-19, C-18
  stall requirements, C-19 to C-21
  VMIPS, 264
Data-level parallelism (DLP)definition, 9GPUs
basic considerations, 288basic PTX thread instructions,
299conditional branching, 300–303coprocessor relationship,
330–331Fermi GPU architecture
innovations, 305–308Fermi GTX 480 floorplan, 295mapping examples, 293Multimedia SIMD comparison,
312multithreaded SIMD Processor
block diagram, 294NVIDIA computational
structures, 291–297NVIDIA/CUDA and AMD
terminology, 313–315NVIDIA GPU ISA, 298–300NVIDIA GPU Memory
structures, 304, 304–305programming, 288–291SIMD thread scheduling, 297terminology, 292vs. vector architectures,
308–312, 310from ILP, 4–5Multimedia SIMD Extensions
basic considerations, 282–285programming, 285roofline visual performance
model, 285–288, 287and power, 322vector architecture
basic considerations, 264gather/scatter operations,
279–280multidimensional arrays,
278–279multiple lanes, 271–273peak performance vs. start-up
overhead, 331programming, 280–282
vector execution time, 268–271vector-length registers,
274–275vector load-store unit
bandwidth, 276–277vector-mask registers, 275–276vector processor example,
267–268VMIPS, 264–267
vector kernel implementation, 334–336
vector performance and memory bandwidth, 332
vector vs. scalar performance, 331–332
WSCs vs. servers, 433–434Data link layer
definition, F-82interconnection networks, F-10
Data parallelism, SIMD computer history, L-55
Data-race-free, synchronized programs, 394
Data races, synchronized programs, 394Data transfers
cache miss rate calculations, B-16computer architecture, 15desktop RISC instructions, K-10,
K-21embedded RISCs, K-14, K-23gather-scatter, 281, 291instruction operators, A-15Intel 80x86, K-49, K-53 to K-54ISA, 12–13MIPS, addressing modes, A-34MIPS64, K-24 to K-26MIPS64 instruction subset, A-40MIPS64 ISA formats, 14MIPS core extensions, K-20MIPS operations, A-36 to A-37MMX, 283multimedia instruction compiler
support, A-31operands, A-12PTX, 305SIMD extensions, 284“typical” programs, A-43VAX, B-73vector vs. GPU, 300
Data trunks, MIPS scoreboarding, C-75
Data typesarchitect-compiler writer
relationship, A-30dependence analysis, H-10desktop computing, A-2Intel 80x86, K-50MIPS, A-34, A-36MIPS64 architecture, A-34multimedia compiler support, A-31operand types/sizes, A-14 to A-15SIMD Multimedia Extensions,
282–283SPARC, K-31VAX, K-66, K-70
Dauber, Phil, L-28DAXPY loop
chained convoys, G-16on enhanced VMIPS, G-19 to G-21memory bandwidth, 332MIPS/VMIPS calculations,
267–268peak performance vs. start-up
overhead, 331vector performance measures,
G-16VLRs, 274–275on VMIPS, G-19 to G-20VMIPS calculations, G-18VMIPS on Linpack, G-18VMIPS peak performance, G-17
D-cachescase study examples, B-63way prediction, 81–82
DDR, see Double data rate (DDR)Deadlock
cache coherence, 361dimension-order routing, F-47 to
F-48directory protocols, 386Intel SCCC, F-70large-scale multiprocessor cache
coherence, I-34 to I-35, I-38 to I-40
mesh network routing, F-46network routing, F-44routing comparison, F-54synchronization, 388system area network history, F-101
Deadlock avoidancemeshes and hypercubes, F-47routing, F-44 to F-45
Deadlock recovery, routing, F-45Dead time
vector pipeline, G-8vector processor, G-8
Decimal operands, formats, A-14Decimal operations, PA-RISC
instructions, K-35Decision support system (DSS),
shared-memory workloads, 368–369, 369, 369–370
Decoder, radio receiver, E-23Decode stage, TI 320C55 DSP, E-7DEC PDP-11, address space, B-57 to
B-58DECstation 5000, reboot
measurements, F-69DEC VAX
addressing modes, A-10 to A-11, A-11, K-66 to K-68
address space, B-58architect-compiler writer
relationship, A-30branch conditions, A-19branches, A-18
jumps, procedure calls, K-71 to K-72
bubble sort, K-76characteristics, K-42cluster history, L-62, L-72compiler writing-architecture
relationship, A-30control flow instruction branches,
A-18data types, K-66early computer arithmetic, J-63 to
J-64early pipelined CPUs, L-26exceptions, C-44extensive pipelining, C-81failures, D-15flawless architecture design, A-45,
K-81high-level instruction set, A-41 to
A-43high-level language computer
architecture, L-18 to L-19history, 2–3immediate value distribution, A-13instruction classes, B-73instruction encoding, K-68 to
K-70, K-69
instruction execution issues, K-81instruction operator categories,
A-15instruction set complications, C-49
to C-50integer overflow, J-11vs. MIPS, K-82vs. MIPS32 sort, K-80vs. MIPS code, K-75miss rate vs. virtual addressing,
B-37operands, K-66 to K-68operand specifiers, K-68operands per ALU, A-6, A-8operand types/sizes, A-14operation count, K-70 to K-71operations, K-70 to K-72operators, A-15overview, K-65 to K-66precise exceptions, C-59replacement by RISC, 2RISC history, L-20 to L-21RISC instruction set lineage, K-43sort, K-76 to K-79sort code, K-77 to K-79sort register allocation, K-76swap, K-72 to K-76swap code, B-74, K-72, K-74swap full procedure, K-75 to K-76swap and register preservation,
B-74 to B-75unique instructions, K-28
DEC VAX-11/780, L-6 to L-7, L-11, L-18
DEC VAX 8700vs. MIPS M2000, K-82, L-21RISC history, L-21
Dedicated link networkblack box network, F-5 to F-6effective bandwidth, F-17example, F-6
Defect tolerance, chip fabrication cost case study, 61–62
Deferred addressing, VAX, K-67Delayed branch
basic scheme, C-23compiler history, L-31instructions, K-25stalls, C-65
Dell Poweredge servers, prices, 53Dell Poweredge Thunderbird, SAN
characteristics, F-76
Dell serverseconomies of scale, 456real-world considerations, 52–55WSC services, 441
Demodulator, radio receiver, E-23Denormals, J-14 to J-16, J-20 to
J-21floating-point additions, J-26 to
J-27floating-point underflow, J-36
Dense matrix multiplication, LU kernel, I-8
Density-optimized processors, vs. SPEC-optimized, F-85
Dependabilitybenchmark examples, D-21 to
D-23, D-22definition, D-10 to D-11disk operators, D-13 to D-15integrated circuits, 33–36Internet Archive Cluster, D-38 to
D-40memory systems, 104–105WSC goals/requirements, 433WSC memory, 473–474WSC storage, 442–443
Dependence analysisbasic approach, H-5example calculations, H-7limitations, H-8 to H-9
Dependence distance, loop-carried dependences, H-6
Dependencesantidependences, 152, 320, C-72,
C-79CUDA, 290as data dependence, 150data hazards, 167–168definition, 152–153, 315–316dynamically scheduled pipelines,
C-70 to C-71dynamically scheduling with
scoreboard, C-71dynamic scheduling with
Tomasulo’s algorithm, 172
hardware-based speculation, 183
hazards, 153–154ILP, 150–156ILP hardware model, 214–215ILP limitation studies, 220
  loop-level parallelism, 318–322, H-3
    dependence analysis, H-6 to H-10
  MIPS scoreboarding, C-79
  as program properties, 152
  sparse matrices, G-13
  and Tomasulo's algorithm, 170
  types, 150
  vector execution time, 269
  vector mask registers, 275–276
  VMIPS, 268
Dependent computations, elimination, H-10 to H-12
Descriptor privilege level (DPL), segmented virtual memory, B-53
Descriptor table, IA-32, B-52Design faults, storage systems, D-11Desktop computers
characteristics, 6compiler structure, A-24as computer class, 5interconnection networks, F-85memory hierarchy basics, 78multimedia support, E-11multiprocessor importance, 344performance benchmarks, 38–40processor comparison, 242RAID history, L-80RISC systems
addressing modes, K-5addressing modes and
instruction formats, K-5 to K-6
arithmetic/logical instructions, K-22
conditional branches, K-17constant extension, K-9control instructions, K-12conventions, K-13data transfer instructions, K-10,
K-21examples, K-3, K-4features, K-44FP instructions, K-13, K-23instruction formats, K-7multimedia extensions, K-16 to
K-19, K-18system characteristics, E-4
Destination offset, IA-32 segment, B-53
Deterministic routing algorithmvs. adaptive routing, F-52 to F-55,
F-54DOR, F-46
Diesembedded systems, E-15integrated circuits, 28–30, 29Nehalem floorplan, 30wafer example, 31, 31–32
Die yield, basic equation, 30–31Digital Alpha
branches, A-18conditional instructions, H-27early pipelined CPUs, L-27RISC history, L-21RISC instruction set lineage, K-43synchronization history, L-64
Digital Alpha 21064, L-48Digital Alpha 21264
cache hierarchy, 368floorplan, 143
Digital Alpha MAXcharacteristics, K-18multimedia support, K-18
Digital Alpha processorsaddressing modes, K-5arithmetic/logical instructions, K-11branches, K-21conditional branches, K-12, K-17constant extension, K-9control flow instruction branches,
A-18conventions, K-13data transfer instructions, K-10displacement addressing mode,
A-12exception stopping/restarting, C-47FP instructions, K-23immediate value distribution, A-13MAX, multimedia support, E-11MIPS precise exceptions, C-59multimedia support, K-19recent advances, L-33as RISC systems, K-4shared-memory workload,
367–369unique instructions, K-27 to K-29
Digital Linear Tape, L-77Digital signal processor (DSP)
cell phones, E-23, E-23, E-23 to E-24
definition, E-3
desktop multimedia support, E-11embedded RISC extensions, K-19examples and characteristics, E-6media extensions, E-10 to E-11overview, E-5 to E-7saturating operations, K-18 to
K-19TI TMS320C6x, E-8 to E-10TI TMS320C6x instruction packet,
E-10TI TMS320C55, E-6 to E-7, E-7 to
E-8TI TMS320C64x, E-9
Dimension-order routing (DOR), definition, F-46
DIMMs, see Dual inline memory modules (DIMMs)
Direct attached disks, definition, D-35Direct-mapped cache
address parts, B-9address translation, B-38block placement, B-7early work, L-10memory hierarchy basics, 74memory hierarchy, B-48optimization, 79–80
Direct memory access (DMA)historical background, L-81InfiniBand, F-76network interface functions,
F-7Sanyo VPC-SX500 digital camera,
E-19Sony PlayStation 2 Emotion
Engine, E-18TI TMS320C55 DSP, E-8zero-copy protocols, F-91
Direct networkscommercial system topologies,
F-37vs. high-dimensional networks,
F-92vs. MIN costs, F-92topology, F-34 to F-40
Directory-based cache coherenceadvanced directory protocol case
study, 420–426basic considerations, 378–380case study, 418–420definition, 354distributed-memory
multiprocessor, 380
large-scale multiprocessor history, L-61
latencies, 425protocol basics, 380–382protocol example, 382–386state transition diagram, 383
Directory-based multiprocessorcharacteristics, I-31performance, I-26scientific workloads, I-29synchronization, I-16, I-19 to I-20
Directory controller, cache coherence, I-40 to I-41
Dirty bitcase study, D-61 to D-64definition, B-11virtual memory fast address
translation, B-46Dirty block
definition, B-11read misses, B-36
Discrete cosine transform, DSP, E-5Disk arrays
deconstruction case study, D-51 to D-55, D-52 to D-55
RAID 6, D-8 to D-9RAID 10, D-8RAID levels, D-6 to D-8, D-7
Disk layout, RAID performance prediction, D-57 to D-59
Disk power, basic considerations, D-5Disk storage
access time gap, D-3areal density, D-2 to D-5cylinders, D-5deconstruction case study, D-48 to
D-51, D-50DRAM/magnetic disk cost vs.
access time, D-3intelligent interfaces, D-4internal microprocessors, D-4real faults and failures, D-10 to
D-11throughput vs. command queue
depth, D-4Disk technology
failure rate calculation, 48Google WSC servers, 469performance trends, 19–20, 20WSC Flash memory, 474–475
Dispatch stageinstruction steps, 174
microarchitectural techniques case study, 247–254
Displacement addressing modebasic considerations, A-10MIPS, 12MIPS data transfers, A-34MIPS instruction format, A-35value distributions, A-12VAX, K-67
Display lists, Sony PlayStation 2 Emotion Engine, E-17
Distributed routing, basic concept, F-48
Distributed shared memory (DSM)basic considerations, 378–380basic structure, 347–348, 348characteristics, I-45directory-based cache coherence,
354, 380, 418–420multichip multicore
multiprocessor, 419snooping coherence protocols,
355Distributed shared-memory
multiprocessorscache coherence implementation,
I-36 to I-37scientific application performance,
I-26 to I-32, I-28 to I-32Distributed switched networks,
topology, F-34 to F-40Divide operations
chip comparison, J-60 to J-61floating-point, stall, C-68floating-point iterative, J-27 to
J-31integers, speedup
radix-2 division, J-55radix-4 division, J-56radix-4 SRT division, J-57with single adder, J-54 to J-58
integer shifting over zeros, J-45 to J-47
language comparison, J-12n-bit unsigned integers, J-4PA-RISC instructions, K-34 to
K-35Radix-2, J-4 to J-7restoring/nonrestoring, J-6SRT division, J-45 to J-47, J-46unfinished instructions, 179
DLP, see Data-level parallelism (DLP)
DLXinteger arithmetic, J-12vs. Intel 80x86 operations, K-62,
K-63 to K-64DMA, see Direct memory access
(DMA)DOR, see Dimension-order routing
(DOR)Double data rate (DDR)
ARM Cortex-A8, 117DRAM performance, 100DRAMs and DIMMS, 101Google WSC servers, 468–469IBM Blue Gene/L, I-43InfiniBand, F-77Intel Core i7, 121SDRAMs, 101
Double data rate 2 (DDR2), SDRAM timing diagram, 139
Double data rate 3 (DDR3)DRAM internal organization, 98GDRAM, 102Intel Core i7, 118SDRAM power consumption, 102,
103Double data rate 4 (DDR4), DRAM,
99Double data rate 5 (DDR5), GDRAM,
102Double-extended floating-point
arithmetic, J-33 to J-34Double failures, RAID reconstruction,
D-55 to D-57Double-precision floating point
add-divide, C-68AVX for x86, 284chip comparison, J-58data access benchmarks, A-15DSP media extensions, E-10 to
E-11Fermi GPU architecture, 306floating-point pipeline, C-65GTX 280, 325, 328–330IBM 360, 171MIPS, 285, A-38 to A-39MIPS data transfers, A-34MIPS registers, 12, A-34Multimedia SIMD vs. GPUs, 312operand sizes/types, 12as operand type, A-13 to A-14operand usage, 297pipeline timing, C-54
I-22 ■ Index
Double-precision (continued)
  Roofline model, 287, 326
  SIMD Extensions, 283
  VMIPS, 266, 266–267
Double rounding
  FP precisions, J-34
  FP underflow, J-37
Double words
  aligned/misaligned addresses, A-8
  data access benchmarks, A-15
  Intel 80x86, K-50
  memory address interpretation, A-7 to A-8
  MIPS data types, A-34
  operand types/sizes, 12, A-14
  stride, 278
DPL, see Descriptor privilege level (DPL)
DRAM, see Dynamic random-access memory (DRAM)
DRDRAM, Sony PlayStation 2, E-16 to E-17
Driver domains, Xen VM, 111
DSM, see Distributed shared memory (DSM)
DSP, see Digital signal processor (DSP)
DSS, see Decision support system (DSS)
Dual inline memory modules (DIMMs)
  clock rates, bandwidth, names, 101
  DRAM basics, 99
  Google WSC server, 467
  Google WSC servers, 468–469
  graphics memory, 322–323
  Intel Core i7, 118, 121
  Intel SCCC, F-70
  SDRAMs, 101
  WSC memory, 473–474
Dual SIMD Thread Scheduler, example, 305–306
DVFS, see Dynamic voltage-frequency scaling (DVFS)
Dynamically allocatable multi-queues (DAMQs), switch microarchitecture, F-56 to F-57
Dynamically scheduled pipelines
  basic considerations, C-70 to C-71
  with scoreboard, C-71 to C-80
Dynamically shared libraries, control flow instruction addressing modes, A-18
Dynamic energy, definition, 23
Dynamic network reconfiguration, fault tolerance, F-67 to F-68
Dynamic power
  energy efficiency, 211
  microprocessors, 23
  vs. static power, 26
Dynamic random-access memory (DRAM)
  bandwidth issues, 322–323
  characteristics, 98–100
  clock rates, bandwidth, names, 101
  cost vs. access time, D-3
  cost trends, 27
  Cray X1, G-22
  CUDA, 290
  dependability, 104
  disk storage, D-3 to D-4
  embedded benchmarks, E-13
  errors and faults, D-11
  first vector computers, L-45, L-47
  Flash memory, 103–104
  Google WSC servers, 468–469
  GPU SIMD instructions, 296
  IBM Blue Gene/L, I-43 to I-44
  improvement over time, 17
  integrated circuit costs, 28
  Intel Core i7, 121
  internal organization, 98
  magnetic storage history, L-78
  memory hierarchy design, 73, 73
  memory performance, 100–102
  multibanked caches, 86
  NVIDIA GPU Memory structures, 305
  performance milestones, 20
  power consumption, 63
  real-world server considerations, 52–55
  Roofline model, 286
  server energy savings, 25
  Sony PlayStation 2, E-16, E-17
  speed trends, 99
  technology trends, 17
  vector memory systems, G-9
  vector processor, G-25
  WSC efficiency measurement, 450
  WSC memory costs, 473–474
  WSC memory hierarchy, 444–445
  WSC power modes, 472
  yield, 32
Dynamic scheduling
  first use, L-27
  ILP
    basic concept, 168–169
    definition, 168
    example and algorithms, 176–178
    with multiple issue and speculation, 197–202
    overcoming data hazards, 167–176
    Tomasulo’s algorithm, 170–176, 178–179, 181–183
  MIPS scoreboarding, C-79
  SMT on superscalar processors, 230
  and unoptimized code, C-81
Dynamic voltage-frequency scaling (DVFS)
  energy efficiency, 25
  Google WSC, 467
  processor performance equation, 52
Dynamo (Amazon), 438, 452

E
Early restart, miss penalty reduction, 86
Earth Simulator, L-46, L-48, L-63
EBS, see Elastic Block Storage (EBS)
EC2, see Amazon Elastic Computer Cloud (EC2)
ECC, see Error-Correcting Code (ECC)
Eckert, J. Presper, L-2 to L-3, L-5, L-19
Eckert-Mauchly Computer Corporation, L-4 to L-5, L-56
ECL minicomputer, L-19
Economies of scale
  WSC vs. datacenter costs, 455–456
  WSCs, 434
EDSAC (Electronic Delay Storage Automatic Calculator), L-3
EDVAC (Electronic Discrete Variable Automatic Computer), L-2 to L-3
EEMBC, see Electronic Design News Embedded Microprocessor Benchmark Consortium (EEMBC)
EEPROM (Electronically Erasable Programmable Read-Only Memory)
  compiler-code size considerations, A-44
  Flash Memory, 102–104
  memory hierarchy design, 72
Effective address
  ALU, C-7, C-33
  data dependences, 152
  definition, A-9
  execution/effective address cycle, C-6, C-31 to C-32, C-63
  hardware-based speculation, 186, 190, 192
  load interlocks, C-39
  load-store, 174, 176, C-4
  RISC instruction set, C-4 to C-5
  simple MIPS implementation, C-31 to C-32
  simple RISC implementation, C-6
  TLB, B-49
  Tomasulo’s algorithm, 173, 178, 182
Effective bandwidth
  definition, F-13
  example calculations, F-18
  vs. interconnected nodes, F-28
  interconnection networks
    multi-device networks, F-25 to F-29
    two-device networks, F-12 to F-20
  vs. packet size, F-19
Efficiency factor, F-52
Eight-way set associativity
  ARM Cortex-A8, 114
  cache optimization, B-29
  conflict misses, B-23
  data cache misses, B-10
Elapsed time, execution time, 36
Elastic Block Storage (EBS), MapReduce cost calculations, 458–460, 459
Electronically Erasable Programmable Read-Only Memory, see EEPROM (Electronically Erasable Programmable Read-Only Memory)
Electronic Delay Storage Automatic Calculator (EDSAC), L-3
Electronic Design News Embedded Microprocessor Benchmark Consortium (EEMBC)
  benchmark classes, E-12
  ISA code size, A-44
  kernel suites, E-12
  performance benchmarks, 38
  power consumption and efficiency metrics, E-13
Electronic Discrete Variable Automatic Computer (EDVAC), L-2 to L-3
Electronic Numerical Integrator and Calculator (ENIAC), L-2 to L-3, L-5 to L-6, L-77
Element group, definition, 272
Embedded multiprocessors, characteristics, E-14 to E-15
Embedded systems
  benchmarks
    basic considerations, E-12
    power consumption and efficiency, E-13
  cell phone case study
    Nokia circuit board, E-24
    overview, E-20
    phone block diagram, E-23
    phone characteristics, E-22 to E-24
    radio receiver, E-23
    standards and evolution, E-25
    wireless networks, E-21 to E-22
  characteristics, 8–9, E-4
  as computer class, 5
  digital signal processors
    definition, E-3
    desktop multimedia support, E-11
    examples and characteristics, E-6
    media extensions, E-10 to E-11
    overview, E-5 to E-7
    TI TMS320C6x, E-8 to E-10
    TI TMS320C6x instruction packet, E-10
    TI TMS320C55, E-6 to E-7, E-7 to E-8
    TI TMS320C64x, E-9
  EEMBC benchmark suite, E-12
  overview, E-2
  performance, E-13 to E-14
  real-time processing, E-3 to E-5
  RISC systems
    addressing modes, K-6
    addressing modes and instruction formats, K-5 to K-6
    arithmetic/logical instructions, K-24
    conditional branches, K-17
    constant extension, K-9
    control instructions, K-16
    conventions, K-16
    data transfer instructions, K-14, K-23
    DSP extensions, K-19
    examples, K-3, K-4
    instruction formats, K-8
    multiply-accumulate, K-20
  Sanyo digital camera SOC, E-20
  Sanyo VPC-SX500 digital camera case study, E-19
  Sony PlayStation 2 block diagram, E-16
  Sony PlayStation 2 Emotion Engine case study, E-15 to E-18
  Sony PlayStation 2 Emotion Engine organization, E-18
EMC, L-80
Emotion Engine
  organization modes, E-18
  Sony PlayStation 2 case study, E-15 to E-18
empowerTel Networks, MXP processor, E-14
Encoding
  control flow instructions, A-18
  erasure encoding, 439
  instruction set, A-21 to A-24, A-22
  Intel 80x86 instructions, K-55, K-58
Encoding (continued)
  ISAs, 14, A-5 to A-6
  MIPS ISA, A-33
  MIPS pipeline, C-36
  opcode, A-13
  VAX instructions, K-68 to K-70, K-69
  VLIW model, 195–196
Encore Multimax, L-59
End-to-end flow control
  congestion management, F-65
  vs. network-only features, F-94 to F-95
Energy efficiency, see also Power consumption
  Climate Savers Computing Initiative, 462
  embedded benchmarks, E-13
  hardware fallacies, 56
  ILP exploitation, 201
  Intel Core i7, 401–405
  ISA, 241–243
  microprocessor, 23–26
  PMDs, 6
  processor performance equation, 52
  servers, 25
  and speculation, 211–212
  system trends, 21–23
  WSC, measurement, 450–452
  WSC goals/requirements, 433
  WSC infrastructure, 447–449
  WSC servers, 462–464
Energy proportionality, WSC servers, 462
Engineering Research Associates (ERA), L-4 to L-5
ENIAC (Electronic Numerical Integrator and Calculator), L-2 to L-3, L-5 to L-6, L-77
Enigma coding machine, L-4
Entry time, transactions, D-16, D-17
Environmental faults, storage systems, D-11
EPIC approach
  historical background, L-32
  IA-64, H-33
  VLIW processors, 194, 196
Equal condition code, PowerPC, K-10 to K-11
ERA, see Engineering Research Associates (ERA)
Erasure encoding, WSCs, 439
Error-Correcting Code (ECC)
  disk storage, D-11
  fault detection pitfalls, 58
  Fermi GPU architecture, 307
  hardware dependability, D-15
  memory dependability, 104
  RAID 2, D-6
  and WSCs, 473–474
Error handling, interconnection networks, F-12
Errors, definition, D-10 to D-11
Escape resource set, F-47
ETA processor, vector processor history, G-26 to G-27
Ethernet
  and bandwidth, F-78
  commercial interconnection networks, F-63
  cross-company interoperability, F-64
  interconnection networks, F-89
  as LAN, F-77 to F-79
  LAN history, F-99
  LANs, F-4
  packet format, F-75
  shared-media networks, F-23
  shared- vs. switched-media networks, F-22
  storage area network history, F-102
  switch vs. NIC, F-86
  system area networks, F-100
  total time statistics, F-90
  WAN history, F-98
Ethernet switches
  architecture considerations, 16
  Dell servers, 53
  Google WSC, 464–465, 469
  historical performance milestones, 20
  WSCs, 441–444
European Center for Particle Research (CERN), F-98
Even/odd array
  example, J-52
  integer multiplication, J-52
EVEN-ODD scheme, development, D-10
EX, see Execution address cycle (EX)
Example calculations
  average memory access time, B-16 to B-17
  barrier synchronization, I-15
  block size and average memory access time, B-26 to B-28
  branch predictors, 164
  branch schemes, C-25 to C-26
  branch-target buffer branch penalty, 205–206
  bundles, H-35 to H-36
  cache behavior impact, B-18, B-21
  cache hits, B-5
  cache misses, 83–84, 93–95
  cache organization impact, B-19 to B-20
  carry-lookahead adder, J-39
  chime approximation, G-2
  compiler-based speculation, H-29 to H-31
  conditional instructions, H-23 to H-24
  CPI and FP, 50–51
  credit-based control flow, F-10 to F-11
  crossbar switch interconnections, F-31 to F-32
  data dependences, H-3 to H-4
  DAXPY on VMIPS, G-18 to G-20
  dependence analysis, H-7 to H-8
  deterministic vs. adaptive routing, F-52 to F-55
  dies, 29
  die yield, 31
  dimension-order routing, F-47 to F-48
  disk subsystem failure rates, 48
  fault tolerance, F-68
  fetch-and-increment barrier, I-20 to I-21
  FFT, I-27 to I-29
  fixed-point arithmetic, E-5 to E-6
  floating-point addition, J-24 to J-25
  floating-point square root, 47–48
  GCD test, 319, H-7
  geometric means, 43–44
  hardware-based speculation, 200–201
  inclusion, 397
  information tables, 176–177
  integer multiplication, J-9
  interconnecting node costs, F-35
  interconnection network latency and effective bandwidth, F-26 to F-28
  I/O system utilization, D-26
  L1 cache speed, 80
  large-scale multiprocessor locks, I-20
  large-scale multiprocessor synchronization, I-12 to I-13
  loop-carried dependences, 316, H-4 to H-5
  loop-level parallelism, 317
  loop-level parallelism dependences, 320
  loop unrolling, 158–160
  MapReduce cost on EC2, 458–460
  memory banks, 276
  microprocessor dynamic energy/power, 23
  MIPS/VMIPS for DAXPY loop, 267–268
  miss penalty, B-33 to B-34
  miss rates, B-6, B-31 to B-32
  miss rates and cache sizes, B-29 to B-30
  miss support, 85
  M/M/1 model, D-33
  MTTF, 34–35
  multimedia instruction compiler support, A-31 to A-32
  multiplication algorithm, J-19
  network effective bandwidth, F-18
  network topologies, F-41 to F-43
  Ocean application, I-11 to I-12
  packet latency, F-14 to F-15
  parallel processing, 349–350, I-33 to I-34
  pipeline execution rate, C-10 to C-11
  pipeline structural hazards, C-14 to C-15
  power-performance benchmarks, 439–440
  predicated instructions, H-25
  processor performance comparison, 218–219
  queue I/O requests, D-29
  queue waiting time, D-28 to D-29
  queuing, D-31
  radix-4 SRT division, J-56
  redundant power supply reliability, 35
  ROB commit, 187
  ROB instructions, 189
  scoreboarding, C-77
  sequential consistency, 393
  server costs, 454–455
  server power, 463
  signed-digit numbers, J-53
  signed numbers, J-7
  SIMD multimedia instructions, 284–285
  single-precision numbers, J-15, J-17
  software pipelining, H-13 to H-14
  speedup, 47
  status tables, 178
  strides, 279
  TB-80 cluster MTTF, D-41
  TB-80 IOPS, D-39 to D-40
  torus topology interconnections, F-36 to F-38
  true sharing misses and false sharing, 366–367
  VAX instructions, K-67
  vector memory systems, G-9
  vector performance, G-8
  vector vs. scalar operation, G-19
  vector sequence chimes, 270
  VLIW processors, 195
  VMIPS vector operation, G-6 to G-7
  way selection, 82
  write buffer and read misses, B-35 to B-36
  write vs. no-write allocate, B-12
  WSC memory latency, 445
  WSC running service availability, 434–435
  WSC server data transfer, 446
Exceptions
  ALU instructions, C-4
  architecture-specific examples, C-44
  categories, C-46
  control dependence, 154–155
  floating-point arithmetic, J-34 to J-35
  hardware-based speculation, 190
  imprecise, 169–170, 188
  long latency pipelines, C-55
  MIPS, C-48, C-48 to C-49
  out-of-order completion, 169–170
  precise, C-47, C-58 to C-60
  preservation via hardware support, H-28 to H-32
  return address buffer, 207
  ROB instructions, 190
  speculative execution, 222
  stopping/restarting, C-46 to C-47
  types and requirements, C-43 to C-46
Execute step
  instruction steps, 174
  Itanium 2, H-42
  ROB instruction, 186
  TI 320C55 DSP, E-7
Execution address cycle (EX)
  basic MIPS pipeline, C-36
  data hazards requiring stalls, C-21
  data hazard stall minimization, C-17
  exception stopping/restarting, C-46 to C-47
  hazards and forwarding, C-56 to C-57
  MIPS FP operations, basic considerations, C-51 to C-53
  MIPS pipeline, C-52
  MIPS pipeline control, C-36 to C-39
  MIPS R4000, C-63 to C-64, C-64
  MIPS scoreboarding, C-72, C-74, C-77
  out-of-order execution, C-71
  pipeline branch issues, C-40, C-42
  RISC classic pipeline, C-10
  simple MIPS implementation, C-31 to C-32
  simple RISC implementation, C-6
Execution time
  Amdahl’s law, 46–47, 406
  application/OS misses, B-59
  cache performance, B-3 to B-4, B-16
  calculation, 36
  commercial workloads, 369–370, 370
  energy efficiency, 211
  integrated circuits, 22
  loop unrolling, 160
  multilevel caches, B-32 to B-34
  multiprocessor performance, 405–406
  multiprogrammed parallel “make” workload, 375
  multithreading, 232
Execution time (continued)
  performance equations, B-22
  pipelining performance, C-3, C-10 to C-11
  PMDs, 6
  principle of locality, 45
  processor comparisons, 243
  processor performance equation, 49, 51
  reduction, B-19
  second-level cache size, B-34
  SPEC benchmarks, 42–44, 43, 56
  and stall time, B-21
  vector length, G-7
  vector mask registers, 276
  vector operations, 268–271
Expand-down field, B-53
Explicit operands, ISA classifications, A-3 to A-4
Explicit parallelism, IA-64, H-34 to H-35
Explicit unit stride, GPUs vs. vector architectures, 310
Exponential back-off
  large-scale multiprocessor synchronization, I-17
  spin lock, I-17
Exponential distribution, definition, D-27
Extended accumulator
  flawed architectures, A-44
  ISA classification, A-3

F
Facebook, 460
Failures, see also Mean time between failures (MTBF); Mean time to failure (MTTF)
  Amdahl’s law, 56
  Berkeley’s Tertiary Disk project, D-12
  cloud computing, 455
  definition, D-10
  dependability, 33–35
  dirty bits, D-61 to D-64
  DRAM, 473
  example calculation, 48
  Google WSC networking, 469–470
  power failure, C-43 to C-44, C-46
  power utilities, 435
  RAID reconstruction, D-55 to D-57
  RAID row-diagonal parity, D-9
  rate calculations, 48
  servers, 7, 434
  SLA states, 34
  storage system components, D-43
  storage systems, D-6 to D-10
  TDP, 22
  Tertiary Disk, D-13
  WSC running service, 434–435
  WSCs, 8, 438–439
  WSC storage, 442–443
False sharing
  definition, 366–367
  shared-memory workload, 373
FarmVille, 460
Fast Fourier transformation (FFT)
  characteristics, I-7
  distributed-memory multiprocessor, I-32
  example calculations, I-27 to I-29
  symmetric shared-memory multiprocessors, I-22, I-23, I-25
Fast traps, SPARC instructions, K-30
Fat trees
  definition, F-34
  NEWS communication, F-43
  routing algorithms, F-48
  SAN characteristics, F-76
  topology, F-38 to F-39
  torus topology interconnections, F-36 to F-38
Fault detection, pitfalls, 57–58
Fault-induced deadlock, routing, F-44
Faulting prefetches, cache optimization, 92
Faults, see also Exceptions; Page faults
  address fault, B-42
  definition, D-10
  and dependability, 33
  dependability benchmarks, D-21
  programming mistakes, D-11
  storage systems, D-6 to D-10
  Tandem Computers, D-12 to D-13
  VAX systems, C-44
Fault tolerance
  and adaptive routing, F-94
  commercial interconnection networks, F-66 to F-69
  DECstation 5000 reboots, F-69
  dependability benchmarks, D-21
  RAID, D-7
  SAN example, F-74
  WSC memory, 473–474
  WSC network, 461
Fault-tolerant routing, commercial interconnection networks, F-66 to F-67
FC, see Fibre Channel (FC)
FC-AL, see Fibre Channel Arbitrated Loop (FC-AL)
FC-SW, see Fibre Channel Switched (FC-SW)
Feature size
  dependability, 33
  integrated circuits, 19–21
FEC, see Forward error correction (FEC)
Federal Communications Commission (FCC), telephone company outages, D-15
Fermi GPU
  architectural innovations, 305–308
  future features, 333
  Grid mapping, 293
  multithreaded SIMD Processor, 307
  NVIDIA, 291, 305
  SIMD, 296–297
  SIMD Thread Scheduler, 306
Fermi Tesla, GPU computing history, L-52
Fermi Tesla GTX 280
  GPU comparison, 324–325, 325
  memory bandwidth, 328
  raw/relative GPU performance, 328
  synchronization, 329
  weaknesses, 330
Fermi Tesla GTX 480
  floorplan, 295
  GPU comparisons, 323–330, 325
Fetch-and-increment
  large-scale multiprocessor synchronization, I-20 to I-21
  sense-reversing barrier, I-21
  synchronization, 388
Fetching, see Data fetching
Fetch stage, TI 320C55 DSP, E-7
FFT, see Fast Fourier transformation (FFT)
Fibre Channel (FC), F-64, F-67, F-102
  file system benchmarking, D-20
  NetApp FAS6000 filer, D-42
Fibre Channel Arbitrated Loop (FC-AL), F-102
  block servers vs. filers, D-35
  SCSI history, L-81
Fibre Channel Switched (FC-SW), F-102
Field-programmable gate arrays (FPGAs), WSC array switch, 443
FIFO, see First-in first-out (FIFO)
Filers
  vs. block servers, D-34 to D-35
  NetApp FAS6000 filer, D-41 to D-42
Filer servers, SPEC benchmarking, D-20 to D-21
Filters, radio receiver, E-23
Fine-grained multithreading
  definition, 224–226
  Sun T1 effectiveness, 226–229
Fingerprint, storage system, D-49
Finite-state machine, routing implementation, F-57
Firmware, network interfaces, F-7
First-in first-out (FIFO)
  block replacement, B-9
  cache misses, B-10
  definition, D-26
  Tomasulo’s algorithm, 173
First-level caches, see also L1 caches
  ARM Cortex-A8, 114
  cache optimization, B-30 to B-32
  hit time/power reduction, 79–80
  inclusion, B-35
  interconnection network, F-87
  Itanium 2, H-41
  memory hierarchy, B-48 to B-49
  miss rate calculations, B-31 to B-35
  parameter ranges, B-42
  technology trends, 18
  virtual memory, B-42
First-reference misses, definition, B-23
FIT rates, WSC memory, 473–474
Fixed-field decoding, simple RISC implementation, C-6
Fixed-length encoding
  general-purpose registers, A-6
  instruction sets, A-22
  ISAs, 14
Fixed-length vector
  SIMD, 284
  vector registers, 264
Fixed-point arithmetic, DSP, E-5 to E-6
Flags
  performance benchmarks, 37
  performance reporting, 41
  scoreboarding, C-75
Flash memory
  characteristics, 102–104
  dependability, 104
  disk storage, D-3 to D-4
  embedded benchmarks, E-13
  memory hierarchy design, 72
  technology trends, 18
  WSC cost-performance, 474–475
FLASH multiprocessor, L-61
Flexible chaining
  vector execution time, 269
  vector processor, G-11
Floating-point (FP) operations
  addition
    denormals, J-26 to J-27
    overview, J-21 to J-25
    rules, J-24
    speedup, J-25 to J-26
  arithmetic intensity, 285–288, 286
  branch condition evaluation, A-19
  branches, A-20
  cache misses, 83–84
  chip comparison, J-58
  control flow instructions, A-21
  CPI calculations, 50–51
  data access benchmarks, A-15
  data dependences, 151
  data hazards, 169
  denormal multiplication, J-20 to J-21
  denormals, J-14 to J-15
  desktop RISCs, K-13, K-17, K-23
  DSP media extensions, E-10 to E-11
  dynamic scheduling with Tomasulo’s algorithm, 171–172, 173
  early computer arithmetic, J-64 to J-65
  exceptions, J-34 to J-35
  exception stopping/restarting, C-47
  fused multiply-add, J-32 to J-33
  IBM 360, K-85
  IEEE 754 FP standard, J-16
  ILP exploitation, 197–199
  ILP exposure, 157–158
  ILP in perfect processor, 215
  ILP for realizable processors, 216–218
  independent, C-54
  instruction operator categories, A-15
  integer conversions, J-62
  Intel Core i7, 240, 241
  Intel 80x86, K-52 to K-55, K-54, K-61
  Intel 80x86 registers, K-48
  ISA performance and efficiency prediction, 241
  Itanium 2, H-41
  iterative division, J-27 to J-31
  latencies, 157
  and memory bandwidth, J-62
  MIPS, A-38 to A-39
  Tomasulo’s algorithm, 173
  MIPS exceptions, C-49
  MIPS operations, A-35
  MIPS pipeline, C-52
    basic considerations, C-51 to C-54
    execution, C-71
    performance, C-60 to C-61, C-61
    scoreboarding, C-72
    stalls, C-62
  MIPS precise exceptions, C-58 to C-60
  MIPS R4000, C-65 to C-67, C-66 to C-67
  MIPS scoreboarding, C-77
  MIPS with scoreboard, C-73
  misspeculation instructions, 212
  Multimedia SIMD Extensions, 285
  multimedia support, K-19
  multiple lane vector unit, 273
  multiple outstanding, C-54
  multiplication
    examples, J-19
    overview, J-17 to J-20
  multiplication precision, J-21
  number representation, J-15 to J-16
  operand sizes/types, 12
  overflow, J-11
  overview, J-13 to J-14
  parallelism vs. window size, 217
Floating-point operations (continued)
  pipeline hazards and forwarding, C-55 to C-57
  pipeline structural hazards, C-16
  precisions, J-33 to J-34
  remainder, J-31 to J-32
  ROB commit, 187
  SMT, 398–400
  SPARC, K-31
  SPEC benchmarks, 39
  special values, J-14 to J-15
  stalls from RAW hazards, C-55
  static branch prediction, C-26 to C-27
  Tomasulo’s algorithm, 185
  underflow, J-36 to J-37, J-62
  VAX, B-73
  vector chaining, G-11
  vector sequence chimes, 270
  VLIW processors, 195
  VMIPS, 264
Floating-point registers (FPRs)
  IA-64, H-34
  IBM Blue Gene/L, I-42
  MIPS data transfers, A-34
  MIPS operations, A-36
  MIPS64 architecture, A-34
  write-back, C-56
Floating-point square root (FPSQR)
  calculation, 47–48
  CPI calculations, 50–51
Floating Point Systems AP-120B, L-28
Floppy disks, L-78
Flow-balanced state, D-23
Flow control
  and arbitration, F-21
  congestion management, F-65
  direct networks, F-38 to F-39
  format, F-58
  interconnection networks, F-10 to F-11
  system area network history, F-100 to F-101
Fluent, F-76, F-77
Flush, branch penalty reduction, C-22
FM, see Frequency modulation (FM)
Form factor, interconnection networks, F-9 to F-12
FORTRAN
  compiler types and classes, A-28
  compiler vectorization, G-14, G-15
  dependence analysis, H-6
  integer division/remainder, J-12
  loop-level parallelism dependences, 320–321
  MIPS scoreboarding, C-77
  performance measurement history, L-6
  return address predictors, 206
Forward error correction (FEC), DSP, E-5 to E-7
Forwarding, see also Bypassing
  ALUs, C-40 to C-41
  data hazard stall minimization, C-16 to C-19, C-18
  dynamically scheduled pipelines, C-70 to C-71
  load instruction, C-20
  longer latency pipelines, C-54 to C-58
  operand, C-19
Forwarding table
  routing implementation, F-57
  switch microarchitecture pipelining, F-60
Forward path, cell phones, E-24
Fourier-Motzkin algorithm, L-31
Fourier transform, DSP, E-5
Four-way conflict misses, definition, B-23
FP, see Floating-point (FP) operations
FPGAs, see Field-programmable gate arrays (FPGAs)
FPRs, see Floating-point registers (FPRs)
FPSQR, see Floating-point square root (FPSQR)
Frame pointer, VAX, K-71
Freeze, branch penalty reduction, C-22
Frequency modulation (FM), wireless networks, E-21
Front-end stage, Itanium 2, H-42
FU, see Functional unit (FU)
Fujitsu Primergy BX3000 blade server, F-85
Fujitsu VP100, L-45, L-47
Fujitsu VP200, L-45, L-47
Full access
  dimension-order routing, F-47 to F-48
  interconnection network topology, F-29
Full adders, J-2, J-3
Fully associative cache
  block placement, B-7
  conflict misses, B-23
  direct-mapped cache, B-9
  memory hierarchy basics, 74
Fully connected topology
  distributed switched networks, F-34
  NEWS communication, F-43
Functional hazards
  ARM Cortex-A8, 233
  microarchitectural techniques case study, 247–254
Functional unit (FU)
  FP operations, C-66
  instruction execution example, C-80
  Intel Core i7, 237
  Itanium 2, H-41 to H-43
  latencies, C-53
  MIPS pipeline, C-52
  MIPS scoreboarding, C-75 to C-80
  OCNs, F-3
  vector add instruction, 272, 272–273
  VMIPS, 264
Function calls
  GPU programming, 289
  NVIDIA GPU Memory structures, 304–305
  PTX assembler, 301
Function pointers, control flow instruction addressing modes, A-18
Fused multiply-add, floating point, J-32 to J-33
Future file, precise exceptions, C-59

G
Gateways, Ethernet, F-79
Gather-Scatter
  definition, 309
  GPU comparisons, 329
  multimedia instruction compiler support, A-31
  sparse matrices, G-13 to G-14
  vector architectures, 279–280
GCD, see Greatest common divisor (GCD) test
GDDR, see Graphics double data rate (GDDR)
GDRAM, see Graphics dynamic random-access memory (GDRAM)
GE 645, L-9
General-Purpose Computing on GPUs (GPGPU), L-51 to L-52
General-purpose electronic computers, historical background, L-2 to L-4
General-purpose registers (GPRs)
  advantages/disadvantages, A-6
  IA-64, H-38
  Intel 80x86, K-48
  ISA classification, A-3 to A-5
  MIPS data transfers, A-34
  MIPS operations, A-36
  MIPS64, A-34
  VMIPS, 265
GENI, see Global Environment for Network Innovation (GENI)
Geometric means, example calculations, 43–44
GFS, see Google File System (GFS)
Gibson mix, L-6
Giga Thread Engine, definition, 292, 314
Global address space, segmented virtual memory, B-52
Global code scheduling
  example, H-16
  parallelism, H-15 to H-23
  superblock scheduling, H-21 to H-23, H-22
  trace scheduling, H-19 to H-21, H-20
Global common subexpression elimination, compiler structure, A-26
Global data area, and compiler technology, A-27
Global Environment for Network Innovation (GENI), F-98
Global load/store, definition, 309
Global Memory
  definition, 292, 314
  GPU programming, 290
  locks via coherence, 390
Global miss rate
  definition, B-31
  multilevel caches, B-33
Global optimizations
  compilers, A-26, A-29
  optimization types, A-28
Global Positioning System, CDMA, E-25
Global predictors
  Intel Core i7, 166
  tournament predictors, 164–166
Global scheduling, ILP, VLIW processor, 194
Global system for mobile communication (GSM), cell phones, E-25
Goldschmidt’s division algorithm, J-29, J-61
Goldstine, Herman, L-2 to L-3
Google
  Bigtable, 438, 441
  cloud computing, 455
  cluster history, L-62
  containers, L-74
  MapReduce, 437, 458–459, 459
  server CPUs, 440
  server power-performance benchmarks, 439–441
  WSCs, 432, 449
    containers, 464–465, 465
    cooling and power, 465–468
    monitoring and repairing, 469–470
    PUE, 468
    servers, 467, 468–469
Google App Engine, L-74
Google Clusters
  memory dependability, 104
  power consumption, F-85
Google File System (GFS)
  MapReduce, 438
  WSC storage, 442–443
Google Goggles
  PMDs, 6
  user experience, 4
Google search
  shared-memory workloads, 369
  workload demands, 439
Gordon Bell Prize, L-57
GPGPU (General-Purpose Computing on GPUs), L-51 to L-52
GPRs, see General-purpose registers (GPRs)
GPU (Graphics Processing Unit)
  banked and graphics memory, 322–323
  computing history, L-52
  definition, 9
  DLP
    basic considerations, 288
    basic PTX thread instructions, 299
    conditional branching, 300–303
    coprocessor relationship, 330–331
    definitions, 309
    Fermi GPU architecture innovations, 305–308
    Fermi GTX 480 floorplan, 295
    GPUs vs. vector architectures, 308–312, 310
    mapping examples, 293
    Multimedia SIMD comparison, 312
    multithreaded SIMD Processor block diagram, 294
    NVIDIA computational structures, 291–297
    NVIDIA/CUDA and AMD terminology, 313–315
    NVIDIA GPU ISA, 298–300
    NVIDIA GPU Memory structures, 304, 304–305
    programming, 288–291
    SIMD thread scheduling, 297
    terminology, 292
  fine-grained multithreading, 224
  future features, 332
  gather/scatter operations, 280
  historical background, L-50
  loop-level parallelism, 150
  vs. MIMD with Multimedia SIMD, 324–330
  mobile client/server features, 324, 324
  power/DLP issues, 322
  raw/relative performance, 328
  Roofline model, 326
  scalable, L-50 to L-51
  strided access-TLB interactions, 323
  thread count and memory performance, 332
  TLP, 346
  vector kernel implementation, 334–336
  vs. vector processor operation, 276
GPU Memory
  caches, 306
  CUDA program, 289
  definition, 292, 309, 314
  future architectures, 333
  GPU programming, 288
  NVIDIA, 304, 304–305
  splitting from main memory, 330
Gradual underflow, J-15, J-36
Grain size
  MIMD, 10
  TLP, 346
Grant phase, arbitration, F-49
Graph coloring, register allocation, A-26 to A-27
Graphics double data rate (GDDR)
  characteristics, 102
  Fermi GTX 480 GPU, 295, 324
Graphics dynamic random-access memory (GDRAM)
  bandwidth issues, 322–323
  characteristics, 102
Graphics-intensive benchmarks, desktop performance, 38
Graphics pipelines, historical background, L-51
Graphics Processing Unit, see GPU (Graphics Processing Unit)
Graphics synchronous dynamic random-access memory (GSDRAM), characteristics, 102
Graphics Synthesizer, Sony PlayStation 2, E-16, E-16 to E-17
Greater than condition code, PowerPC, K-10 to K-11
Greatest common divisor (GCD) test, loop-level parallelism dependences, 319, H-7
Grid
  arithmetic intensity, 286
  CUDA parallelism, 290
  definition, 292, 309, 313
  and GPU, 291
  GPU Memory structures, 304
  GPU terms, 308
  mapping example, 293
  NVIDIA GPU computational structures, 291
  SIMD Processors, 295
  Thread Blocks, 295
Grid computing, L-73 to L-74
Grid topology
  characteristics, F-36
  direct networks, F-37
GSDRAM, see Graphics synchronous dynamic random-access memory (GSDRAM)
GSM, see Global system for mobile communication (GSM)
Guest, definition, 108
Guest domains, Xen VM, 111

H
Hadoop, WSC batch processing, 437
Half adders, J-2
Half words
  aligned/misaligned addresses, A-8
  memory address interpretation, A-7 to A-8
  MIPS data types, A-34
  operand sizes/types, 12
  as operand type, A-13 to A-14
Handshaking, interconnection networks, F-10
Hard drive, power consumption, 63
Hard real-time systems, definition, E-3 to E-4
Hardware
  as architecture component, 15
  cache optimization, 96
  compiler scheduling support, L-30 to L-31
  compiler speculation support
    memory references, H-32
    overview, H-27
    preserving exception behavior, H-28 to H-32
  description notation, K-25
  energy/performance fallacies, 56
  for exposing parallelism, H-23 to H-27
  ILP approaches, 148, 214–215
  interconnection networks, F-9
  pipeline hazard detection, C-38
  Virtual Machines protection, 108
  WSC cost-performance, 474
  WSC running service, 434–435
Hardware-based speculation
  basic algorithm, 191
  data flow execution, 184
  FP unit using Tomasulo’s algorithm, 185
  ILP
    data flow execution, 184
    with dynamic scheduling and multiple issue, 197–202
    FP unit using Tomasulo’s algorithm, 185
    key ideas, 183–184
    multiple-issue processors, 198
    reorder buffer, 184–192
    vs. software speculation, 221–222
  key ideas, 183–184
Hardware faults, storage systems, D-11
Hardware prefetching
  cache optimization, 131–133
  miss penalty/rate reduction, 91–92
  NVIDIA GPU Memory structures, 305
  SPEC benchmarks, 92
Hardware primitives
  basic types, 387–389
  large-scale multiprocessor synchronization, I-18 to I-21
  synchronization mechanisms, 387–389
Harvard architecture, L-4
Hazards, see also Data hazards
  branch hazards, C-21 to C-26, C-39 to C-42, C-42
  control hazards, 235, C-11
  detection, hardware, C-38
  dynamically scheduled pipelines, C-70 to C-71
  execution sequences, C-80
  functional hazards, 233, 247–254
  instruction set complications, C-50
  longer latency pipelines, C-54 to C-58
  structural hazards, 268–269, C-11, C-13 to C-16, C-71, C-78 to C-79
HCAs, see Host channel adapters (HCAs)
Header
  messages, F-6
  packet format, F-7
switch microarchitecture pipelining, F-60
TCP/IP, F-84Head-of-line (HOL) blocking
congestion management, F-64switch microarchitecture, F-58 to
F-59, F-59, F-60, F-62system area network history, F-101virtual channels and throughput,
F-93Heap, and compiler technology, A-27
to A-28HEP processor, L-34Heterogeneous architecture,
definition, 262Hewlett-Packard AlphaServer,
F-100Hewlett-Packard PA-RISC
addressing modes, K-5arithmetic/logical instructions,
K-11characteristics, K-4conditional branches, K-12, K-17,
K-34constant extension, K-9conventions, K-13data transfer instructions, K-10EPIC, L-32features, K-44floating-point precisions, J-33FP instructions, K-23MIPS core extensions, K-23multimedia support, K-18, K-18,
K-19unique instructions, K-33 to K-36
Hewlett-Packard PA-RISC MAX2, multimedia support, E-11
Hewlett-Packard Precision Architecture, integer arithmetic, J-12
Hewlett-Packard ProLiant BL10e G2 Blade server, F-85
Hewlett-Packard ProLiant SL2x170z G6, SPECPower benchmarks, 463
Hewlett-Packard RISC microprocessors, vector processor history, G-26
Higher-radix division, J-54 to J-55Higher-radix multiplication, integer,
J-48
High-level language computer architecture (HLLCA), L-18 to L-19
High-level optimizations, compilers, A-26
Highly parallel memory systems, case studies, 133–136
High-order functions, control flow instruction addressing modes, A-18
High-performance computing (HPC)
  InfiniBand, F-74
  interconnection network characteristics, F-20
  interconnection network topology, F-44
  storage area network history, F-102
  switch microarchitecture, F-56
  vector processor history, G-27
  write strategy, B-10
  vs. WSCs, 432, 435–436
Hillis, Danny, L-58, L-74
Histogram, D-26 to D-27
History file, precise exceptions, C-59
Hitachi S810, L-45, L-47
Hitachi SuperH
  addressing modes, K-5, K-6
  arithmetic/logical instructions, K-24
  branches, K-21
  characteristics, K-4
  condition codes, K-14
  data transfer instructions, K-23
  embedded instruction format, K-8
  multiply-accumulate, K-20
  unique instructions, K-38 to K-39
Hit time
  average memory access time, B-16 to B-17
  first-level caches, 79–80
  memory hierarchy basics, 77–78
  reduction, 78, B-36 to B-40
  way prediction, 81–82
HLLCA, see High-level language computer architecture (HLLCA)
HOL, see Head-of-line blocking (HOL)
Home node, directory-based cache coherence protocol basics, 382
Hop count, definition, F-30
Hops
  direct network topologies, F-38
  routing, F-44
  switched network topologies, F-40
  switching, F-50
Host channel adapters (HCAs)
  historical background, L-81
  switch vs. NIC, F-86
Host, definition, 108, 305
Hot swapping, fault tolerance, F-67
HPC, see High-performance computing (HPC)
HPC Challenge, vector processor history, G-28
HP-Compaq servers
  price-performance differences, 441
  SMT, 230
HPSm, L-29
Hypercube networks
  characteristics, F-36
  deadlock, F-47
  direct networks, F-37
  vs. direct networks, F-92
  NEWS communication, F-43
HyperTransport, F-63
  NetApp FAS6000 filer, D-42
Hypertransport, AMD Opteron cache coherence, 361
Hypervisor, characteristics, 108
I
IAS machine, L-3, L-5 to L-6
IBM
  Chipkill, 104
  cluster history, L-62, L-72
  computer history, L-5 to L-6
  early VM work, L-10
  magnetic storage, L-77 to L-78
  multiple-issue processor development, L-28
  RAID history, L-79 to L-80
IBM 360
  address space, B-58
  architecture, K-83 to K-84
  architecture flaws and success, K-81
  branch instructions, K-86
  characteristics, K-42
  computer architecture definition, L-17 to L-18
  instruction execution frequencies, K-89
  instruction operator categories, A-15
  instruction set, K-85 to K-88
  instruction set complications, C-49 to C-50
  integer/FP R-R operations, K-85
  I/O bus history, L-81
  memory hierarchy development, L-9 to L-10
  parallel processing debates, L-57
  protection and ISA, 112
  R-R instructions, K-86
  RS and SI format instructions, K-87
  RX format instructions, K-86 to K-87
  SS format instructions, K-85 to K-88
IBM 360/85, L-10 to L-11, L-27
IBM 360/91
  dynamic scheduling with Tomasulo’s algorithm, 170–171
  early computer arithmetic, J-63
  history, L-27
  speculation concept origins, L-29
IBM 370
  architecture, K-83 to K-84
  characteristics, K-42
  early computer arithmetic, J-63
  integer overflow, J-11
  protection and ISA, 112
  vector processor history, G-27
  Virtual Machines, 110
IBM 370/158, L-7
IBM 650, L-6
IBM 701, L-5 to L-6
IBM 702, L-5 to L-6
IBM 704, L-6, L-26
IBM 705, L-6
IBM 801, L-19
IBM 3081, L-61
IBM 3090 Vector Facility, vector processor history, G-27
IBM 3840 cartridge, L-77
IBM 7030, L-26
IBM 9840 cartridge, L-77
IBM AS/400, L-79
IBM Blue Gene/L, F-4
  adaptive routing, F-93
  cluster history, L-63
  commercial interconnection networks, F-63
  computing node, I-42 to I-44, I-43
  as custom cluster, I-41 to I-42
  deterministic vs. adaptive routing, F-52 to F-55
  fault tolerance, F-66 to F-67
  link bandwidth, F-89
  low-dimensional topologies, F-100
  parallel processing debates, L-58
  software overhead, F-91
  switch microarchitecture, F-62
  system, I-44
  system area network history, F-101 to F-102
  3D torus network, F-72 to F-74
  topology, F-30, F-39
IBM CodePack, RISC code size, A-23
IBM CoreConnect
  cross-company interoperability, F-64
  OCNs, F-3
IBM eServer p5 processor
  performance/cost benchmarks, 409
  SMT and ST performance, 399
  speedup benchmarks, 408, 408–409
IBM Federation network interfaces, F-17 to F-18
IBM J9 JVM
  real-world server considerations, 52–55
  WSC performance, 463
IBM PCs, architecture flaws vs. success, A-45
IBM Power processors
  branch-prediction buffers, C-29
  characteristics, 247
  exception stopping/restarting, C-47
  MIPS precise exceptions, C-59
  shared-memory multiprogramming workload, 378
IBM Power 1, L-29
IBM Power 2, L-29
IBM Power 4
  multithreading history, L-35
  peak performance, 58
  recent advances, L-33 to L-34
IBM Power 5
  characteristics, F-73
  Itanium 2 comparison, H-43
  manufacturing cost, 62
  multiprocessing/multithreading-based performance, 398–400
  multithreading history, L-35
IBM Power 7
  vs. Google WSC, 436
  ideal processors, 214–215
  multicore processor performance, 400–401
  multithreading, 225
IBM Pulsar processor, L-34
IBM RP3, L-60
IBM RS/6000, L-57
IBM RT-PC, L-20
IBM SAGE, L-81
IBM servers, economies of scale, 456
IBM Stretch, L-6
IBM zSeries, vector processor history, G-27
IC, see Instruction count (IC)
I-caches
  case study examples, B-63
  way prediction, 81–82
ICR, see Idle Control Register (ICR)
ID, see Instruction decode (ID)
Ideal pipeline cycles per instruction, ILP concepts, 149
Ideal processors, ILP hardware model, 214–215, 219–220
IDE disks, Berkeley’s Tertiary Disk project, D-12
Idle Control Register (ICR), TI TMS320C55 DSP, E-8
Idle domains, TI TMS320C55 DSP, E-8
IEEE 754 floating-point standard, J-16
IEEE 1394, Sony PlayStation 2 Emotion Engine case study, E-15
IEEE arithmetic
  floating point, J-13 to J-14
    addition, J-21 to J-25
    exceptions, J-34 to J-35
    remainder, J-31 to J-32
    underflow, J-36
  historical background, J-63 to J-64
  iterative division, J-30
  –x vs. 0 –x, J-62
  NaN, J-14
  rounding modes, J-20
  single-precision numbers, J-15 to J-16
IEEE standard 802.3 (Ethernet), F-77 to F-79
  LAN history, F-99
IF, see Instruction fetch (IF) cycle
IF statement handling
  control dependences, 154
  GPU conditional branching, 300, 302–303
  memory consistency, 392
  vectorization in code, 271
  vector-mask registers, 267, 275–276
Illiac IV, F-100, L-43, L-55
ILP, see Instruction-level parallelism (ILP)
Immediate addressing mode
  ALU operations, A-12
  basic considerations, A-10 to A-11
  MIPS, 12
  MIPS instruction format, A-35
  MIPS operations, A-37
  value distribution, A-13
IMPACT, L-31
Implicit operands, ISA classifications, A-3
Implicit unit stride, GPUs vs. vector architectures, 310
Imprecise exceptions
  data hazards, 169–170
  floating-point, 188
IMT-2000, see International Mobile Telephony 2000 (IMT-2000)
Inactive power modes, WSCs, 472
Inclusion
  cache hierarchy, 397–398
  implementation, 397–398
  invalidate protocols, 357
  memory hierarchy history, L-11
Indexed addressing
  Intel 80x86, K-49, K-58
  VAX, K-67
Indexes
  address translation during, B-36 to B-40
  AMD Opteron data cache, B-13 to B-14
  ARM Cortex-A8, 115
  recurrences, H-12
  size equations, B-22
Index field, block identification, B-8
Index vector, gather/scatter operations, 279–280
Indirect addressing, VAX, K-67
Indirect networks, definition, F-31
Inexact exception
  floating-point arithmetic, J-35
  floating-point underflow, J-36
InfiniBand, F-64, F-67, F-74 to F-77
  cluster history, L-63
  packet format, F-75
  storage area network history, F-102
  switch vs. NIC, F-86
  system area network history, F-101
Infinite population model, queuing model, D-30
In flight instructions, ILP hardware model, 214
Information tables, examples, 176–177
Infrastructure costs
  WSC, 446–450, 452–455, 453
  WSC efficiency, 450–452
Initiation interval, MIPS pipeline FP operations, C-52 to C-53
Initiation rate
  floating-point pipeline, C-65 to C-66
  memory banks, 276–277
  vector execution time, 269
Inktomi, L-62, L-73
In-order commit
  hardware-based speculation, 188–189
  speculation concept origins, L-29
In-order execution
  average memory access time, B-17 to B-18
  cache behavior calculations, B-18
  cache miss, B-2 to B-3
  dynamic scheduling, 168–169
  IBM Power processors, 247
  ILP exploitation, 193–194
  multiple-issue processors, 194
  superscalar processors, 193
In-order floating-point pipeline, dynamic scheduling, 169
In-order issue
  ARM Cortex-A8, 233
  dynamic scheduling, 168–170, C-71
  ISA, 241
In-order scalar processors, VMIPS, 267
Input buffered switch
  HOL blocking, F-59, F-60
  microarchitecture, F-57, F-57
  pipelined version, F-61
Input-output buffered switch, microarchitecture, F-57
Instruction cache
  AMD Opteron example, B-15
  antialiasing, B-38
  application/OS misses, B-59
  branch prediction, C-28
  commercial workload, 373
  GPU Memory, 306
  instruction fetch, 202–203, 237
  ISA, 241
  MIPS R4000 pipeline, C-63
  miss rates, 161
  multiprogramming workload, 374–375
  prefetch, 236
  RISCs, A-23
  TI TMS320C55 DSP, E-8
Instruction commit
  hardware-based speculation, 184–185, 187–188, 188, 190
  instruction set complications, C-49
  Intel Core i7, 237
  speculation support, 208–209
Instruction count (IC)
  addressing modes, A-10
  cache performance, B-4, B-16
  compiler optimization, A-29, A-29 to A-30
  processor performance time, 49–51
  RISC history, L-22
Instruction decode (ID)
  basic MIPS pipeline, C-36
  branch hazards, C-21
  data hazards, 169
  hazards and forwarding, C-55 to C-57
  MIPS pipeline, C-71
  MIPS pipeline control, C-36 to C-39
  MIPS pipeline FP operations, C-53
  MIPS scoreboarding, C-72 to C-74
  out-of-order execution, 170
  pipeline branch issues, C-39 to C-41, C-42
  RISC classic pipeline, C-7 to C-8, C-10
  simple MIPS implementation, C-31
  simple RISC implementation, C-5 to C-6
Instruction delivery stage, Itanium 2, H-42
Instruction fetch (IF) cycle
  basic MIPS pipeline, C-35 to C-36
  branch hazards, C-21
  branch-prediction buffers, C-28
  exception stopping/restarting, C-46 to C-47
  MIPS exceptions, C-48
  MIPS R4000, C-63
  pipeline branch issues, C-42
  RISC classic pipeline, C-7, C-10
  simple MIPS implementation, C-31
  simple RISC implementation, C-5
Instruction fetch units
  integrated, 207–208
  Intel Core i7, 237
Instruction formats
  ARM-unique, K-36 to K-37
  high-level language computer architecture, L-18
  IA-64 ISA, H-34 to H-35, H-38, H-39
  IBM 360, K-85 to K-88
  Intel 80x86, K-49, K-52, K-56 to K-57
  M32R-unique, K-39 to K-40
  MIPS16-unique, K-40 to K-42
  PA-RISC unique, K-33 to K-36
  PowerPC-unique, K-32 to K-33
  RISCs, K-43
    Alpha-unique, K-27 to K-29
    arithmetic/logical, K-11, K-15
    branches, K-25
    control instructions, K-12, K-16
    data transfers, K-10, K-14, K-21
    desktop/server, K-7
    desktop/server systems, K-7
    embedded DSP extensions, K-19
    embedded systems, K-8
    FP instructions, K-13
    hardware description notation, K-25
    MIPS64-unique, K-24 to K-27
    MIPS core, K-6 to K-9
    MIPS core extensions, K-19 to K-24
    MIPS unaligned word reads, K-26
    multimedia extensions, K-16 to K-19
    overview, K-5 to K-6
    SPARC-unique, K-29 to K-32
    SuperH-unique, K-38 to K-39
    Thumb-unique, K-37 to K-38
Instruction groups, IA-64, H-34
Instruction issue
  definition, C-36
  DLP, 322
  dynamic scheduling, 168–169, C-71 to C-72
  ILP, 197, 216–217
  instruction-level parallelism, 2
  Intel Core i7, 238
  Itanium 2, H-41 to H-43
  MIPS pipeline, C-52
  multiple issue processor, 198
  multithreading, 223, 226
  parallelism measurement, 215
  precise exceptions, C-58, C-60
  processor comparison, 323
  ROB, 186
  speculation support, 208, 210
  Tomasulo’s scheme, 175, 182
Instruction-level parallelism (ILP)
  ARM Cortex-A8, 233–236, 235–236
  basic concepts/challenges, 148–149, 149
  “big and dumb” processors, 245
  branch-prediction buffers, C-29, C-29 to C-30
  compiler scheduling, L-31
  compiler techniques for exposure, 156–162
  control dependence, 154–156
  data dependences, 150–152
  data flow limit, L-33
  definition, 9, 149–150
  dynamic scheduling
    basic concept, 168–169
    definition, 168
    example and algorithms, 176–178
    multiple issue, speculation, 197–202
    overcoming data hazards, 167–176
    Tomasulo’s algorithm, 170–176, 178–179, 181–183
  early studies, L-32 to L-33
  exploitation methods, H-22 to H-23
  exploitation statically, H-2
  exposing with hardware support, H-23
  GPU programming, 289
  hardware-based speculation, 183–192
  hardware vs. software speculation, 221–222
  IA-64, H-32
  instruction fetch bandwidth
    basic considerations, 202–203
    branch-target buffers, 203–206, 204
    integrated units, 207–208
    return address predictors, 206–207
  Intel Core i7, 236–241
  limitation studies, 213–221
  microarchitectural techniques case study, 247–254
  MIPS scoreboarding, C-77 to C-79
  multicore performance/energy efficiency, 404
  multicore processor performance, 400
  multiple-issue processors, L-30
  multiple issue/static scheduling, 192–196
  multiprocessor importance, 344
  multithreading, basic considerations, 223–226
  multithreading history, L-34 to L-35
  name dependences, 152–153
  perfect processor, 215
  pipeline scheduling/loop unrolling, 157–162
  processor clock rates, 244
  realizable processor limitations, 216–218
  RISC development, 2
  SMT on superscalar processors, 230–232
  speculation advantages/disadvantages, 210–211
  speculation and energy efficiency, 211–212
  speculation support, 208–210
  speculation through multiple branches, 211
  speculative execution, 222–223
  Sun T1 fine-grained multithreading effectiveness, 226–229
  switch to DLP/TLP/RLP, 4–5
  TI 320C6x DSP, E-8
  value prediction, 212–213
Instruction path length, processor performance time, 49
Instruction prefetch
  integrated instruction fetch units, 208
  miss penalty/rate reduction, 91–92
  SPEC benchmarks, 92
Instruction register (IR)
  basic MIPS pipeline, C-35
  dynamic scheduling, 170
  MIPS implementation, C-31
Instruction set architecture (ISA), see also Intel 80x86 processors; Reduced Instruction Set Computer (RISC)
  addressing modes, A-9 to A-10
  architect-compiler writer relationship, A-29 to A-30
  ARM Cortex-A8, 114
  case studies, A-47 to A-54
  class code sequence example, A-4
  classification, A-3 to A-7
  code size-compiler considerations, A-43 to A-44
  compiler optimization and performance, A-27
  compiler register allocation, A-26 to A-27
  compiler structure, A-24 to A-26
  compiler technology and architecture decisions, A-27 to A-29
  compiler types and classes, A-28
  complications, C-49 to C-51
  computer architecture definition, L-17 to L-18
  control flow instructions
    addressing modes, A-17 to A-18
    basic considerations, A-16 to A-17, A-20 to A-21
    conditional branch options, A-19
    procedure invocation options, A-19 to A-20
  Cray X1, G-21 to G-22
  data access distribution example, A-15
  definition and types, 11–15
  displacement addressing mode, A-10
  encoding considerations, A-21 to A-24, A-22, A-24
  first vector computers, L-48
  flawless design, A-45
  flaws vs. success, A-44 to A-45
  GPR advantages/disadvantages, A-6
  high-level considerations, A-39, A-41 to A-43
  high-level language computer architecture, L-18 to L-19
  IA-64
    instruction formats, H-39
    instructions, H-35 to H-37
    instruction set basics, H-38
    overview, H-32 to H-33
    predication and speculation, H-38 to H-40
  IBM 360, K-85 to K-88
  immediate addressing mode, A-10 to A-11
  literal addressing mode, A-10 to A-11
  memory addressing, A-11 to A-13
  memory address interpretation, A-7 to A-8
  MIPS
    addressing modes for data transfer, A-34
    basic considerations, A-32 to A-33
    control flow instructions, A-37 to A-38
    data types, A-34
    dynamic instruction mix, A-41 to A-42, A-42
    FP operations, A-38 to A-39
    instruction format, A-35
    MIPS operations, A-35 to A-37
    registers, A-34
    usage, A-39
  MIPS64, 14, A-40
  multimedia instruction compiler support, A-31 to A-32
  NVIDIA GPU, 298–300
  operand locations, A-4
  operands per ALU instruction, A-6
  operand type and size, A-13 to A-14
  operations, A-14 to A-16
  operator categories, A-15
  overview, K-2
  performance and efficiency prediction, 241–243
  and protection, 112
  RISC code size, A-23 to A-24
  RISC history, L-19 to L-22, L-21
  stack architectures, L-16 to L-17
  top 80x86 instructions, A-16
  “typical” program fallacy, A-43
  Virtual Machines protection, 107–108
  Virtual Machines support, 109–110
  VMIPS, 264–265
  VMM implementation, 128–129
Instructions per clock (IPC)
  ARM Cortex-A8, 236
  flawless architecture design, A-45
  ILP for realizable processors, 216–218
  MIPS scoreboarding, C-72
  multiprocessing/multithreading-based performance, 398–400
  processor performance time, 49
  Sun T1 multithreading unicore performance, 229
  Sun T1 processor, 399
Instruction status
  dynamic scheduling, 177
  MIPS scoreboarding, C-75
Integer arithmetic
  addition speedup
    carry-lookahead, J-37 to J-41
    carry-lookahead circuit, J-38
    carry-lookahead tree, J-40
    carry-lookahead tree adder, J-41
    carry-select adder, J-43, J-43 to J-44, J-44
    carry-skip adder, J-41 to J-43, J-42
    overview, J-37
  division
    radix-2 division, J-55
    radix-4 division, J-56
    radix-4 SRT division, J-57
    with single adder, J-54 to J-58
  FP conversions, J-62
  language comparison, J-12
  multiplication
    array multiplier, J-50
    Booth recoding, J-49
    even/odd array, J-52
    with many adders, J-50 to J-54
    multipass array multiplier, J-51
    signed-digit addition table, J-54
    with single adder, J-47 to J-49, J-48
    Wallace tree, J-53
  multiplication/division, shifting over zeros, J-45 to J-47
  overflow, J-11
  radix-2 multiplication/division, J-4, J-4 to J-7
  restoring/nonrestoring division, J-6
  ripple-carry addition, J-2 to J-3, J-3
  signed numbers, J-7 to J-10
  SRT division, J-45 to J-47, J-46
  systems issues, J-10 to J-13
Integer operand
  flawed architecture, A-44
  GCD, 319
  graph coloring, A-27
  instruction set encoding, A-23
  MIPS data types, A-34
  as operand type, 12, A-13 to A-14
Integer operations
  addressing modes, A-11
  ALUs, A-12, C-54
  ARM Cortex-A8, 116, 232, 235, 236
  benchmarks, 167, C-69
  branches, A-18 to A-20, A-20
  cache misses, 83–84
  data access distribution, A-15
  data dependences, 151
  dependences, 322
  desktop benchmarks, 38–39
  displacement values, A-12
  exceptions, C-43, C-45
  hardware ILP model, 215
  hardware vs. software speculation, 221
  hazards, C-57
  IBM 360, K-85
  ILP, 197–200
  instruction set operations, A-16
  Intel Core i7, 238, 240
  Intel 80x86, K-50 to K-51
  ISA, 242, A-2
  Itanium 2, H-41
  longer latency pipelines, C-55
  MIPS, C-31 to C-32, C-36, C-49, C-51 to C-53
  MIPS64 ISA, 14
  MIPS FP pipeline, C-60
  MIPS R4000 pipeline, C-61, C-63, C-70
  misspeculation, 212
  MVL, 274
  pipeline scheduling, 157
  precise exceptions, C-47, C-58, C-60
  processor clock rate, 244
  R4000 pipeline, C-63
  realizable processor ILP, 216–218
  RISC, C-5, C-11
  scoreboarding, C-72 to C-73, C-76
  SIMD processor, 307
  SPARC, K-31
  SPEC benchmarks, 39
  speculation through multiple branches, 211
  static branch prediction, C-26 to C-27
  T1 multithreading unicore performance, 227–229
  Tomasulo’s algorithm, 181
  tournament predictors, 164
  VMIPS, 265
Integer registers
  hardware-based speculation, 192
  IA-64, H-33 to H-34
  MIPS dynamic instructions, A-41 to A-42
  MIPS floating-point operations, A-39
  MIPS64 architecture, A-34
  VLIW, 194
Integrated circuit basics
  cell phones, E-24, E-24
  cost trends, 28–32
  dependability, 33–36
  logic technology, 17
  microprocessor developments, 2
  power and energy, 21–23
  scaling, 19–21
Intel 80286, L-9
Intel Atom 230
  processor comparison, 242
  single-threaded benchmarks, 243
Intel Atom processors
  ISA performance and efficiency prediction, 241–243
  performance measurement, 405–406
  SMT, 231
  WSC memory, 474
  WSC processor cost-performance, 473
Intel Core i7
  vs. Alpha processors, 368
  architecture, 15
  basic function, 236–238
  “big and dumb” processors, 245
  branch predictor, 166–167
  clock rate, 244
  dynamic scheduling, 170
  GPU comparisons, 324–330, 325
  hardware prefetching, 91
  ISA performance and efficiency prediction, 241–243
  L2/L3 miss rates, 125
  memory hierarchy basics, 78, 117–124, 119
  memory hierarchy design, 73
  memory performance, 122–124
  MESIF protocol, 362
  microprocessor die example, 29
  miss rate benchmarks, 123
  multibanked caches, 86
  multithreading, 225
  nonblocking cache, 83
  performance, 239, 239–241, 240
  performance/energy efficiency, 401–405
  pipelined cache access, 82
  pipeline structure, 237
  processor comparison, 242
  raw/relative GPU performance, 328
  Roofline model, 286–288, 287
  single-threaded benchmarks, 243
  SMP limitations, 363
  SMT, 230–231
  snooping cache coherence implementation, 365
  three-level cache hierarchy, 118
  TLB structure, 118
  write invalid protocol, 356
Intel 80x86 processors
  address encoding, K-58
  addressing modes, K-58
  address space, B-58
  architecture flaws and success, K-81
  architecture flaws vs. success, A-44 to A-45
  Atom, 231
  cache performance, B-6
  characteristics, K-42
  common exceptions, C-44
  comparative operation measurements, K-62 to K-64
  floating-point operations, K-52 to K-55, K-54, K-61
  instruction formats, K-56 to K-57
  instruction lengths, K-60
  instruction mix, K-61 to K-62
  instructions vs. DLX, K-63 to K-64
  instruction set encoding, A-23, K-55
  instruction set usage measurements, K-56 to K-64
  instructions and functions, K-52
  instruction types, K-49
  integer operations, K-50 to K-51
  integer overflow, J-11
  Intel Core i7, 117
  ISA, 11–12, 14–15, A-2
  memory accesses, B-6
  memory addressing, A-8
  memory hierarchy development, L-9
  multimedia support, K-17
  operand addressing mode, K-59, K-59 to K-60
  operand type distribution, K-59
  overview, K-45 to K-47
  process protection, B-50
  vs. RISC, 2, A-3
  segmented scheme, K-50
  system evolution, K-48
  top instructions, A-16
  typical operations, K-53
  variable encoding, A-22 to A-23
  virtualization issues, 128
  Virtual Machines ISA support, 109
  Virtual Machines and virtual memory and I/O, 110
Intel 8087, floating point remainder, J-31
Intel i860, K-16 to K-17, L-49, L-60
Intel IA-32 architecture
  call gate, B-54
  descriptor table, B-52
  instruction set complications, C-49 to C-51
  OCNs, F-3, F-70
  segment descriptors, B-53
  segmented virtual memory, B-51 to B-54
Intel IA-64 architecture
  compiler scheduling history, L-31
  conditional instructions, H-27
  explicit parallelism, H-34 to H-35
  historical background, L-32
  ISA
    instruction formats, H-39
    instructions, H-35 to H-37
    instruction set basics, H-38
    overview, H-32 to H-33
    predication and speculation, H-38 to H-40
  Itanium 2 processor
    instruction latency, H-41
    overview, H-40 to H-41
    performance, H-43, H-43
  multiple issue processor approaches, 194
  parallelism exploitation statically, H-2
  register model, H-33 to H-34
  RISC history, L-22
  software pipelining, H-15
  synchronization history, L-64
Intel iPSC 860, L-60
Intel Itanium, sparse matrices, G-13
Intel Itanium 2
  “big and dumb” processors, 245
  clock rate, 244
  IA-64
    functional units and instruction issue, H-41 to H-43
    instruction latency, H-41
    overview, H-40 to H-41
    performance, H-43
  peak performance, 58
  SPEC benchmarks, 43
Intelligent devices, historical background, L-80
Intel MMX, multimedia instruction compiler support, A-31 to A-32
Intel Nehalem
  characteristics, 411
  floorplan, 30
  WSC processor cost-performance, 473
Intel Paragon, F-100, L-60
Intel Pentium 4
  hardware prefetching, 92
  Itanium 2 comparison, H-43
  multithreading history, L-35
Intel Pentium 4 Extreme, L-33 to L-34
Intel Pentium II, L-33
Intel Pentium III
  pipelined cache access, 82
  power consumption, F-85
Intel Pentium M, power consumption, F-85
Intel Pentium MMX, multimedia support, E-11
Intel Pentium Pro, 82, L-33
Intel Pentium processors
  “big and dumb” processors, 245
  clock rate, 244
  early computer arithmetic, J-64 to J-65
  vs. Opteron memory protection, B-57
  pipelining performance, C-10
  segmented virtual memory example, B-51 to B-54
  SMT, 230
Intel processors
  early RISC designs, 2
  power consumption, F-85
Intel Single-Chip Cloud Computing (SCCC)
  as interconnection example, F-70 to F-72
  OCNs, F-3
Intel Streaming SIMD Extension (SSE)
  basic function, 283
  Multimedia SIMD Extensions, A-31
  vs. vector architectures, 282
Intel Teraflops processors, OCNs, F-3
Intel Thunder Tiger 4 QsNetII, F-63, F-76
Intel VT-x, 129
Intel x86
  Amazon Web Services, 456
  AVX instructions, 284
  clock rates, 244
  computer architecture, 15
  conditional instructions, H-27
  GPUs as coprocessors, 330–331
  Intel Core i7, 237–238
  Multimedia SIMD Extensions, 282–283
  NVIDIA GPU ISA, 298
  parallelism, 262–263
  performance and energy efficiency, 241
  vs. PTX, 298
  RISC, 2
  speedup via parallelism, 263
Intel Xeon
  Amazon Web Services, 457
  cache coherence, 361
  file system benchmarking, D-20
  InfiniBand, F-76
  multicore processor performance, 400–401
  performance, 400
  performance measurement, 405–406
  SMP limitations, 363
  SPECPower benchmarks, 463
  WSC processor cost-performance, 473
Interactive workloads, WSC goals/requirements, 433
Interarrival times, queuing model, D-30
Interconnection networks
  adaptive routing, F-93 to F-94
  adaptive routing and fault tolerance, F-94
  arbitration, F-49, F-49 to F-50
  basic characteristics, F-2, F-20
  bisection bandwidth, F-89
  commercial
    congestion management, F-64 to F-66
    connectivity, F-62 to F-63
    cross-company interoperability, F-63 to F-64
    DECstation 5000 reboots, F-69
    fault tolerance, F-66 to F-69
  commercial routing/arbitration/switching, F-56
  communication bandwidth, I-3
  compute-optimized processors vs. receiver overhead, F-88
  density- vs. SPEC-optimized processors, F-85
  device example, F-3
  direct vs. high-dimensional, F-92
  domains, F-3 to F-5, F-4
  Ethernet, F-77 to F-79, F-78
  Ethernet/ATM total time statistics, F-90
  examples, F-70
  HOL blocking, F-59
  IBM Blue Gene/L, I-43
  InfiniBand, F-75
  LAN history, F-99 to F-100
  link bandwidth, F-89
  memory hierarchy interface, F-87 to F-88
  mesh network routing, F-46
  MIN vs. direct network costs, F-92
  multicore single-chip multiprocessor, 364
  multi-device connections
    basic considerations, F-20 to F-21
    effective bandwidth vs. nodes, F-28
    latency vs. nodes, F-27
    performance characterization, F-25 to F-29
    shared-media networks, F-22 to F-24
    shared- vs. switched-media networks, F-22
    switched-media networks, F-24
    topology, routing, arbitration, switching, F-21 to F-22
  multi-device interconnections, shared- vs. switched-media networks, F-24 to F-25
  network-only features, F-94 to F-95
  NIC vs. I/O subsystems, F-90 to F-91
  OCN characteristics, F-73
  OCN example, F-70 to F-72
  OCN history, F-103 to F-104
  protection, F-86 to F-87
  routing, F-44 to F-48, F-54
  routing/arbitration/switching impact, F-52 to F-55
  SAN characteristics, F-76
  software overhead, F-91 to F-92
  speed considerations, F-88
  storage area networks, F-102 to F-103
  switching, F-50 to F-52
  switch microarchitecture, F-57
    basic microarchitecture, F-55 to F-58
    buffer organizations, F-58 to F-60
    pipelining, F-60 to F-61, F-61
  switch vs. NIC, F-85 to F-86, F-86
  system area networks, F-72 to F-74, F-100 to F-102
  system/storage area network, F-74 to F-77
  TCP/IP reliance, F-95
  top-level architecture, F-71
  topology, F-44
    basic considerations, F-29 to F-30
    Benes networks, F-33
    centralized switched networks, F-30 to F-34, F-31
    direct networks, F-37
    distributed switched networks, F-34 to F-40
    performance and costs, F-40
    performance effects, F-40 to F-44
    ring network, F-36
  two-device interconnections
    basic considerations, F-5 to F-6
    effective bandwidth vs. packet size, F-19
    example, F-6
    interface functions, F-6 to F-9
    performance, F-12 to F-20
    structure and functions, F-9 to F-12
  virtual channels and throughput, F-93
  WAN example, F-79
  WANs, F-97 to F-99
  wormhole switching performance, F-92 to F-93
  zero-copy protocols, F-91
Intermittent faults, storage systems, D-11
Internal fragmentation, virtual memory page size selection, B-47
Internal Mask Registers, definition, 309
International Computer Architecture Symposium (ISCA), L-11 to L-12
International Mobile Telephony 2000 (IMT-2000), cell phone standards, E-25
Internet
  Amazon Web Services, 457
  array switch, 443
  cloud computing, 455–456, 461
  data-intensive applications, 344
  dependability, 33
  Google WSC, 464
  Layer 3 network linkage, 445
  Netflix traffic, 460
  SaaS, 4
  WSC efficiency, 452
  WSC memory hierarchy, 445
  WSCs, 432–433, 435, 437, 439, 446, 453–455
Internet Archive Cluster
  container history, L-74 to L-75
  overview, D-37
  performance, dependability, cost, D-38 to D-40
  TB-80 cluster MTTF, D-40 to D-41
  TB-80 VME rack, D-38
Internet Protocol (IP)
  internetworking, F-83
  storage area network history, F-102
  WAN history, F-98
Internet Protocol (IP) cores, OCNs, F-3
Internet Protocol (IP) routers, VOQs, F-60
Internetworking
  connection example, F-80
  cost, F-80
  definition, F-2
  enabling technologies, F-80 to F-81
  OSI model layers, F-81, F-82
  protocol-level communication, F-81 to F-82
  protocol stack, F-83, F-83
  role, F-81
  TCP/IP, F-81, F-83 to F-84
  TCP/IP headers, F-84
Interprocedural analysis, basic approach, H-10
Interprocessor communication, large-scale multiprocessors, I-3 to I-6
Interrupt, see Exceptions
Invalidate protocol
  directory-based cache coherence protocol example, 382–383
  example, 359, 360
  implementation, 356–357
  snooping coherence, 355, 355–356
Invalid exception, floating-point arithmetic, J-35
Inverted page table, virtual memory block identification, B-44 to B-45
I/O bandwidth, definition, D-15
I/O benchmarks, response time restrictions, D-18
I/O bound workload, Virtual Machines protection, 108
I/O bus
  historical background, L-80 to L-81
  interconnection networks, F-88
  point-to-point replacement, D-34
  Sony PlayStation 2 Emotion Engine case study, E-15
I/O cache coherency, basic considerations, 113
I/O devices
  address translation, B-38
  average memory access time, B-17
  cache coherence enforcement, 354
  centralized shared-memory multiprocessors, 351
  future GPU features, 332
  historical background, L-80 to L-81
  inclusion, B-34
  Multimedia SIMD vs. GPUs, 312
  multiprocessor cost effectiveness, 407
  performance, D-15 to D-16
  SANs, F-3 to F-4
  shared-media networks, F-23
  switched networks, F-2
  switch vs. NIC, F-86
  Virtual Machines impact, 110–111
  write strategy, B-11
  Xen VM, 111
I/O interfaces
  disk storage, D-4
  storage area network history, F-102
I/O latency, shared-memory workloads, 368–369, 371
I/O network, commercial interconnection network connectivity, F-63
IOP, see I/O processor (IOP)
I/O processor (IOP)
  first dynamic scheduling, L-27
  Sony PlayStation 2 Emotion Engine case study, E-15
I/O registers, write buffer merging, 87
I/O subsystems
  design, D-59 to D-61
  interconnection network speed, F-88
  vs. NIC, F-90 to F-91
  zero-copy protocols, F-91
I/O systems
  asynchronous, D-35
  as black box, D-23
  dirty bits, D-61 to D-64
  Internet Archive Cluster, see Internet Archive Cluster
  multithreading history, L-34
  queuing theory, D-23
  queue calculations, D-29
  random variable distribution, D-26
  utilization calculations, D-26
IP, see Intellectual Property (IP) cores; Internet Protocol (IP)
IPC, see Instructions per clock (IPC)
IPoIB, F-77
IR, see Instruction register (IR)
ISA, see Instruction set architecture (ISA)
ISCA, see International Computer Architecture Symposium (ISCA)
iSCSI
  NetApp FAS6000 filer, D-42
  storage area network history, F-102
Issue logic
  ARM Cortex-A8, 233
  ILP, 197
  longer latency pipelines, C-57
  multiple issue processor, 198
  register renaming vs. ROB, 210
  speculation support, 210
Issue stage
  ID pipe stage, 170
  instruction steps, 174
  MIPS with scoreboard, C-73 to C-74
  out-of-order execution, C-71
  ROB instruction, 186
Iterative division, floating point, J-27 to J-31
J
Java benchmarks
  Intel Core i7, 401–405
  SMT on superscalar processors, 230–232
  without SMT, 403–404
Java language
  dependence analysis, H-10
  hardware impact on software development, 4
  return address predictors, 206
  SMT, 230–232, 402–405
  SPECjbb, 40
  SPECpower, 52
  virtual functions/methods, A-18
Java Virtual Machine (JVM)
  early stack architectures, L-17
  IBM, 463
  multicore processor performance, 400
  multithreading-based speedup, 232
  SPECjbb, 53
JBOD, see RAID 0
Johnson, Reynold B., L-77
Jump prediction
  hardware model, 214
  ideal processor, 214
Jumps
  control flow instructions, 14, A-16, A-17, A-21
  GPU conditional branching, 301–302
  MIPS control flow instructions, A-37 to A-38
  MIPS operations, A-35
  return address predictors, 206
  RISC instruction set, C-5
  VAX, K-71 to K-72
Just-in-time (JIT), L-17
JVM, see Java Virtual Machine (JVM)
K
Kahle, Brewster, L-74
Kahn, Robert, F-97
k-ary n-cubes, definition, F-38
Kendall Square Research KSR-1, L-61
Kernels
  arithmetic intensity, 286, 286–287, 327
  benchmarks, 56
  bytes per reference, vs. block size, 378
  caches, 329
  commercial workload, 369–370
  compilers, A-24
  compute bandwidth, 328
  via computing, 327
  EEMBC benchmarks, 38, E-12
  FFT, I-7
  FORTRAN, compiler vectorization, G-15
  FP benchmarks, C-29
  Livermore Fortran kernels, 331
  LU, I-8
  multimedia instructions, A-31
  multiprocessor architecture, 408
  multiprogramming workload, 375–378, 377
  performance benchmarks, 37, 331
  primitives, A-30
  protecting processes, B-50
  segmented virtual memory, B-51
  SIMD exploitation, 330
  vector, on vector processor and GPU, 334–336
  virtual memory protection, 106
  WSCs, 438
L
L1 caches, see also First-level caches
  address translation, B-46
  Alpha 21164 hierarchy, 368
  ARM Cortex-A8, 116, 116, 235
  ARM Cortex-A8 vs. A9, 236
  ARM Cortex-A8 example, 117
  cache optimization, B-31 to B-33
  case study examples, B-60, B-63 to B-64
  directory-based coherence, 418
  Fermi GPU, 306
  hardware prefetching, 91
  hit time/power reduction, 79–80
  inclusion, 397–398, B-34 to B-35
  Intel Core i7, 118–119, 121–122, 123, 124, 124, 239, 241
  invalidate protocol, 355, 356–357
  memory consistency, 392
  memory hierarchy, B-39
  miss rates, 376–377
  multiprocessor cache coherence, 352
  multiprogramming workload, 374
  nonblocking cache, 85
  NVIDIA GPU Memory, 304
  Opteron memory, B-57
  processor comparison, 242
  speculative execution, 223
  T1 multithreading unicore performance, 228
  virtual memory, B-48 to B-49
L2 caches, see also Second-level caches
  ARM Cortex-A8, 114, 115–116, 235–236
  ARM Cortex-A8 example, 117
  cache optimization, B-31 to B-33, B-34
  case study example, B-63 to B-64
  coherency, 352
  commercial workloads, 373
  directory-based coherence, 379, 418–420, 422, 424
  fault detection, 58
  Fermi GPU, 296, 306, 308
  hardware prefetching, 91
  IBM Blue Gene/L, I-42
  inclusion, 397–398, B-35
  Intel Core i7, 118, 120–122, 124, 124–125, 239, 241
  invalidation protocol, 355, 356–357
  and ISA, 241
  memory consistency, 392
  memory hierarchy, B-39, B-48, B-57
Index ■ I-41
L2 caches (continued)
  multithreading, 225, 228
  nonblocking cache, 85
  NVIDIA GPU Memory, 304
  processor comparison, 242
  snooping coherence, 359–361
  speculation, 223
L3 caches, see also Third-level caches
  Alpha 21164 hierarchy, 368
  coherence, 352
  commercial workloads, 370, 371, 374
  directory-based coherence, 379, 384
  IBM Blue Gene/L, I-42
  IBM Power processors, 247
  inclusion, 398
  Intel Core i7, 118, 121, 124, 124–125, 239, 241, 403–404
  invalidation protocol, 355, 356–357, 360
  memory access cycle shift, 372
  miss rates, 373
  multicore processors, 400–401
  multithreading, 225
  nonblocking cache, 83
  performance/price/power considerations, 52
  snooping coherence, 359, 361, 363
LabVIEW, embedded benchmarks, E-13
Lampson, Butler, F-99
Lanes
  GPUs vs. vector architectures, 310
  Sequence of SIMD Lane Operations, 292, 313
  SIMD Lane Registers, 309, 314
  SIMD Lanes, 296–297, 297, 302–303, 308, 309, 311–312, 314
  vector execution time, 269
  vector instruction set, 271–273
  Vector Lane Registers, 292
  Vector Lanes, 292, 296–297, 309, 311
LANs, see Local area networks (LANs)
Large-scale multiprocessors
  cache coherence implementation
    deadlock and buffering, I-38 to I-40
    directory controller, I-40 to I-41
    DSM multiprocessor, I-36 to I-37
    overview, I-34 to I-36
  classification, I-45
  cluster history, L-62 to L-63
  historical background, L-60 to L-61
  IBM Blue Gene/L, I-41 to I-44, I-43 to I-44
  interprocessor communication, I-3 to I-6
  for parallel programming, I-2
  scientific application performance
    distributed-memory multiprocessors, I-26 to I-32, I-28 to I-32
    parallel processors, I-33 to I-34
    symmetric shared-memory multiprocessor, I-21 to I-26, I-23 to I-25
  scientific applications, I-6 to I-12
  space and relation of classes, I-46
  synchronization mechanisms, I-17 to I-21
  synchronization performance, I-12 to I-16
Latency, see also Response time
  advanced directory protocol case study, 425
  vs. bandwidth, 18–19, 19
  barrier synchronization, I-16
  and cache miss, B-2 to B-3
  cluster history, L-73
  communication mechanism, I-3 to I-4
  definition, D-15
  deterministic vs. adaptive routing, F-52 to F-55
  directory coherence, 425
  distributed-memory multiprocessors, I-30, I-32
  dynamically scheduled pipelines, C-70 to C-71
  Flash memory, D-3
  FP operations, 157
  FP pipeline, C-66
  functional units, C-53
  GPU SIMD instructions, 296
  GPUs vs. vector architectures, 311
  hazards and forwarding, C-54 to C-58
  hiding with speculation, 396–397
  ILP exposure, 157
  ILP without multithreading, 225
  ILP for realizable processors, 216–218
  Intel SCCC, F-70
  interconnection networks, F-12 to F-20
    multi-device networks, F-25 to F-29
  Itanium 2 instructions, H-41
  microarchitectural techniques case study, 247–254
  MIPS pipeline FP operations, C-52 to C-53
  misses, single vs. multiple thread executions, 228
  multimedia instruction compiler support, A-31
  NVIDIA GPU Memory structures, 305
  OCNs vs. SANs, F-27
  out-of-order processors, B-20 to B-21
  packets, F-13, F-14
  parallel processing, 350
  performance milestones, 20
  pipeline, C-87
  ROB commit, 187
  routing, F-50
  routing/arbitration/switching impact, F-52
  routing comparison, F-54
  SAN example, F-73
  shared-memory workloads, 368
  snooping coherence, 414
  Sony PlayStation 2 Emotion Engine, E-17
  Sun T1 multithreading, 226–229
  switched network topology, F-40 to F-41
  system area network history, F-101
  vs. TCP/IP reliance, F-95
  throughput vs. response time, D-17
  utility computing, L-74
  vector memory systems, G-9
  vector start-up, G-8
  WSC efficiency, 450–452
  WSC memory hierarchy, 443, 443–444, 444, 445
  WSC processor cost-performance, 472–473
  WSCs vs. datacenters, 456
Layer 3 network
  array and Internet linkage, 445
  WSC memory hierarchy, 445
LCA, see Least common ancestor (LCA)
LCD, see Liquid crystal display (LCD)
Learning curve, cost trends, 27
Least common ancestor (LCA), routing algorithms, F-48
Least recently used (LRU)
  AMD Opteron data cache, B-12, B-14
  block replacement, B-9
  memory hierarchy history, L-11
  virtual memory block replacement, B-45
Less than condition code, PowerPC, K-10 to K-11
Level 3, as Content Delivery Network, 460
Limit field, IA-32 descriptor table, B-52
Line, memory hierarchy basics, 74
Linear speedup
  cost effectiveness, 407
  IBM eServer p5 multiprocessor, 408
  multicore processors, 400, 402
  performance, 405–406
Line locking, embedded systems, E-4 to E-5
Link injection bandwidth
  calculation, F-17
  interconnection networks, F-89
Link pipelining, definition, F-16
Link reception bandwidth, calculation, F-17
Link register
  MIPS control flow instructions, A-37 to A-38
  PowerPC instructions, K-32 to K-33
  procedure invocation options, A-19
  synchronization, 389
Linpack benchmark
  cluster history, L-63
  parallel processing debates, L-58
  vector processor example, 267–268
  VMIPS performance, G-17 to G-19
Linux operating systems
  Amazon Web Services, 456–457
  architecture costs, 2
  protection and ISA, 112
  RAID benchmarks, D-22, D-22 to D-23
  WSC services, 441
Liquid crystal display (LCD), Sanyo VPC-SX500 digital camera, E-19
LISP
  RISC history, L-20
  SPARC instructions, K-30
Lisp
  ILP, 215
  as MapReduce inspiration, 437
Literal addressing mode, basic considerations, A-10 to A-11
Little Endian
  Intel 80x86, K-49
  interconnection networks, F-12
  memory address interpretation, A-7
  MIPS core extensions, K-20 to K-21
  MIPS data transfers, A-34
Little’s law
  definition, D-24 to D-25
  server utilization calculation, D-29
Livelock, network routing, F-44
Liveness, control dependence, 156
Livermore Fortran kernels, performance, 331, L-6
LMD, see Load memory data (LMD)
Load instructions
  control dependences, 155
  data hazards requiring stalls, C-20
  dynamic scheduling, 177
  ILP, 199, 201
  loop-level parallelism, 318
  memory port conflict, C-14
  pipelined cache access, 82
  RISC instruction set, C-4 to C-5
  Tomasulo’s algorithm, 182
  VLIW sample code, 252
Load interlocks
  definition, C-37 to C-39
  detection logic, C-39
Load linked
  locks via coherence, 391
  synchronization, 388–389
Load locked, synchronization, 388–389
Load memory data (LMD), simple MIPS implementation, C-32 to C-33
Load stalls, MIPS R4000 pipeline, C-67
Load-store instruction set architecture
  basic concept, C-4 to C-5
  IBM 360, K-87
  Intel Core i7, 124
  Intel 80x86 operations, K-62
  as ISA, 11
  ISA classification, A-5
  MIPS nonaligned data transfers, K-24, K-26
  MIPS operations, A-35 to A-36, A-36
  PowerPC, K-33
  RISC history, L-19
  simple MIPS implementation, C-32
  VMIPS, 265
Load/store unit
  Fermi GPU, 305
  ILP hardware model, 215
  multiple lanes, 273
  Tomasulo’s algorithm, 171–173, 182, 197
  vector units, 265, 276–277
Load upper immediate (LUI), MIPS operations, A-37
Local address space, segmented virtual memory, B-52
Local area networks (LANs)
  characteristics, F-4
  cross-company interoperability, F-64
  effective bandwidth, F-18
  Ethernet as, F-77 to F-79
  fault tolerance calculations, F-68
  historical overview, F-99 to F-100
  InfiniBand, F-74
  interconnection network domain relationship, F-4
  latency and effective bandwidth, F-26 to F-28
  offload engines, F-8
  packet latency, F-13, F-14 to F-16
  routers/gateways, F-79
  shared-media networks, F-23
  storage area network history, F-102 to F-103
  switches, F-29
  TCP/IP reliance, F-95
  time of flight, F-13
  topology, F-30
Locality, see Principle of locality
Local Memory
  centralized shared-memory architectures, 351
  definition, 292, 314
  distributed shared-memory, 379
  Fermi GPU, 306
  Grid mapping, 293
  multiprocessor architecture, 348
  NVIDIA GPU Memory structures, 304, 304–305
  SIMD, 315
  symmetric shared-memory multiprocessors, 363–364
Local miss rate, definition, B-31
Local node, directory-based cache coherence protocol basics, 382
Local optimizations, compilers, A-26
Local predictors, tournament predictors, 164–166
Local scheduling, ILP, VLIW processor, 194–195
Locks
  via coherence, 389–391
  hardware primitives, 387
  large-scale multiprocessor synchronization, I-18 to I-21
  multiprocessor software development, 409
Lock-up free cache, 83
Logical units, D-34
  storage systems, D-34 to D-35
Logical volumes, D-34
Long displacement addressing, VAX, K-67
Long-haul networks, see Wide area networks (WANs)
Long Instruction Word (LIW)
  EPIC, L-32
  multiple-issue processors, L-28, L-30
Long integer
  operand sizes/types, 12
  SPEC benchmarks, A-14
Loop-carried dependences
  CUDA, 290
  definition, 315–316
  dependence distance, H-6
  dependent computation elimination, 321
  example calculations, H-4 to H-5
  GCD, 319
  loop-level parallelism, H-3
  as recurrence, 318
  recurrence form, H-5
  VMIPS, 268
Loop exit predictor, Intel Core i7, 166
Loop interchange, compiler optimizations, 88–89
Loop-level parallelism
  definition, 149–150
  detection and enhancement
    basic approach, 315–318
    dependence analysis, H-6 to H-10
    dependence computation elimination, 321–322
    dependences, locating, 318–321
    dependent computation elimination, H-10 to H-12
    overview, H-2 to H-6
  history, L-30 to L-31
  ILP in perfect processor, 215
  ILP for realizable processors, 217–218
Loop stream detection, Intel Core i7 micro-op buffer, 238
Loop unrolling
  basic considerations, 161–162
  ILP exposure, 157–161
  ILP limitation studies, 220
  recurrences, H-12
  software pipelining, H-12 to H-15, H-13, H-15
  Tomasulo’s algorithm, 179, 181–183
  VLIW processors, 195
Lossless networks
  definition, F-11 to F-12
  switch buffer organizations, F-59
Lossy networks, definition, F-11 to F-12
LRU, see Least recently used (LRU)
Lucas
  compiler optimizations, A-29
  data cache misses, B-10
LUI, see Load upper immediate (LUI)
LU kernel
  characteristics, I-8
  distributed-memory multiprocessor, I-32
  symmetric shared-memory multiprocessors, I-22, I-23, I-25
M
MAC, see Multiply-accumulate (MAC)
Machine language programmer, L-17 to L-18
Machine memory, Virtual Machines, 110
Macro-op fusion, Intel Core i7, 237–238
Magnetic storage
  access time, D-3
  cost vs. access time, D-3
  historical background, L-77 to L-79
Mail servers, benchmarking, D-20
Main Memory
  addressing modes, A-10
  address translation, B-46
  arithmetic intensity example, 286, 286–288
  block placement, B-44
  cache function, B-2
  cache optimization, B-30, B-36
  coherence protocol, 362
  definition, 292, 309
  DRAM, 17
  gather-scatter, 329
  GPU vs. MIMD, 327
  GPUs and coprocessors, 330
  GPU threads, 332
  ILP considerations, 245
  interlane wiring, 273
  linear speedups, 407
  memory hierarchy basics, 76
  memory hierarchy design, 72
  memory mapping, B-42
  MIPS operations, A-36
  Multimedia SIMD vs. GPUs, 312
  multiprocessor cache coherence, 352
  paging vs. segmentation, B-43
  partitioning, B-50
Main Memory (continued)
  processor performance calculations, 218–219
  RISC code size, A-23
  server energy efficiency, 462
  symmetric shared-memory multiprocessors, 363
  vector processor, G-25
  vs. virtual memory, B-3, B-41
  virtual memory block identification, B-44 to B-45
  virtual memory writes, B-45 to B-46
  VLIW, 196
  write-back, B-11
  write process, B-45
Manufacturing cost
  chip fabrication case study, 61–62
  cost trends, 27
  modern processors, 62
  vs. operation cost, 33
MapReduce
  cloud computing, 455
  cost calculations, 458–460, 459
  Google usage, 437
  reductions, 321
  WSC batch processing, 437–438
  WSC cost-performance, 474
Mark-I, L-3 to L-4, L-6
Mark-II, L-4
Mark-III, L-4
Mark-IV, L-4
Mask Registers
  basic operation, 275–276
  definition, 309
  Multimedia SIMD, 283
  NVIDIA GPU computational structures, 291
  vector compilers, 303
  vector vs. GPU, 311
  VMIPS, 267
MasPar, L-44
Massively parallel processors (MPPs)
  characteristics, I-45
  cluster history, L-62, L-72 to L-73
  system area network history, F-100 to F-101
Matrix300 kernel
  definition, 56
  prediction buffer, C-29
Matrix multiplication
  benchmarks, 56
  LU kernel, I-8
  multidimensional arrays in vector architectures, 278
Mauchly, John, L-2 to L-3, L-5, L-19
Maximum transfer unit, network interfaces, F-7 to F-8
Maximum vector length (MVL)
  Multimedia SIMD extensions, 282
  vector vs. GPU, 311
  VLRs, 274–275
M-bus, see Memory bus (M-bus)
McCreight, Ed, F-99
MCF
  compiler optimizations, A-29
  data cache misses, B-10
  Intel Core i7, 240–241
MCP operating system, L-16
Mean time between failures (MTBF)
  fallacies, 56–57
  RAID, L-79
  SLA states, 34
Mean time to failure (MTTF)
  computer system power consumption case study, 63–64
  dependability benchmarks, D-21
  disk arrays, D-6
  example calculations, 34–35
  I/O subsystem design, D-59 to D-61
  RAID reconstruction, D-55 to D-57
  SLA states, 34
  TB-80 cluster, D-40 to D-41
  WSCs vs. servers, 434
Mean time to repair (MTTR)
  dependability benchmarks, D-21
  disk arrays, D-6
  RAID 6, D-8 to D-9
  RAID reconstruction, D-56
Mean time until data loss (MTDL), RAID reconstruction, D-55 to D-57
Media, interconnection networks, F-9 to F-12
Media extensions, DSPs, E-10 to E-11
Mellanox MHEA28-XT, F-76
Memory access
  ARM Cortex-A8 example, 117
  basic MIPS pipeline, C-36
  vs. block size, B-28
  cache hit calculation, B-5 to B-6
  Cray Research T3D, F-87
  data hazards requiring stalls, C-19 to C-21
  data hazard stall minimization, C-17, C-19
  distributed-memory multiprocessor, I-32
  exception stopping/restarting, C-46
  hazards and forwarding, C-56 to C-57
  instruction set complications, C-49
  integrated instruction fetch units, 208
  MIPS data transfers, A-34
  MIPS exceptions, C-48 to C-49
  MIPS pipeline control, C-37 to C-39
  MIPS R4000, C-65
  multimedia instruction compiler support, A-31
  pipeline branch issues, C-40, C-42
  RISC classic pipeline, C-7, C-10
  shared-memory workloads, 372
  simple MIPS implementation, C-32 to C-33
  simple RISC implementation, C-6
  structural hazards, C-13 to C-14
  vector architectures, G-10
Memory addressing
  ALU immediate operands, A-12
  basic considerations, A-11 to A-13
  compiler-based speculation, H-32
  displacement values, A-12
  immediate value distribution, A-13
  interpretation, A-7 to A-8
  ISA, 11
  vector architectures, G-10
Memory banks, see also Banked memory
  gather-scatter, 280
  multiprocessor architecture, 347
  parallelism, 45
  shared-memory multiprocessors, 363
  strides, 279
  vector load/store unit bandwidth, 276–277
  vector systems, G-9 to G-11
Memory bus (M-bus)
  definition, 351
  Google WSC servers, 469
  interconnection networks, F-88
Memory consistency
  basic considerations, 392–393
  cache coherence, 352
  compiler optimization, 396
  development of models, L-64
  directory-based cache coherence protocol basics, 382
  multiprocessor cache coherency, 353
  relaxed consistency models, 394–395
  single-chip multicore processor case study, 412–418
  speculation to hide latency, 396–397
Memory-constrained scaling, scientific applications on parallel processors, I-33
Memory hierarchy
  address space, B-57 to B-58
  basic questions, B-6 to B-12
  block identification, B-7 to B-9
  block placement issues, B-7
  block replacement, B-9 to B-10
  cache optimization
    basic categories, B-22
    basic optimizations, B-40
    hit time reduction, B-36 to B-40
    miss categories, B-23 to B-26
    miss penalty reduction
      via multilevel caches, B-30 to B-35
      read misses vs. writes, B-35 to B-36
    miss rate reduction
      via associativity, B-28 to B-30
      via block size, B-26 to B-28
      via cache size, B-28
    pipelined cache access, 82
  cache performance, B-3 to B-6
    average memory access time, B-17 to B-20
    basic considerations, B-16
    basic equations, B-22
    example calculation, B-16
    out-of-order processors, B-20 to B-22
  case studies, B-60 to B-67
  development, L-9 to L-12
  inclusion, 397–398
  interconnection network protection, F-87 to F-88
  levels in slow down, B-3
  Opteron data cache example, B-12 to B-15, B-13
  Opteron L1/L2, B-57
  OS and page size, B-58
  overview, B-39
  Pentium vs. Opteron protection, B-57
  processor examples, B-3
  process protection, B-50
  terminology, B-2 to B-3
  virtual memory
    basic considerations, B-40 to B-44, B-48 to B-49
    basic questions, B-44 to B-46
    fast address translation, B-46
    overview, B-48
    paged example, B-54 to B-57
    page size selection, B-46 to B-47
    segmented example, B-51 to B-54
  write strategy, B-10 to B-12
  WSCs, 443, 443–446, 444
Memory hierarchy design
  access times, 77
  Alpha 21264 floorplan, 143
  ARM Cortex-A8 example, 114–117, 115–117
  cache coherency, 112–113
  cache optimization
    case study, 131–133
    compiler-controlled prefetching, 92–95
    compiler optimizations, 87–90
    critical word first, 86–87
    energy consumption, 81
    hardware instruction prefetching, 91–92, 92
    multibanked caches, 85–86, 86
    nonblocking caches, 83–85, 84
    overview, 78–79
    pipelined cache access, 82
    techniques overview, 96
    way prediction, 81–82
    write buffer merging, 87, 88
  cache performance prediction, 125–126
  cache size and misses per instruction, 126
  DDR2 SDRAM timing diagram, 139
  highly parallel memory systems, 133–136
  high memory bandwidth, 126
  instruction miss benchmarks, 127
  instruction simulation, 126
  Intel Core i7, 117–124, 119, 123–125
  Intel Core i7 three-level cache hierarchy, 118
  Intel Core i7 TLB structure, 118
  Intel 80x86 virtualization issues, 128
  memory basics, 74–78
  overview, 72–74
  protection and ISA, 112
  server vs. PMD, 72
  system call virtualization/paravirtualization performance, 141
  virtual machine monitor, 108–109
  Virtual Machines ISA support, 109–110
  Virtual Machines protection, 107–108
  Virtual Machines and virtual memory and I/O, 110–111
  virtual memory protection, 105–107
  VMM on nonvirtualizable ISA, 128–129
  Xen VM example, 111
Memory Interface Unit
  NVIDIA GPU ISA, 300
  vector processor example, 310
Memoryless, definition, D-28
Memory mapping
  memory hierarchy, B-48 to B-49
  segmented virtual memory, B-52
  TLBs, 323
  virtual memory definition, B-42
Memory-memory instruction set architecture, ISA classification, A-3, A-5
Memory protection
  control dependence, 155
  Pentium vs. Opteron, B-57
  processes, B-50
Memory protection (continued)
  safe calls, B-54
  segmented virtual memory example, B-51 to B-54
  virtual memory, B-41
Memory stall cycles
  average memory access time, B-17
  definition, B-4 to B-5
  miss rate calculation, B-6
  out-of-order processors, B-20 to B-21
  performance equations, B-22
Memory system
  cache optimization, B-36
  coherency, 352–353
  commercial workloads, 367, 369–371
  computer architecture, 15
  C program evaluation, 134–135
  dependability enhancement, 104–105
  distributed shared-memory, 379, 418
  gather-scatter, 280
  GDRAMs, 323
  GPUs, 332
  ILP, 245
    hardware vs. software speculation, 221–222
    speculative execution, 222–223
  Intel Core i7, 237, 242
  latency, B-21
  MIPS, C-33
  multiprocessor architecture, 347
  multiprocessor cache coherence, 352
  multiprogramming workload, 377–378
  page size changes, B-58
  price/performance/power considerations, 53
  RISC, C-7
  Roofline model, 286
  shared-memory multiprocessors, 363
  SMT, 399–400
  stride handling, 279
  T1 multithreading unicore performance, 227
  vector architectures, G-9 to G-11
  vector chaining, G-11
  vector processors, 271, 277
  virtual, B-43, B-46
Memory technology basics
  DRAM, 98, 98–100, 99
  DRAM and DIMM characteristics, 101
  DRAM performance, 100–102
  Flash memory, 102–104
  overview, 96–97
  performance trends, 20
  SDRAM power consumption, 102, 103
  SRAM, 97–98
Mesh interface unit (MIU), Intel SCCC, F-70
Mesh network
  characteristics, F-73
  deadlock, F-47
  dimension-order routing, F-47 to F-48
  OCN history, F-104
  routing example, F-46
Mesh topology
  characteristics, F-36
  direct networks, F-37
  NEWS communication, F-42 to F-43
MESI, see Modified-Exclusive-Shared-Invalid (MESI) protocol
Message ID, packet header, F-8, F-16
Message-passing communication
  historical background, L-60 to L-61
  large-scale multiprocessors, I-5 to I-6
Message Passing Interface (MPI)
  function, F-8
  InfiniBand, F-77
  lack in shared-memory multiprocessors, I-5
Messages
  adaptive routing, F-93 to F-94
  coherence maintenance, 381
  InfiniBand, F-76
  interconnection networks, F-6 to F-9
  zero-copy protocols, F-91
MFLOPS, see Millions of floating-point operations per second (MFLOPS)
Microarchitecture
  as architecture component, 15–16
  ARM Cortex-A8, 241
  Cray X1, G-21 to G-22
  data hazards, 168
  ILP exploitation, 197
  Intel Core i7, 236–237
  Nehalem, 411
  OCNs, F-3
  out-of-order example, 253
  PTX vs. x86, 298
  switches, see Switch microarchitecture
  techniques case study, 247–254
Microbenchmarks
  disk array deconstruction, D-51 to D-55
  disk deconstruction, D-48 to D-51
Microfusion, Intel Core i7 micro-op buffer, 238
Microinstructions
  complications, C-50 to C-51
  x86, 298
Micro-ops
  Intel Core i7, 237, 238–240, 239
  processor clock rates, 244
Microprocessor overview
  clock rate trends, 24
  cost trends, 27–28
  desktop computers, 6
  embedded computers, 8–9
  energy and power, 23–26
  inside disks, D-4
  integrated circuit improvements, 2
  and Moore’s law, 3–4
  performance trends, 19–20, 20
  power and energy system trends, 21–23
  recent advances, L-33 to L-34
  technology trends, 18
Microprocessor without Interlocked Pipeline Stages, see MIPS (Microprocessor without Interlocked Pipeline Stages)
Microsoft
  cloud computing, 455
  containers, L-74
  Intel support, 245
  WSCs, 464–465
Microsoft Azure, 456, L-74
Microsoft DirectX, L-51 to L-52
Microsoft Windows
  benchmarks, 38
  multithreading, 223
  RAID benchmarks, D-22, D-22 to D-23
  time/volume/commoditization impact, 28
  WSC workloads, 441
Microsoft Windows 2008 Server
  real-world considerations, 52–55
  SPECpower benchmark, 463
Microsoft XBox, L-51
Migration, cache coherent multiprocessors, 354
Millions of floating-point operations per second (MFLOPS)
  early performance measures, L-7
  parallel processing debates, L-57 to L-58
  SIMD computer history, L-55
  SIMD supercomputer development, L-43
  vector performance measures, G-15 to G-16
MIMD (Multiple Instruction Streams, Multiple Data Streams)
  and Amdahl’s law, 406–407
  definition, 10
  early computers, L-56
  first vector computers, L-46, L-48
  GPU programming, 289
  GPUs vs. vector architectures, 310
  with Multimedia SIMD, vs. GPU, 324–330
  multiprocessor architecture, 346–348
  speedup via parallelism, 263
  TLP, basic considerations, 344–345
Minicomputers, replacement by microprocessors, 3–4
Minnespec benchmarks
  ARM Cortex-A8, 116, 235
  ARM Cortex-A8 memory, 115–116
MINs, see Multistage interconnection networks (MINs)
MIPS (Microprocessor without Interlocked Pipeline Stages)
  addressing modes, 11–12
  basic pipeline, C-34 to C-36
  branch predictor correlation, 163
  cache performance, B-6
  conditional branches, K-11
  conditional instructions, H-27
  control flow instructions, 14
  data dependences, 151
  data hazards, 169
  dynamic scheduling with Tomasulo’s algorithm, 171, 173
  early pipelined CPUs, L-26
  embedded systems, E-15
  encoding, 14
  exceptions, C-48, C-48 to C-49
  exception stopping/restarting, C-46 to C-47
  features, K-44
  FP pipeline performance, C-60 to C-61, C-62
  FP unit with Tomasulo’s algorithm, 173
  hazard checks, C-71
  ILP, 149
  ILP exposure, 157–158
  ILP hardware model, 215
  instruction execution issues, K-81
  instruction formats, core instructions, K-6
  instruction set complications, C-49 to C-51
  ISA class, 11
  ISA example
    addressing modes for data transfer, A-34
    arithmetic/logical instructions, A-37
    basic considerations, A-32 to A-33
    control flow instructions, A-37 to A-38, A-38
    data types, A-34
    dynamic instruction mix, A-41, A-41 to A-42, A-42
    FP operations, A-38 to A-39
    instruction format, A-35
    load-store instructions, A-36
    MIPS operations, A-35 to A-37
    registers, A-34
    usage, A-39
  Livermore Fortran kernel performance, 331
  memory addressing, 11
  multicycle operations
    basic considerations, C-51 to C-54
    hazards and forwarding, C-54 to C-58
    precise exceptions, C-58 to C-60
  multimedia support, K-19
  multiple-issue processor history, L-29
  operands, 12
  performance measurement history, L-6 to L-7
  pipeline branch issues, C-39 to C-42
  pipeline control, C-36 to C-39
  pipe stage, C-37
  processor performance calculations, 218–219
  registers and usage conventions, 12
  RISC code size, A-23
  RISC history, L-19
  RISC instruction set lineage, K-43
  as RISC systems, K-4
  scoreboard components, C-76
  scoreboarding, C-72
  scoreboarding steps, C-73, C-73 to C-74
  simple implementation, C-31 to C-34, C-34
  Sony PlayStation 2 Emotion Engine, E-17
  unaligned word read instructions, K-26
  unpipelined functional units, C-52
  vs. VAX, K-65 to K-66, K-75, K-82
  write strategy, B-10
MIPS16
  addressing modes, K-6
  arithmetic/logical instructions, K-24
  characteristics, K-4
  constant extension, K-9
  data transfer instructions, K-23
  embedded instruction format, K-8
  instructions, K-14 to K-16
  multiply-accumulate, K-20
  RISC code size, A-23
  unique instructions, K-40 to K-42
MIPS32, vs. VAX sort, K-80
MIPS64
  addressing modes, K-5
  arithmetic/logical instructions, K-11
MIPS64 (continued)
  conditional branches, K-17
  constant extension, K-9
  conventions, K-13
  data transfer instructions, K-10
  FP instructions, K-23
  instruction list, K-26 to K-27
  instruction set architecture formats, 14
  instruction subset, 13, A-40
  in MIPS R4000, C-61
  nonaligned data transfers, K-24 to K-26
  RISC instruction set, C-4
MIPS2000, instruction benchmarks, K-82
MIPS 3010, chip layout, J-59
MIPS core
  compare and conditional branch, K-9 to K-16
  equivalent RISC instructions
    arithmetic/logical, K-11
    arithmetic/logical instructions, K-15
    common extensions, K-19 to K-24
    control instructions, K-12, K-16
    conventions, K-16
    data transfers, K-10
    embedded RISC data transfers, K-14
    FP instructions, K-13
  instruction formats, K-9
MIPS M2000, L-21, L-21
MIPS MDMX
  characteristics, K-18
  multimedia support, K-18
MIPS R2000, L-20
MIPS R3000
  integer arithmetic, J-12
  integer overflow, J-11
MIPS R3010
  arithmetic functions, J-58 to J-61
  chip comparison, J-58
  floating-point exceptions, J-35
MIPS R4000
  early pipelined CPUs, L-27
  FP pipeline, C-65 to C-67, C-66
  integer pipeline, C-63
  pipeline overview, C-61 to C-65
  pipeline performance, C-67 to C-70
  pipeline structure, C-62 to C-63
MIPS R8000, precise exceptions, C-59
MIPS R10000, 81
  latency hiding, 397
  precise exceptions, C-59
Misalignment, memory address interpretation, A-7 to A-8, A-8
MISD, see Multiple Instruction Streams, Single Data Stream
Misprediction rate
  branch-prediction buffers, C-29
  predictors on SPEC89, 166
  profile-based predictor, C-27
  SPEC CPU2006 benchmarks, 167
Mispredictions
  ARM Cortex-A8, 232, 235
  branch predictors, 164–167, 240, C-28
  branch-target buffers, 205
  hardware-based speculation, 190
  hardware vs. software speculation, 221
  integer vs. FP programs, 212
  Intel Core i7, 237
  prediction buffers, C-29
  static branch prediction, C-26 to C-27
Misses per instruction
  application/OS statistics, B-59
  cache performance, B-5 to B-6
  cache protocols, 359
  cache size effect, 126
  L3 cache block size, 371
  memory hierarchy basics, 75
  performance impact calculations, B-18
  shared-memory workloads, 372
  SPEC benchmarks, 127
  strided access-TLB interactions, 323
Miss penalty
  average memory access time, B-16 to B-17
  cache optimization, 79, B-35 to B-36
  cache performance, B-4, B-21
  compiler-controlled prefetching, 92–95
  critical word first, 86–87
  hardware prefetching, 91–92
  ILP speculative execution, 223
  memory hierarchy basics, 75–76
  nonblocking cache, 83
  out-of-order processors, B-20 to B-22
  processor performance calculations, 218–219
  reduction via multilevel caches, B-30 to B-35
  write buffer merging, 87
Miss rate
  AMD Opteron data cache, B-15
  ARM Cortex-A8, 116
  average memory access time, B-16 to B-17, B-29
  basic categories, B-23
  vs. block size, B-27
  cache optimization, 79
    and associativity, B-28 to B-30
    and block size, B-26 to B-28
    and cache size, B-28
  cache performance, B-4
  and cache size, B-24 to B-25
  compiler-controlled prefetching, 92–95
  compiler optimizations, 87–90
  early IBM computers, L-10 to L-11
  example calculations, B-6, B-31 to B-32
  hardware prefetching, 91–92
  Intel Core i7, 123, 125, 241
  memory hierarchy basics, 75–76
  multilevel caches, B-33
  processor performance calculations, 218–219
  scientific workloads
    distributed-memory multiprocessors, I-28 to I-30
    symmetric shared-memory multiprocessors, I-22, I-23 to I-25
  shared-memory multiprogramming workload, 376, 376–377
  shared-memory workload, 370–373
  single vs. multiple thread executions, 228
  Sun T1 multithreading unicore performance, 228
  vs. virtual addressed cache size, B-37
MIT Raw, characteristics, F-73
Mitsubishi M32R
  addressing modes, K-6
  arithmetic/logical instructions, K-24
  characteristics, K-4
  condition codes, K-14
  constant extension, K-9
  data transfer instructions, K-23
  embedded instruction format, K-8
  multiply-accumulate, K-20
  unique instructions, K-39 to K-40
MIU, see Mesh interface unit (MIU)
Mixed cache
  AMD Opteron example, B-15
  commercial workload, 373
Mixer, radio receiver, E-23
Miya, Eugene, L-65
M/M/1 model
  example, D-32, D-32 to D-33
  overview, D-30
  RAID performance prediction, D-57
  sample calculations, D-33
M/M/2 model, RAID performance prediction, D-57
MMX, see Multimedia Extensions (MMX)
Mobile clients
  data usage, 3
  GPU features, 324
  vs. server GPUs, 323–330
Modified-Exclusive-Shared-Invalid (MESI) protocol, characteristics, 362
Modified-Owned-Exclusive-Shared-Invalid (MOESI) protocol, characteristics, 362
Modified state
  coherence protocol, 362
  directory-based cache coherence protocol basics, 380
  large-scale multiprocessor cache coherence, I-35
  snooping coherence protocol, 358–359
Modula-3, integer division/remainder, J-12
Module availability, definition, 34
Module reliability, definition, 34
MOESI, see Modified-Owned-Exclusive-Shared-Invalid (MOESI) protocol
Moore’s law
  DRAM, 100
  flawed architectures, A-45
  interconnection networks, F-70
  and microprocessor dominance, 3–4
  point-to-point links and switches, D-34
  RISC, A-3
  RISC history, L-22
  software importance, 55
  switch size, F-29
  technology trends, 17
Mortar shot graphs, multiprocessor performance measurement, 405–406
Motion JPEG encoder, Sanyo VPC-SX500 digital camera, E-19
Motorola 68000
  characteristics, K-42
  memory protection, L-10
Motorola 68882, floating-point precisions, J-33
Move address, VAX, K-70
MPEG
  Multimedia SIMD Extensions history, L-49
  multimedia support, K-17
  Sanyo VPC-SX500 digital camera, E-19
  Sony PlayStation 2 Emotion Engine, E-17
MPI, see Message Passing Interface (MPI)
MPPs, see Massively parallel processors (MPPs)
MSP, see Multi-Streaming Processor (MSP)
MTBF, see Mean time between failures (MTBF)
MTDL, see Mean time until data loss (MTDL)
MTTF, see Mean time to failure (MTTF)
MTTR, see Mean time to repair (MTTR)
Multibanked caches
  cache optimization, 85–86
  example, 86
Multichip modules, OCNs, F-3
Multicomputers
  cluster history, L-63
  definition, 345, L-59
  historical background, L-64 to L-65
Multicore processors
  architecture goals/requirements, 15
  cache coherence, 361–362
  centralized shared-memory multiprocessor structure, 347
  Cray X1E, G-24
  directory-based cache coherence, 380
  directory-based coherence, 381, 419
  DSM architecture, 348, 379
  multichip
    cache and memory states, 419
    with DSM, 419
  multiprocessors, 345
  OCN history, F-104
  performance, 400–401, 401
  performance gains, 398–400
  performance milestones, 20
  single-chip case study, 412–418
  and SMT, 404–405
  snooping cache coherence implementation, 365
  SPEC benchmarks, 402
  uniform memory access, 364
  write invalidate protocol implementation, 356–357
Multics protection software, L-9
Multicycle operations, MIPS pipeline
  basic considerations, C-51 to C-54
  hazards and forwarding, C-54 to C-58
  precise exceptions, C-58 to C-60
Multidimensional arrays
  dependences, 318
  in vector architectures, 278–279
Multiflow processor, L-30, L-32
Multigrid methods, Ocean application, I-9 to I-10
Multilevel caches
  cache optimizations, B-22
  centralized shared-memory architectures, 351
  memory hierarchy basics, 76
  memory hierarchy history, L-11
  miss penalty reduction, B-30 to B-35
  miss rate vs. cache size, B-33
Multilevel caches (continued)
  Multimedia SIMD vs. GPU, 312
  performance equations, B-22
  purpose, 397
  write process, B-11
Multilevel exclusion, definition, B-35
Multilevel inclusion
  definition, 397, B-34
  implementation, 397
  memory hierarchy history, L-11
Multimedia applications
  desktop processor support, E-11
  GPUs, 288
  ISA support, A-46
  MIPS FP operations, A-39
  vector architectures, 267
Multimedia Extensions (MMX)
  compiler support, A-31
  desktop RISCs, K-18
  desktop/server RISCs, K-16 to K-19
  SIMD history, 262, L-50
  vs. vector architectures, 282–283
Multimedia instructions
  ARM Cortex-A8, 236
  compiler support, A-31 to A-32
Multimedia SIMD Extensions
  basic considerations, 262, 282–284
  compiler support, A-31
  DLP, 322
  DSPs, E-11
  vs. GPUs, 312
  historical background, L-49 to L-50
  MIMD, vs. GPU, 324–330
  parallelism classes, 10
  programming, 285
  Roofline visual performance model, 285–288, 287
  256-bit-wide operations, 282
  vs. vector, 263–264
Multimedia user interfaces, PMDs, 6
Multimode fiber, interconnection networks, F-9
Multipass array multiplier, example, J-51
Multiple Instruction Streams, Multiple Data Streams, see MIMD (Multiple Instruction Streams, Multiple Data Streams)
Multiple Instruction Streams, Single Data Stream (MISD), definition, 10
Multiple-issue processors
  basic VLIW approach, 193–196
  with dynamic scheduling and speculation, 197–202
  early development, L-28 to L-30
  instruction fetch bandwidth, 202–203
  integrated instruction fetch units, 207
  loop unrolling, 162
  microarchitectural techniques case study, 247–254
  primary approaches, 194
  SMT, 224, 226
  with speculation, 198
  Tomasulo’s algorithm, 183
Multiple lanes technique
  vector instruction set, 271–273
  vector performance, G-7 to G-9
  vector performance calculations, G-8
Multiple paths, ILP limitation studies, 220
Multiple-precision addition, J-13
Multiply-accumulate (MAC)
  DSP, E-5
  embedded RISCs, K-20
  TI TMS320C55 DSP, E-8
Multiply operations
  chip comparison, J-61
  floating point
    denormals, J-20 to J-21
    examples, J-19
    multiplication, J-17 to J-20
    precision, J-21
    rounding, J-18, J-19
  integer arithmetic
    array multiplier, J-50
    Booth recoding, J-49
    even/odd array, J-52
    issues, J-11
    with many adders, J-50 to J-54
    multipass array multiplier, J-51
    n-bit unsigned integers, J-4
    Radix-2, J-4 to J-7
    signed-digit addition table, J-54
    with single adder, J-47 to J-49, J-48
    Wallace tree, J-53
  integer shifting over zeros, J-45 to J-47
  PA-RISC instructions, K-34 to K-35
  unfinished instructions, 179
Multiprocessor basics
  architectural issues and approaches, 346–348
  architecture goals/requirements, 15
  architecture and software development, 407–409
  basic hardware primitives, 387–389
  cache coherence, 352–353
  coining of term, L-59
  communication calculations, 350
  computer categories, 10
  consistency models, 395
  definition, 345
  early machines, L-56
  embedded systems, E-14 to E-15
  fallacies, 55
  locks via coherence, 389–391
  low-to-high-end roles, 344–345
  parallel processing challenges, 349–351
  for performance gains, 398–400
  performance trends, 21
  point-to-point example, 413
  shared-memory, see Shared-memory multiprocessors
  SMP, 345, 350, 354–355, 363–364
  streaming Multiprocessor, 292, 307, 313–314
Multiprocessor history
  bus-based coherent multiprocessors, L-59 to L-60
  clusters, L-62 to L-64
  early computers, L-56
  large-scale multiprocessors, L-60 to L-61
  parallel processing debates, L-56 to L-58
  recent advances and developments, L-58 to L-60
  SIMD computers, L-55 to L-56
  synchronization and consistency models, L-64
  virtual memory, L-64
Multiprogramming
  definition, 345
  multithreading, 224
  performance, 36
  shared-memory workload performance, 375–378, 377
  shared-memory workloads, 374–375
  software optimization, 408
  virtual memory-based protection, 105–106, B-49
  workload execution time, 375
Multistage interconnection networks (MINs)
  bidirectional, F-33 to F-34
  crossbar switch calculations, F-31 to F-32
  vs. direct network costs, F-92
  example, F-31
  self-routing, F-48
  system area network history, F-100 to F-101
  topology, F-30 to F-31, F-38 to F-39
Multistage switch fabrics, topology, F-30
Multi-Streaming Processor (MSP)
  Cray X1, G-21 to G-23, G-22, G-23 to G-24
  Cray X1E, G-24
  first vector computers, L-46
Multithreaded SIMD Processor
  block diagram, 294
  definition, 292, 309, 313–314
  Fermi GPU architectural innovations, 305–308
  Fermi GPU block diagram, 307
  Fermi GTX 480 GPU floorplan, 295, 295–296
  GPU programming, 289–290
  GPUs vs. vector architectures, 310, 310–311
  Grid mapping, 293
  NVIDIA GPU computational structures, 291
  NVIDIA GPU Memory structures, 304, 304–305
  Roofline model, 326
Multithreaded vector processor
  definition, 292
  Fermi GPU comparison, 305
Multithreading
  coarse-grained, 224–226
  definition and types, 223–225
  fine-grained, 224–226
  GPU programming, 289
  historical background, L-34 to L-35
  ILP, 223–232
  memory hierarchy basics, 75–76
  parallel benchmarks, 231, 231–232
  for performance gains, 398–400
  SMT, see Simultaneous multithreading (SMT)
  Sun T1 effectiveness, 226–229
MVAPICH, F-77
MVL, see Maximum vector length (MVL)
MXP processor, components, E-14
Myrinet SAN, F-67
  characteristics, F-76
  cluster history, L-62 to L-63, L-73
  routing algorithms, F-48
  switch vs. NIC, F-86
  system area network history, F-100
N
NAK, see Negative acknowledge (NAK)
Name dependences
  ILP, 152–153
  locating dependences, 318–319
  loop-level parallelism, 315
  scoreboarding, C-79
  Tomasulo’s algorithm, 171–172
Nameplate power rating, WSCs, 449
NaN (Not a Number), J-14, J-16, J-21, J-34
NAND Flash, definition, 103
NAS, see Network attached storage (NAS)
NAS Parallel Benchmarks
  InfiniBand, F-76
  vector processor history, G-28
National Science Foundation, WAN history, F-98
Natural parallelism
  embedded systems, E-15
  multiprocessor importance, 344
  multithreading, 223
n-bit adder, carry-lookahead, J-38
n-bit number representation, J-7 to J-10
n-bit unsigned integer division, J-4
N-body algorithms, Barnes application, I-8 to I-9
NBS DYSEAC, L-81
N-cube topology, characteristics, F-36
NEC Earth Simulator, peak performance, 58
NEC SX/2, L-45, L-47
NEC SX/5, L-46, L-48
NEC SX/6, L-46, L-48
NEC SX-8, L-46, L-48
NEC SX-9
  first vector computers, L-49
  Roofline model, 286–288, 287
NEC VR 4122, embedded benchmarks, E-13
Negative acknowledge (NAK)
  cache coherence, I-38 to I-39
  directory controller, I-40 to I-41
  DSM multiprocessor cache coherence, I-37
Negative condition code, MIPS core, K-9 to K-16
Negative-first routing, F-48
Nested page tables, 129
NetApp, see Network Appliance (NetApp)
Netflix, AWS, 460
Netscape, F-98
Network Appliance (NetApp)
  FAS6000 filer, D-41 to D-42
  NFS benchmarking, D-20
  RAID, D-9
  RAID row-diagonal parity, D-9
Network attached storage (NAS)
  block servers vs. filers, D-35
  WSCs, 442
Network bandwidth, interconnection network, F-18
Network-Based Computer Laboratory (Ohio State), F-76, F-77
Network buffers, network interfaces, F-7 to F-8
Network fabric, switched-media networks, F-24
Network File System (NFS)
  benchmarking, D-20, D-20
  block servers vs. filers, D-35
  interconnection networks, F-89
  server benchmarks, 40
  TCP/IP, F-81
Networking costs, WSC vs. datacenters, 455
Network injection bandwidth
  interconnection network, F-18
  multi-device interconnection networks, F-26
Network interface
  fault tolerance, F-67
  functions, F-6 to F-7
  message composition/processing, F-6 to F-9
Network interface card (NIC)
  functions, F-8
  Google WSC servers, 469
  vs. I/O subsystem, F-90 to F-91
  storage area network history, F-102
  vs. switches, F-85 to F-86, F-86
  zero-copy protocols, F-91
Network layer, definition, F-82
Network nodes
  direct network topology, F-37
  distributed switched networks, F-34 to F-36
Network on chip (NoC), characteristics, F-3
Network ports, interconnection network topology, F-29
Network protocol layer, interconnection networks, F-10
Network reception bandwidth, interconnection network, F-18
Network reconfiguration
  commercial interconnection networks, F-66
  fault tolerance, F-67
  switch vs. NIC, F-86
Network technology, see also Interconnection networks
  Google WSC, 469
  performance trends, 19–20
  personal computers, F-2
  trends, 18
  WSC bottleneck, 461
  WSC goals/requirements, 433
Network of Workstations, L-62, L-73
NEWS communication, see North-East-West-South communication
Newton’s iteration, J-27 to J-30
NFS, see Network File System (NFS)
NIC, see Network interface card (NIC)
Nicely, Thomas, J-64
NMOS, DRAM, 99
NoC, see Network on chip (NoC)
Nodes
  coherence maintenance, 381
  communication bandwidth, I-3
  direct network topology, F-37
  directory-based cache coherence, 380
  distributed switched networks, F-34 to F-36
  IBM Blue Gene/L, I-42 to I-44
  IBM Blue Gene/L 3D torus network, F-73
  network topology performance and costs, F-40
  in parallel, 336
  points-to analysis, H-9
Nokia cell phone, circuit board, E-24
Nonaligned data transfers, MIPS64, K-24 to K-26
Nonatomic operations
  cache coherence, 361
  directory protocol, 386
Nonbinding prefetch, cache optimization, 93
Nonblocking caches
  cache optimization, 83–85, 131–133
  effectiveness, 84
  ILP speculative execution, 222–223
  Intel Core i7, 118
  memory hierarchy history, L-11
Nonblocking crossbar, centralized switched networks, F-32 to F-33
Nonfaulting prefetches, cache optimization, 92
Nonrestoring division, J-5, J-6
Nonuniform memory access (NUMA)
  DSM as, 348
  large-scale multiprocessor history, L-61
  snooping limitations, 363–364
Non-unit strides
  multidimensional arrays in vector architectures, 278–279
  vector processor, 310, 310–311, G-25
North-East-West-South communication, network topology calculations, F-41 to F-43
North-last routing, F-48
Not a Number (NaN), J-14, J-16, J-21, J-34
Notifications, interconnection networks, F-10
NOW project, L-73
No-write allocate
  definition, B-11
  example calculation, B-12
NSFNET, F-98
NTSC/PAL encoder, Sanyo VPC-SX500 digital camera, E-19
Nullification, PA-RISC instructions, K-33 to K-34
Nullifying branch, branch delay slots, C-24 to C-25
NUMA, see Nonuniform memory access (NUMA)
NVIDIA GeForce, L-51
NVIDIA systems
  fine-grained multithreading, 224
  GPU comparisons, 323–330, 325
  GPU computational structures, 291–297
  GPU computing history, L-52
  GPU ISA, 298–300
  GPU Memory structures, 304, 304–305
  GPU programming, 289
  graphics pipeline history, L-51
  scalable GPUs, L-51
  terminology, 313–315
N-way set associative
  block placement, B-7
  conflict misses, B-23
  memory hierarchy basics, 74
  TLBs, B-49
NYU Ultracomputer, L-60
O
Observed performance, fallacies, 57
Occupancy, communication bandwidth, I-3
Ocean application
  characteristics, I-9 to I-10
  distributed-memory multiprocessor, I-32
  distributed-memory multiprocessors, I-30
  example calculations, I-11 to I-12
  miss rates, I-28
  symmetric shared-memory multiprocessors, I-23
OCNs, see On-chip networks (OCNs)
Offline reconstruction, RAID, D-55
Offload engines
  network interfaces, F-8
  TCP/IP reliance, F-95
Offset
  addressing modes, 12
  AMD64 paged virtual memory, B-55
  block identification, B-7 to B-8
  cache optimization, B-38
  call gates, B-54
  control flow instructions, A-18
  directory-based cache coherence protocols, 381–382
  example, B-9
  gather-scatter, 280
  IA-32 segment, B-53
  instruction decode, C-5 to C-6
  main memory, B-44
  memory mapping, B-52
  MIPS, C-32
  MIPS control flow instructions, A-37 to A-38
  misaligned addresses, A-8
  Opteron data cache, B-13 to B-14
  pipelining, C-42
  PTX instructions, 300
  RISC, C-4 to C-6
  RISC instruction set, C-4
  TLB, B-46
  Tomasulo’s approach, 176
  virtual memory, B-43 to B-44, B-49, B-55 to B-56
OLTP, see On-Line Transaction Processing (OLTP)
Omega
  example, F-31
  packet blocking, F-32
  topology, F-30
OMNETPP, Intel Core i7, 240–241
On-chip cache
  optimization, 79
  SRAM, 98–99
On-chip memory, embedded systems, E-4 to E-5
On-chip networks (OCNs)
  basic considerations, F-3
  commercial implementations, F-73
  commercial interconnection networks, F-63
  cross-company interoperability, F-64
  DOR, F-46
  effective bandwidth, F-18, F-28
  example system, F-70 to F-72
  historical overview, F-103 to F-104
  interconnection network domain relationship, F-4
  interconnection network speed, F-88
  latency and effective bandwidth, F-26 to F-28
  latency vs. nodes, F-27
  link bandwidth, F-89
  packet latency, F-13, F-14 to F-16
  switch microarchitecture, F-57
  time of flight, F-13
  topology, F-30
  wormhole switching, F-51
One’s complement, J-7
One-way conflict misses, definition, B-23
Online reconstruction, RAID, D-55
On-Line Transaction Processing (OLTP)
  commercial workload, 369, 371
  server benchmarks, 41
  shared-memory workloads, 368–370, 373–374
  storage system benchmarks, D-18
OpenCL
  GPU programming, 289
  GPU terminology, 292, 313–315
  NVIDIA terminology, 291
  processor comparisons, 323
OpenGL, L-51
Open source software
  Amazon Web Services, 457
  WSCs, 437
  Xen VMM, see Xen virtual machine
Open Systems Interconnect (OSI)
  Ethernet, F-78 to F-79
  layer definitions, F-82
Operand addressing mode, Intel 80x86, K-59, K-59 to K-60
Operand delivery stage, Itanium 2, H-42
Operands
  DSP, E-6
  forwarding, C-19
  instruction set encoding, A-21 to A-22
  Intel 80x86, K-59
  ISA, 12
  ISA classification, A-3 to A-4
  MIPS data types, A-34
  MIPS pipeline, C-71
  MIPS pipeline FP operations, C-52 to C-53
  NVIDIA GPU ISA, 298
  per ALU instruction example, A-6
  TMS320C55 DSP, E-6
  type and size, A-13 to A-14
  VAX, K-66 to K-68, K-68
  vector execution time, 268–269
Operating systems (general)
  address translation, B-38
  and architecture development, 2
  communication performance, F-8
  disk access scheduling, D-44 to D-45, D-45
  memory protection performance, B-58
  miss statistics, B-59
  multiprocessor software development, 408
  and page size, B-58
  segmented virtual memory, B-54
  server benchmarks, 40
  shared-memory workloads, 374–378
  storage systems, D-35
Operational costs
  basic considerations, 33
  WSCs, 434, 438, 452, 456, 472
Operational expenditures (OPEX)
  WSC costs, 452–455, 454
  WSC TCO case study, 476–478
Operation faults, storage systems, D-11
Operator dependability, disks, D-13 to D-15
OPEX, see Operational expenditures (OPEX)
Optical media, interconnection networks, F-9
Oracle database
  commercial workload, 368
  miss statistics, B-59
  multithreading benchmarks, 232
  single-threaded benchmarks, 243
  WSC services, 441
Ordering, and deadlock, F-47
Organization
  buffer, switch microarchitecture, F-58 to F-60
  cache, performance impact, B-19
  cache blocks, B-7 to B-8
  cache optimization, B-19
  coherence extensions, 362
  computer architecture, 11, 15–16
  DRAM, 98
  MIPS pipeline, C-37
  multiple-issue processor, 197, 198
  Opteron data cache, B-12 to B-13, B-13
  pipelines, 152
  processor history, 2–3
  processor performance equation, 49
  shared-memory multiprocessors, 346
  Sony PlayStation Emotion Engine, E-18
  TLB, B-46
Orthogonality, compiler writing-architecture relationship, A-30
OSI, see Open Systems Interconnect (OSI)
Out-of-order completion
  data hazards, 169
  MIPS pipeline, C-71
  MIPS R10000 sequential consistency, 397
  precise exceptions, C-58
Out-of-order execution
  and cache miss, B-2 to B-3
  cache performance, B-21
  data hazards, 169–170
  hardware-based execution, 184
  ILP, 245
  memory hierarchy, B-2 to B-3
  microarchitectural techniques case study, 247–254
  MIPS pipeline, C-71
  miss penalty, B-20 to B-22
  performance milestones, 20
  power/DLP issues, 322
  processor comparisons, 323
  R10000, 397
  SMT, 246
  Tomasulo’s algorithm, 183
Out-of-order processors
  DLP, 322
  Intel Core i7, 236
  memory hierarchy history, L-11
  multithreading, 226
  vector architecture, 267
Out-of-order write, dynamic scheduling, 171
Output buffered switch
  HOL blocking, F-60
  microarchitecture, F-57, F-57
  organizations, F-58 to F-59
  pipelined version, F-61
Output dependence
  compiler history, L-30 to L-31
  definition, 152–153
  dynamic scheduling, 169–171, C-72
  finding, H-7 to H-8
  loop-level parallelism calculations, 320
  MIPS scoreboarding, C-79
Overclocking
  microprocessors, 26
  processor performance equation, 52
Overflow, integer arithmetic, J-8, J-10 to J-11, J-11
Overflow condition code, MIPS core, K-9 to K-16
Overhead
  adaptive routing, F-93 to F-94
  Amdahl’s law, F-91
  communication latency, I-4
  interconnection networks, F-88, F-91 to F-92
  OCNs vs. SANs, F-27
  vs. peak performance, 331
  shared-memory communication, I-5
  sorting case study, D-64 to D-67
  time of flight, F-14
  vector processor, G-4
Overlapping triplets
  historical background, J-63
  integer multiplication, J-49
Oversubscription
  array switch, 443
  Google WSC, 469
  WSC architecture, 441, 461
P
Packed decimal, definition, A-14
Packet discarding, congestion management, F-65
Packets
  ATM, F-79
  bidirectional rings, F-35 to F-36
  centralized switched networks, F-32
  effective bandwidth vs. packet size, F-19
  format example, F-7
  IBM Blue Gene/L 3D torus network, F-73
  InfiniBand, F-75, F-76
  interconnection networks, multi-device networks, F-25
  latency issues, F-12, F-13
  lossless vs. lossy networks, F-11 to F-12
  network interfaces, F-8 to F-9
  network routing, F-44
  routing/arbitration/switching impact, F-52
  switched network topology, F-40
  switching, F-51
  switch microarchitecture, F-57 to F-58
  switch microarchitecture pipelining, F-60 to F-61
  TI TMS320C6x DSP, E-10
  topology, F-21
  virtual channels and throughput, F-93
Packet transport, interconnection networks, F-9 to F-12
Page coloring, definition, B-38
Paged segments, characteristics, B-43 to B-44
Paged virtual memory
  Opteron example, B-54 to B-57
  protection, 106
  vs. segmented, B-43
Page faults
  cache optimization, A-46
  exceptions, C-43 to C-44
  hardware-based speculation, 188
  and memory hierarchy, B-3
  MIPS exceptions, C-48
  Multimedia SIMD Extensions, 284
  stopping/restarting execution, C-46
  virtual memory definition, B-42
  virtual memory miss, B-45
Page offset
  cache optimization, B-38
  main memory, B-44
  TLB, B-46
  virtual memory, B-43, B-49, B-55 to B-56
Pages
  definition, B-3
  vs. segments, B-43
  size selection, B-46 to B-47
  virtual memory definition, B-42 to B-43
  virtual memory fast address translation, B-46
Page size
  cache optimization, B-38
  definition, B-56
  memory hierarchy example, B-39, B-48
  and OS, B-58
  OS determination, B-58
  paged virtual memory, B-55
  selection, B-46 to B-47
  virtual memory, B-44
Page Table Entry (PTE)
  AMD64 paged virtual memory, B-56
  IA-32 equivalent, B-52
  Intel Core i7, 120
  main memory block, B-44 to B-45
  paged virtual memory, B-56
  TLB, B-47
Page tables
  address translation, B-46 to B-47
  AMD64 paged virtual memory, B-55 to B-56
  descriptor tables as, B-52
  IA-32 segment descriptors, B-53
  main memory block, B-44 to B-45
  multiprocessor software development, 407–409
  multithreading, 224
  protection process, B-50
  segmented virtual memory, B-51
  virtual memory block identification, B-44
  virtual-to-physical address mapping, B-45
Paired single operations, DSP media extensions, E-11
Palt, definition, B-3
Papadopolous, Greg, L-74
Parallelism
  cache optimization, 79
  challenges, 349–351
  classes, 9–10
  computer design principles, 44–45
  dependence analysis, H-8
  DLP, see Data-level parallelism (DLP)
  Ethernet, F-78
  exploitation statically, H-2
  exposing with hardware support, H-23 to H-27
  global code scheduling, H-15 to H-23, H-16
  IA-64 instruction format, H-34 to H-35
  ILP, see Instruction-level parallelism (ILP)
  loop-level, 149–150, 215, 217–218, 315–322
  MIPS scoreboarding, C-77 to C-78
  multiprocessors, 345
  natural, 223, 344
  request-level, 4–5, 9, 345, 434
  RISC development, 2
  software pipelining, H-12 to H-15
  for speedup, 263
  superblock scheduling, H-21 to H-23, H-22
  task-level, 9
  TLP, see Thread-level parallelism (TLP)
  trace scheduling, H-19 to H-21, H-20
  vs. window size, 217
  WSCs vs. servers, 433–434
Parallel processors
  areas of debate, L-56 to L-58
  bus-based coherent multiprocessor history, L-59 to L-60
  cluster history, L-62 to L-64
  early computers, L-56
  large-scale multiprocessor history, L-60 to L-61
  recent advances and developments, L-58 to L-60
  scientific applications, I-33 to I-34
  SIMD computer history, L-55 to L-56
  synchronization and consistency models, L-64
  virtual memory history, L-64
Parallel programming
  computation communication, I-10 to I-12
  with large-scale multiprocessors, I-2
Parallel Thread Execution (PTX)
  basic GPU thread instructions, 299
  GPU conditional branching, 300–303
  GPUs vs. vector architectures, 308
  NVIDIA GPU ISA, 298–300
  NVIDIA GPU Memory structures, 305
Parallel Thread Execution (PTX) Instruction
  CUDA Thread, 300
  definition, 292, 309, 313
  GPU conditional branching, 302–303
  GPU terms, 308
  NVIDIA GPU ISA, 298, 300
Paravirtualization
  system call performance, 141
  Xen VM, 111
Parity
  dirty bits, D-61 to D-64
  fault detection, 58
  memory dependability, 104–105
  WSC memory, 473–474
PARSEC benchmarks
  Intel Core i7, 401–405
  SMT on superscalar processors, 230–232, 231
  speedup without SMT, 403–404
Partial disk failure, dirty bits, D-61 to D-64
Partial store order, relaxed consistency models, 395
Partitioned add operation, DSP media extensions, E-10
Partitioning
  Multimedia SIMD Extensions, 282
  virtual memory protection, B-50
  WSC memory hierarchy, 445
Pascal programs
  compiler types and classes, A-28
  integer division/remainder, J-12
Pattern, disk array deconstruction, D-51
Payload
  messages, F-6
  packet format, F-7
p bits, J-21 to J-23, J-25, J-36 to J-37
PC, see Program counter (PC)
PCI bus, historical background, L-81
PCIe, see PCI-Express (PCIe)
PCI-Express (PCIe), F-29, F-63
  storage area network history, F-102 to F-103
PCI-X, F-29
  storage area network history, F-102
PCI-X 2.0, F-63
PCMCIA slot, Sony PlayStation 2 Emotion Engine case study, E-15
PC-relative addressing mode, VAX, K-67
PDP-11, L-10, L-17 to L-19, L-56
PDU, see Power distribution unit (PDU)
Peak performance
  Cray X1E, G-24
  DAXPY on VMIPS, G-21
  DLP, 322
  fallacies, 57–58
  multiple lanes, 273
  multiprocessor scaled programs, 58
  Roofline model, 287
  vector architectures, 331
  VMIPS on DAXPY, G-17
  WSC operational costs, 434
Peer-to-peer
  internetworking, F-81 to F-82
  wireless networks, E-22
Pegasus, L-16
PennySort competition, D-66
Perfect Club benchmarks
  vector architecture programming, 281, 281–282
  vector processor history, G-28
Perfect processor, ILP hardware model, 214–215, 215
Perfect-shuffle exchange, interconnection network topology, F-30 to F-31
Performability, RAID reconstruction, D-55 to D-57
Performance, see also Peak performance
  advanced directory protocol case study, 420–426
  ARM Cortex-A8, 233–236, 234
  ARM Cortex-A8 memory, 115–117
  bandwidth vs. latency, 18–19
  benchmarks, 37–41
  branch penalty reduction, C-22
  branch schemes, C-25 to C-26
  cache basics, B-3 to B-6
  cache performance
    average memory access time, B-16 to B-20
    basic considerations, B-3 to B-6, B-16
    basic equations, B-22
    basic optimizations, B-40
    example calculation, B-16 to B-17
    out-of-order processors, B-20 to B-22
  compiler optimization impact, A-27
  cost-performance
    extensive pipelining, C-80 to C-81
    WSC Flash memory, 474–475
    WSC goals/requirements, 433
    WSC hardware inactivity, 474
    WSC processors, 472–473
  CUDA, 290–291
  desktop benchmarks, 38–40
  directory-based coherence case study, 418–420
  dirty bits, D-61 to D-64
  disk array deconstruction, D-51 to D-55
  disk deconstruction, D-48 to D-51
  DRAM, 100–102
  embedded computers, 9, E-13 to E-14
  Google server benchmarks, 439–441
  hardware fallacies, 56
  high-performance computing, 432, 435–436, B-10
  historical milestones, 20
  ILP exploitation, 201
  ILP for realizable processors, 216–218
  Intel Core i7, 239–241, 240, 401–405
  Intel Core i7 memory, 122–124
  interconnection networks
    bandwidth considerations, F-89
    multi-device networks, F-25 to F-29
    routing/arbitration/switching impact, F-52 to F-55
    two-device networks, F-12 to F-20
  Internet Archive Cluster, D-38 to D-40
  interprocessor communication, I-3 to I-6
  I/O devices, D-15 to D-16
  I/O subsystem design, D-59 to D-61
  I/O system design/evaluation, D-36
  ISA, 241–243
  Itanium 2, H-43
  large-scale multiprocessors
    scientific applications
      distributed-memory multiprocessors, I-26 to I-32, I-28 to I-30, I-32
      parallel processors, I-33 to I-34
      symmetric shared-memory multiprocessor, I-21 to I-26, I-23 to I-25
    synchronization, I-12 to I-16
  MapReduce, 438
  measurement, reporting, summarization, 36–37
  memory consistency models, 393
  memory hierarchy design, 73
  memory hierarchy and OS, B-58
  memory threads, GPUs, 332
  MIPS FP pipeline, C-60 to C-61
  MIPS M2000 vs. VAX 8700, K-82
  MIPS R4000 pipeline, C-67 to C-70, C-68
  multicore processors, 400–401, 401
  multiprocessing/multithreading, 398–400
  multiprocessors, measurement issues, 405–406
  multiprocessor software development, 408–409
  network topologies, F-40, F-40 to F-44
  observed, 57
  peak
    DLP, 322
    fallacies, 57–58
    multiple lanes, 273
    Roofline model, 287
    vector architectures, 331
    WSC operational costs, 434
  pipelines with stalls, C-12 to C-13
  pipelining basics, C-10 to C-11
  processors, historical growth, 2–3, 3
  quantitative measures, L-6 to L-7
  real-time, PMDs, 6
  real-world server considerations, 52–55
  results reporting, 41
  results summarization, 41–43, 43
  RISC classic pipeline, C-7
  server benchmarks, 40–41
  as server characteristic, 7
  single-chip multicore processor case study, 412–418
  single-thread, 399
    processor benchmarks, 243
  software development, 4
  software overhead issues, F-91 to F-92
  sorting case study, D-64 to D-67
  speculation cost, 211
  Sun T1 multithreading unicore, 227–229
  superlinear, 406
  switch microarchitecture pipelining, F-60 to F-61
  symmetric shared-memory multiprocessors, 366–378
    scientific workloads, I-21 to I-26, I-23
  system call virtualization/paravirtualization, 141
  transistors, scaling, 19–21
  vector, and memory bandwidth, 332
  vector add instruction, 272
  vector kernel implementation, 334–336
  vector processor, G-2 to G-7
    DAXPY on VMIPS, G-19 to G-21
    sparse matrices, G-12 to G-14
    start-up and multiple lanes, G-7 to G-9
  vector processors
    chaining, G-11 to G-12
    chaining/unchaining, G-12
  vector vs. scalar, 331–332
  VMIPS on Linpack, G-17 to G-19
  wormhole switching, F-92 to F-93
Permanent failure, commercial interconnection networks, F-66
Permanent faults, storage systems, D-11
Personal computers
  LANs, F-4
  networks, F-2
  PCIe, F-29
Personal mobile device (PMD)
  characteristics, 6
  as computer class, 5
  embedded computers, 8–9
  Flash memory, 18
  integrated circuit cost trends, 28
  ISA performance and efficiency prediction, 241–243
  memory hierarchy basics, 78
  memory hierarchy design, 72
  power and energy, 25
  processor comparison, 242
PetaBox GB2000, Internet Archive Cluster, D-37
Phase-ordering problem, compiler structure, A-26
Phits, see Physical transfer units (phits)
Physical addresses
  address translation, B-46
  AMD Opteron data cache, B-12 to B-13
  ARM Cortex-A8, 115
  directory-based cache coherence protocol basics, 382
  main memory block, B-44
  memory hierarchy, B-48 to B-49
  memory hierarchy basics, 77–78
  memory mapping, B-52
  paged virtual memory, B-55 to B-56
  page table-based mapping, B-45
  safe calls, B-54
  segmented virtual memory, B-51
  sharing/protection, B-52
  translation, B-36 to B-39
  virtual memory definition, B-42
Physical cache, definition, B-36 to B-37
Physical channels, F-47
Physical layer, definition, F-82
Physical memory
  centralized shared-memory multiprocessors, 347
  directory-based cache coherence, 354
  future GPU features, 332
  GPU conditional branching, 303
  main memory block, B-44
  memory hierarchy basics, B-41 to B-42
  multiprocessors, 345
  paged virtual memory, B-56
  processor comparison, 323
  segmented virtual memory, B-51
  unified, 333
  Virtual Machines, 110
Physical transfer units (phits), F-60
Physical volumes, D-34
PID, see Process-identifier (PID) tags
Pin-out bandwidth, topology, F-39
Pipeline bubble, stall as, C-13
Pipeline cycles per instruction
  basic equation, 148
  ILP, 149
  processor performance calculations, 218–219
  R4000 performance, C-68 to C-69
Pipelined circuit switching, F-50
Pipelined CPUs, early versions, L-26 to L-27
Pipeline delays
  ARM Cortex-A8, 235
  definition, 228
  fine-grained multithreading, 227
  instruction set complications, C-50
  multiple branch speculation, 211
  Sun T1 multithreading unicore performance, 227–228
Pipeline interlock
  data dependences, 151
  data hazards requiring stalls, C-20
  MIPS R4000, C-65
  MIPS vs. VMIPS, 268
Pipeline latches
  ALU, C-40
  definition, C-35
  R4000, C-60
  stopping/restarting execution, C-47
Pipeline organization
  dependences, 152
  MIPS, C-37
Pipeline registers
  branch hazard stall, C-42
  data hazards, C-57
  data hazard stalls, C-17 to C-20
  definition, C-35
  example, C-9
  MIPS, C-36 to C-39
  MIPS extension, C-53
  PC as, C-35
  pipelining performance issues, C-10
  RISC processor, C-8, C-10
Pipeline scheduling
  basic considerations, 161–162
  vs. dynamic scheduling, 168–169
  ILP exploitation, 197
  ILP exposure, 157–161
  microarchitectural techniques case study, 247–254
  MIPS R4000, C-64
Pipeline stall cycles
  branch scheme performance, C-25
  pipeline performance, C-12 to C-13
Pipelining
  branch cost reduction, C-26
  branch hazards, C-21 to C-26
  branch issues, C-39 to C-42
  branch penalty reduction, C-22 to C-25
  branch-prediction buffers, C-27 to C-30, C-29
  branch scheme performance, C-25 to C-26
  cache access, 82
  case studies, C-82 to C-88
  classic stages for RISC, C-6 to C-10
  compiler scheduling, L-31
  concept, C-2 to C-3
  cost-performance, C-80 to C-81
  data hazards, C-16 to C-21
  definition, C-2
  dynamically scheduled pipelines, C-70 to C-80
  example, C-8
  exception stopping/restarting, C-46 to C-47
  exception types and requirements, C-43 to C-46
  execution sequences, C-80
  floating-point addition speedup, J-25
  graphics pipeline history, L-51
  hazard classes, C-11
  hazard detection, C-38
  implementation difficulties, C-43 to C-49
  independent FP operations, C-54
  instruction set complications, C-49 to C-51
  interconnection networks, F-12
  latencies, C-87
  MIPS, C-34 to C-36
  MIPS control, C-36 to C-39
  MIPS exceptions, C-48, C-48 to C-49
  MIPS FP performance, C-60 to C-61
  MIPS multicycle operations
    basic considerations, C-51 to C-54
    hazards and forwarding, C-54 to C-58
    precise exceptions, C-58 to C-60
  MIPS R4000
    FP pipeline, C-65 to C-67, C-67
    overview, C-61 to C-65
    pipeline performance, C-67 to C-70
    pipeline structure, C-62 to C-63
  multiple outstanding FP operations, C-54
  performance issues, C-10 to C-11
  performance with stalls, C-12 to C-13
  predicted-not-taken scheme, C-22
  RISC instruction set, C-4 to C-5, C-70
  simple implementation, C-30 to C-43, C-34
  simple RISC, C-5 to C-6, C-7
  static branch prediction, C-26 to C-27
  structural hazards, C-13 to C-16, C-15
  switch microarchitecture, F-60 to F-61
  unoptimized code, C-81
Pipe segment, definition, C-3
Pipe stage
  branch prediction, C-28
  data hazards, C-16
  definition, C-3
  dynamic scheduling, C-71
  FP pipeline, C-66
  integrated instruction fetch units, 207
  MIPS, C-34 to C-35, C-37, C-49
  MIPS extension, C-53
  MIPS R4000, C-62
  out-of-order execution, 170
  pipeline stalls, C-13
  pipelining performance issues, C-10
  register additions, C-35
  RISC processor, C-7
  stopping/restarting execution, C-46
  WAW, 153
pjbb2005 benchmark
  Intel Core i7, 402
  SMT on superscalar processors, 230–232, 231
PLA, early computer arithmetic, J-65
PMD, see Personal mobile device (PMD)
Points-to analysis, basic approach, H-9
Point-to-point links
  bus replacement, D-34
  Ethernet, F-79
  storage systems, D-34
  switched-media networks, F-24
Point-to-point multiprocessor, example, 413
Point-to-point networks
  directory-based coherence, 418
  directory protocol, 421–422
  SMP limitations, 363–364
Poison bits, compiler-based speculation, H-28, H-30
Poisson, Siméon, D-28
Poisson distribution
  basic equation, D-28
  random variables, D-26 to D-34
Polycyclic scheduling, L-30
Portable computers
  interconnection networks, F-85
  processor comparison, 242
Port number, network interfaces, F-7
Position independence, control flow instruction addressing modes, A-17
Power
  distribution for servers, 490
  distribution overview, 447
  and DLP, 322
  first-level caches, 79–80
  Google server benchmarks, 439–441
  Google WSC, 465–468
  PMDs, 6
  real-world server considerations, 52–55
  WSC infrastructure, 447
  WSC power modes, 472
  WSC resource allocation case study, 478–479
  WSC TCO case study, 476–478
Power consumption, see also Energy efficiency
  cache optimization, 96
  cache size and associativity, 81
  case study, 63–64
  computer components, 63
  DDR3 SDRAM, 103
  disks, D-5
  embedded benchmarks, E-13
  GPUs vs. vector architectures, 311
  interconnection networks, F-85
  ISA performance and efficiency prediction, 242–243
  microprocessor, 23–26
  SDRAMs, 102
  SMT on superscalar processors, 230–231
  speculation, 210–211
  system trends, 21–23
  TI TMS320C55 DSP, E-8
  WSCs, 450
Power distribution unit (PDU), WSC infrastructure, 447
Power failure
  exceptions, C-43 to C-44, C-46
  utilities, 435
  WSC storage, 442
Power gating, transistors, 26
Power modes, WSCs, 472
PowerPC
  addressing modes, K-5
  AltiVec multimedia instruction compiler support, A-31
  ALU, K-5
  arithmetic/logical instructions, K-11
  branches, K-21
  cluster history, L-63
  conditional branches, K-17
  conditional instructions, H-27
  condition codes, K-10 to K-11
  consistency model, 395
  constant extension, K-9
  conventions, K-13
  data transfer instructions, K-10
  features, K-44
  FP instructions, K-23
  IBM Blue Gene/L, I-41 to I-42
  multimedia compiler support, A-31, K-17
  precise exceptions, C-59
  RISC architecture, A-2
  RISC code size, A-23
  as RISC systems, K-4
  unique instructions, K-32 to K-33
PowerPC ActiveC
  characteristics, K-18
  multimedia support, K-19
PowerPC AltiVec, multimedia support, E-11
Power-performance
  low-power servers, 477
  servers, 54
Power Supply Units (PSUs), efficiency ratings, 462
Power utilization effectiveness (PUE)
  datacenter comparison, 451
  Google WSC, 468
  Google WSC containers, 464–465
  WSC, 450–452
  WSCs vs. datacenters, 456
  WSC server energy efficiency, 462
Precise exceptions
  definition, C-47
  dynamic scheduling, 170
  hardware-based speculation, 187–188, 221
  instruction set complications, C-49
  maintaining, C-58 to C-60
  MIPS exceptions, C-48
Precisions, floating-point arithmetic, J-33 to J-34
Predicated instructions
  exposing parallelism, H-23 to H-27
  IA-64, H-38 to H-40
Predicate Registers
  definition, 309
  GPU conditional branching, 300–301
  IA-64, H-34
  NVIDIA GPU ISA, 298
  vectors vs. GPUs, 311
Predication, TI TMS320C6x DSP, E-10
Predicted-not-taken scheme
    branch penalty reduction, C-22, C-22 to C-23
    MIPS R4000 pipeline, C-64
Predictions, see also Mispredictions
    address aliasing, 213–214, 216
    branch
        correlation, 162–164
        cost reduction, 162–167, C-26
        dynamic, C-27 to C-30
        ideal processor, 214
        ILP exploitation, 201
        instruction fetch bandwidth, 205
        integrated instruction fetch units, 207
        Intel Core i7, 166–167, 239–241
        static, C-26 to C-27
    branch-prediction buffers, C-27 to C-30, C-29
    jump prediction, 214
    PMDs, 6
    return address buffer, 207
    2-bit scheme, C-28
    value prediction, 202, 212–213
Prefetching
    integrated instruction fetch units, 208
    Intel Core i7, 122, 123–124
    Itanium 2, H-42
    MIPS core extensions, K-20
    NVIDIA GPU Memory structures, 305
    parallel processing challenges, 351
Prefix, Intel 80x86 integer operations, K-51
Presentation layer, definition, F-82
Present bit, IA-32 descriptor table, B-52
Price vs. cost, 32–33
Price-performance ratio
    cost trends, 28
    Dell PowerEdge servers, 53
    desktop computers, 6
    processor comparisons, 55
    WSCs, 8, 441
I-60 ■ Index
Primitives
    architect-compiler writer relationship, A-30
    basic hardware types, 387–389
    compiler writing-architecture relationship, A-30
    CUDA Thread, 289
    dependent computation elimination, 321
    GPU vs. MIMD, 329
    locks via coherence, 391
    operand types and sizes, A-14 to A-15
    PA-RISC instructions, K-34 to K-35
    synchronization, 394, L-64
Principle of locality
    bidirectional MINs, F-33 to F-34
    cache optimization, B-26
    cache performance, B-3 to B-4
    coining of term, L-11
    commercial workload, 373
    computer design principles, 45
    definition, 45, B-2
    lock accesses, 390
    LRU, B-9
    memory accesses, 332, B-46
    memory hierarchy design, 72
    multilevel application, B-2
    multiprogramming workload, 375
    scientific workloads on symmetric shared-memory multiprocessors, I-25
    stride, 278
    WSC bottleneck, 461
    WSC efficiency, 450
Private data
    cache protocols, 359
    centralized shared-memory multiprocessors, 351–352
Private Memory
    definition, 292, 314
    NVIDIA GPU Memory structures, 304
Private variables, NVIDIA GPU Memory, 304
Procedure calls
    compiler structure, A-25 to A-26
    control flow instructions, A-17, A-19 to A-21
    dependence analysis, 321
    high-level instruction set, A-42 to A-43
    IA-64 register model, H-33
    invocation options, A-19
    ISAs, 14
    MIPS control flow instructions, A-38
    return address predictors, 206
    VAX, B-73 to B-74, K-71 to K-72
    VAX vs. MIPS, K-75
    VAX swap, B-74 to B-75
Process concept
    definition, 106, B-49
    protection schemes, B-50
Process-identifier (PID) tags, cache addressing, B-37 to B-38
Process IDs, Virtual Machines, 110
Processor consistency
    latency hiding with speculation, 396–397
    relaxed consistency models, 395
Processor cycles
    cache performance, B-4
    definition, C-3
    memory banks, 277
    multithreading, 224
Processor-dependent optimizations
    compilers, A-26
    performance impact, A-27
    types, A-28
Processor-intensive benchmarks, desktop performance, 38
Processor performance
    and average memory access time, B-17 to B-20
    vs. cache performance, B-16
    clock rate trends, 24
    desktop benchmarks, 38, 40
    historical trends, 3, 3–4
    multiprocessors, 347
    uniprocessors, 344
Processor performance equation, computer design principles, 48–52
Processor speed
    and clock rate, 244
    and CPI, 244
    snooping cache coherence, 364
Process switch
    definition, 106, B-49
    miss rate vs. virtual addressing, B-37
    multithreading, 224
    PID, B-37
    virtual memory-based protection, B-49 to B-50
Producer-server model, response time and throughput, D-16
Productivity
    CUDA, 290–291
    NVIDIA programmers, 289
    software development, 4
    virtual memory and programming, B-41
    WSC, 450
Profile-based predictor, misprediction rate, C-27
Program counter (PC)
    addressing modes, A-10
    ARM Cortex-A8, 234
    branch hazards, C-21
    branch-target buffers, 203, 203–204, 206
    control flow instruction addressing modes, A-17
    dynamic branch prediction, C-27 to C-28
    exception stopping/restarting, C-46 to C-47
    GPU conditional branching, 303
    Intel Core i7, 120
    M32R instructions, K-39
    MIPS control flow instructions, A-38
    multithreading, 223–224
    pipeline branch issues, C-39 to C-41
    pipe stages, C-35
    precise exceptions, C-59 to C-60
    RISC classic pipeline, C-8
    RISC instruction set, C-5
    simple MIPS implementation, C-31 to C-33
    TLP, 344
    virtual memory protection, 106
Program counter-relative addressing
    control flow instructions, A-17 to A-18, A-21
    definition, A-10
    MIPS instruction format, A-35
Programming models
    CUDA, 300, 310, 315
    GPUs, 288–291
    latency in consistency models, 397
    memory consistency, 393
    Multimedia SIMD architectures, 285
    vector architectures, 280–282
    WSCs, 436–441
Programming primitive, CUDA Thread, 289
Program order
    cache coherence, 353
    control dependences, 154–155
    data hazards, 153
    dynamic scheduling, 168–169, 174
    hardware-based speculation, 192
    ILP exploitation, 200
    name dependences, 152–153
    Tomasulo's approach, 182
Protection schemes
    control dependence, 155
    development, L-9 to L-12
    and ISA, 112
    network interfaces, F-7
    network user access, F-86 to F-87
    Pentium vs. Opteron, B-57
    processes, B-50
    safe calls, B-54
    segmented virtual memory example, B-51 to B-54
    Virtual Machines, 107–108
    virtual memory, 105–107, B-41
Protocol deadlock, routing, F-44
Protocol stack
    example, F-83
    internetworking, F-83
Pseudo-least recently used (LRU)
    block replacement, B-9 to B-10
    Intel Core i7, 118
PSUs, see Power Supply Units (PSUs)
PTE, see Page Table Entry (PTE)
PTX, see Parallel Thread Execution (PTX)
PUE, see Power utilization effectiveness (PUE)
Python language, hardware impact on software development, 4
Q
QCDOD, L-64
QoS, see Quality of service (QoS)
QsNetII, F-63, F-76
Quadrics SAN, F-67, F-100 to F-101
Quality of service (QoS)
    dependability benchmarks, D-21
    WAN history, F-98
Quantitative performance measures, development, L-6 to L-7
Queue
    definition, D-24
    waiting time calculations, D-28 to D-29
Queue discipline, definition, D-26
Queuing locks, large-scale multiprocessor synchronization, I-18 to I-21
Queuing theory
    basic assumptions, D-30
    Little's law, D-24 to D-25
    M/M/1 model, D-31 to D-33, D-32
    overview, D-23 to D-26
    RAID performance prediction, D-57 to D-59
    single-server model, D-25
Quickpath (Intel Xeon), cache coherence, 361
R
Race-to-halt, definition, 26
Rack units (U), WSC architecture, 441
Radio frequency amplifier, radio receiver, E-23
Radio receiver, components, E-23
Radio waves, wireless networks, E-21
Radix-2 multiplication/division, J-4 to J-7, J-6, J-55
Radix-4 multiplication/division, J-48 to J-49, J-49, J-56 to J-57, J-60 to J-61
Radix-8 multiplication, J-49
RAID (Redundant array of inexpensive disks)
    data replication, 439
    dependability benchmarks, D-21, D-22
    disk array deconstruction case study, D-51, D-55
    disk deconstruction case study, D-48
    hardware dependability, D-15
    historical background, L-79 to L-80
    I/O subsystem design, D-59 to D-61
    logical units, D-35
    memory dependability, 104
    NetApp FAS6000 filer, D-41 to D-42
    overview, D-6 to D-8, D-7
    performance prediction, D-57 to D-59
    reconstruction case study, D-55 to D-57
    row-diagonal parity, D-9
    WSC storage, 442
RAID 0, definition, D-6
RAID 1
    definition, D-6
    historical background, L-79
RAID 2
    definition, D-6
    historical background, L-79
RAID 3
    definition, D-7
    historical background, L-79 to L-80
RAID 4
    definition, D-7
    historical background, L-79 to L-80
RAID 5
    definition, D-8
    historical background, L-79 to L-80
RAID 6
    characteristics, D-8 to D-9
    hardware dependability, D-15
RAID 10, D-8
RAM (random access memory), switch microarchitecture, F-57
RAMAC-350 (Random Access Method of Accounting Control), L-77 to L-78, L-80 to L-81
Random Access Method of Accounting Control, L-77 to L-78
Random replacement
    cache misses, B-10
    definition, B-9
Random variables, distribution, D-26 to D-34
RAR, see Read after read (RAR)
RAS, see Row access strobe (RAS)
RAW, see Read after write (RAW)
Ray casting (RC)
    GPU comparisons, 329
    throughput computing kernel, 327
RDMA, see Remote direct memory access (RDMA)
Read after read (RAR), absence of data hazard, 154
Read after write (RAW)
    data hazards, 153
    dynamic scheduling with Tomasulo's algorithm, 170–171
    first vector computers, L-45
    hazards, stalls, C-55
    hazards and forwarding, C-55 to C-57
    instruction set complications, C-50
    microarchitectural techniques case study, 253
    MIPS FP pipeline performance, C-60 to C-61
    MIPS pipeline control, C-37 to C-38
    MIPS pipeline FP operations, C-53
    MIPS scoreboarding, C-74
    ROB, 192
    TI TMS320C55 DSP, E-8
    Tomasulo's algorithm, 182
    unoptimized code, C-81
Read miss
    AMD Opteron data cache, B-14
    cache coherence, 357, 358, 359–361
    coherence extensions, 362
    directory-based cache coherence protocol example, 380, 382–386
    memory hierarchy basics, 76–77
    memory stall clock cycles, B-4
    miss penalty reduction, B-35 to B-36
    Opteron data cache, B-14
    vs. write-through, B-11
Read operands stage
    ID pipe stage, 170
    MIPS scoreboarding, C-74 to C-75
    out-of-order execution, C-71
Realizable processors, ILP limitations, 216–220
Real memory, Virtual Machines, 110
Real-time constraints, definition, E-2
Real-time performance, PMDs, 6
Real-time performance requirement, definition, E-3
Real-time processing, embedded systems, E-3 to E-5
Rearrangeably nonblocking, centralized switched networks, F-32 to F-33
Receiving overhead
    communication latency, I-3 to I-4
    interconnection networks, F-88
    OCNs vs. SANs, F-27
    time of flight, F-14
RECN, see Regional explicit congestion notification (RECN)
Reconfiguration deadlock, routing, F-44
Reconstruction, RAID, D-55 to D-57
Recovery time, vector processor, G-8
Recurrences
    basic approach, H-11
    loop-carried dependences, H-5
Red-black Gauss-Seidel, Ocean application, I-9 to I-10
Reduced Instruction Set Computer, see RISC (Reduced Instruction Set Computer)
Reductions
    commercial workloads, 371
    cost trends, 28
    loop-level parallelism dependences, 321
    multiprogramming workloads, 377
    T1 multithreading unicore performance, 227
    WSCs, 438
Redundancy
    Amdahl's law, 48
    chip fabrication cost case study, 61–62
    computer system power consumption case study, 63–64
    index checks, B-8
    integrated circuit cost, 32
    integrated circuit failure, 35
    simple MIPS implementation, C-33
    WSC, 433, 435, 439
    WSC bottleneck, 461
    WSC storage, 442
Redundant array of inexpensive disks, see RAID (Redundant array of inexpensive disks)
Redundant multiplication, integers, J-48
Redundant power supplies, example calculations, 35
Reference bit
    memory hierarchy, B-52
    virtual memory block replacement, B-45
Regional explicit congestion notification (RECN), congestion management, F-66
Register addressing mode
    MIPS, 12
    VAX, K-67
Register allocation
    compilers, 396, A-26 to A-29
    VAX sort, K-76
    VAX swap, K-72
Register deferred addressing, VAX, K-67
Register definition, 314
Register fetch (RF)
    MIPS data path, C-34
    MIPS R4000, C-63
    pipeline branches, C-41
    simple MIPS implementation, C-31
    simple RISC implementation, C-5 to C-6
Register file
    data hazards, C-16, C-18, C-20
    dynamic scheduling, 172, 173, 175, 177–178
    Fermi GPU, 306
    field, 176
    hardware-based speculation, 184
    longer latency pipelines, C-55 to C-57
    MIPS exceptions, C-49
    MIPS implementation, C-31, C-33
    MIPS R4000, C-64
    MIPS scoreboarding, C-75
    Multimedia SIMD Extensions, 282, 285
    multiple lanes, 272, 273
    multithreading, 224
    OCNs, F-3
    precise exceptions, C-59
    RISC classic pipeline, C-7 to C-8
    RISC instruction set, C-5 to C-6
    scoreboarding, C-73, C-75
    speculation support, 208
    structural hazards, C-13
    Tomasulo's algorithm, 180, 182
    vector architecture, 264
    VMIPS, 265, 308
Register indirect addressing mode, Intel 80x86, K-47
Register management, software-pipelined loops, H-14
Register-memory instruction set architecture
    architect-compiler writer relationship, A-30
    dynamic scheduling, 171
    Intel 80x86, K-52
    ISA classification, 11, A-3 to A-6
Register prefetch, cache optimization, 92
Register renaming
    dynamic scheduling, 169–172
    hardware vs. software speculation, 222
    ideal processor, 214
    ILP hardware model, 214
    ILP limitations, 213, 216–217
    ILP for realizable processors, 216
    instruction delivery and speculation, 202
    microarchitectural techniques case study, 247–254
    name dependences, 153
    vs. ROB, 208–210
    ROB instruction, 186
    sample code, 250
    SMT, 225
    speculation, 208–210
    superscalar code, 251
    Tomasulo's algorithm, 183
    WAW/WAR hazards, 220
Register result status, MIPS scoreboard, C-76
Registers
    DSP examples, E-6
    IA-64, H-33 to H-34
    instructions and hazards, C-17
    Intel 80x86, K-47 to K-49, K-48
    network interface functions, F-7
    pipe stages, C-35
    PowerPC, K-10 to K-11
    VAX swap, B-74 to B-75
Register stack engine, IA-64, H-34
Register tag example, 177
Register windows, SPARC instructions, K-29 to K-30
Regularity
    bidirectional MINs, F-33 to F-34
    compiler writing-architecture relationship, A-30
Relative speedup, multiprocessor performance, 406
Relaxed consistency models
    basic considerations, 394–395
    compiler optimization, 396
    WSC storage software, 439
Release consistency, relaxed consistency models, 395
Reliability
    Amdahl's law calculations, 56
    commercial interconnection networks, F-66
    example calculations, 48
    I/O subsystem design, D-59 to D-61
    modules, SLAs, 34
    MTTF, 57
    redundant power supplies, 34–35
    storage systems, D-44
    transistor scaling, 21
Relocation, virtual memory, B-42
Remainder, floating point, J-31 to J-32
Remington-Rand, L-5
Remote direct memory access (RDMA), InfiniBand, F-76
Remote node, directory-based cache coherence protocol basics, 381–382
Reorder buffer (ROB)
    compiler-based speculation, H-31
    dependent instructions, 199
    dynamic scheduling, 175
    FP unit with Tomasulo's algorithm, 185
    hardware-based speculation, 184–192
    ILP exploitation, 199–200
    ILP limitations, 216
    Intel Core i7, 238
    vs. register renaming, 208–210
Repeat interval, MIPS pipeline FP operations, C-52 to C-53
Replication
    cache coherent multiprocessors, 354
    centralized shared-memory architectures, 351–352
    coherence enforcement, 354
    R4000 performance, C-70
    RAID storage servers, 439
    TLP, 344
    virtual memory, B-48 to B-49
    WSCs, 438
Reply, messages, F-6
Reproducibility, performance results reporting, 41
Request
    messages, F-6
    switch microarchitecture, F-58
Requested protection level, segmented virtual memory, B-54
Request-level parallelism (RLP)
    basic characteristics, 345
    definition, 9
    from ILP, 4–5
    MIMD, 10
    multicore processors, 400
    multiprocessors, 345
    parallelism advantages, 44
    server benchmarks, 40
    WSCs, 434, 436
Request phase, arbitration, F-49
Request-reply deadlock, routing, F-44
Reservation stations
    dependent instructions, 199–200
    dynamic scheduling, 178
    example, 177
    fields, 176
    hardware-based speculation, 184, 186, 189–191
    ILP exploitation, 197, 199–200
    Intel Core i7, 238–240
    loop iteration example, 181
    microarchitectural techniques case study, 253–254
    speculation, 208–209
    Tomasulo's algorithm, 172, 173, 174–176, 179, 180, 180–182
Resource allocation
    computer design principles, 45
    WSC case study, 478–479
Resource sparing, commercial interconnection networks, F-66
Response time, see also Latency
    I/O benchmarks, D-18
    performance considerations, 36
    performance trends, 18–19
    producer-server model, D-16
    server benchmarks, 40–41
    storage systems, D-16 to D-18
    vs. throughput, D-17
    user experience, 4
    WSCs, 450
Responsiveness
    PMDs, 6
    as server characteristic, 7
Restartable pipeline
    definition, C-45
    exceptions, C-46 to C-47
Restorations, SLA states, 34
Restoring division, J-5, J-6
Resume events
    control dependences, 156
    exceptions, C-45 to C-46
    hardware-based speculation, 188
Return address predictors
    instruction fetch bandwidth, 206–207
    prediction accuracy, 207
Returns
    Amdahl's law, 47
    cache coherence, 352–353
    compiler technology and architectural decisions, A-28
    control flow instructions, 14, A-17, A-21
    hardware primitives, 388
    Intel 80x86 integer operations, K-51
    invocation options, A-19
    procedure invocation options, A-19
    return address predictors, 206
Reverse path, cell phones, E-24
RF, see Register fetch (RF)
Rings
    characteristics, F-73
    NEWS communication, F-42
    OCN history, F-104
    process protection, B-50
    topology, F-35 to F-36, F-36
Ripple-carry adder, J-3, J-3, J-42
    chip comparison, J-60
Ripple-carry addition, J-2 to J-3
RISC (Reduced Instruction Set Computer)
    addressing modes, K-5 to K-6
    Alpha-unique instructions, K-27 to K-29
    architecture flaws vs. success, A-45
    ARM-unique instructions, K-36 to K-37
    basic concept, C-4 to C-5
    basic systems, K-3 to K-5
    cache performance, B-6
    classic pipeline stages, C-6 to C-10
    code size, A-23 to A-24
    compiler history, L-31
    desktop/server systems, K-4
        instruction formats, K-7
        multimedia extensions, K-16 to K-19
    desktop systems
        addressing modes, K-5
        arithmetic/logical instructions, K-11, K-22
        conditional branches, K-17
        constant extension, K-9
        control instructions, K-12
        conventions, K-13
        data transfer instructions, K-10, K-21
        features, K-44
        FP instructions, K-13, K-23
        multimedia extensions, K-18
    development, 2
    early pipelined CPUs, L-26
    embedded systems, K-4
        addressing modes, K-6
        arithmetic/logical instructions, K-15, K-24
        conditional branches, K-17
        constant extension, K-9
        control instructions, K-16
        conventions, K-16
        data transfers, K-14, K-23
        DSP extensions, K-19
        instruction formats, K-8
        multiply-accumulate, K-20
    historical background, L-19 to L-21
    instruction formats, K-5 to K-6
    instruction set lineage, K-43
    ISA performance and efficiency prediction, 241
    M32R-unique instructions, K-39 to K-40
    MIPS16-unique instructions, K-40 to K-42
    MIPS64-unique instructions, K-24 to K-27
    MIPS core common extensions, K-19 to K-24
    MIPS M2000 vs. VAX 8700, L-21
    Multimedia SIMD Extensions history, L-49 to L-50
    operations, 12
    PA-RISC-unique, K-33 to K-35
    pipelining efficiency, C-70
    PowerPC-unique instructions, K-32 to K-33
    Sanyo VPC-SX500 digital camera, E-19
    simple implementation, C-5 to C-6
    simple pipeline, C-7
    SPARC-unique instructions, K-29 to K-32
    Sun T1 multithreading, 226–227
    SuperH-unique instructions, K-38 to K-39
    Thumb-unique instructions, K-37 to K-38
    vector processor history, G-26
    Virtual Machines ISA support, 109
    Virtual Machines and virtual memory and I/O, 110
RISC-I, L-19 to L-20
RISC-II, L-19 to L-20
RLP, see Request-level parallelism (RLP)
ROB, see Reorder buffer (ROB)
Roofline model
    GPU performance, 326
    memory bandwidth, 332
    Multimedia SIMD Extensions, 285–288, 287
Round digit, J-18
Rounding modes, J-14, J-17 to J-19, J-18, J-20
    FP precisions, J-34
    fused multiply-add, J-33
Round-robin (RR)
    arbitration, F-49
    IBM 360, K-85 to K-86
    InfiniBand, F-74
Routers
    BARRNet, F-80
    Ethernet, F-79
Routing algorithm
    commercial interconnection networks, F-56
    fault tolerance, F-67
    implementation, F-57
    Intel SCCC, F-70
    interconnection networks, F-21 to F-22, F-27, F-44 to F-48
    mesh network, F-46
    network impact, F-52 to F-55
    OCN history, F-104
    and overhead, F-93 to F-94
    SAN characteristics, F-76
    switched-media networks, F-24
    switch microarchitecture pipelining, F-61
    system area network history, F-100
Row access strobe (RAS), DRAM, 98
Row-diagonal parity
    example, D-9
    RAID, D-9
Row major order, blocking, 89
RR, see Round-robin (RR)
RS format instructions, IBM 360, K-87
Ruby on Rails, hardware impact on software development, 4
RX format instructions, IBM 360, K-86 to K-87
S
S3, see Amazon Simple Storage Service (S3)
SaaS, see Software as a Service (SaaS)
Sandy Bridge dies, wafer example, 31
SANs, see System/storage area networks (SANs)
Sanyo digital cameras, SOC, E-20
Sanyo VPC-SX500 digital camera, embedded system case study, E-19
SAS, see Serial Attach SCSI (SAS) drive
SASI, L-81
SATA (Serial Advanced Technology Attachment) disks
    Google WSC servers, 469
    NetApp FAS6000 filer, D-42
    power consumption, D-5
    RAID 6, D-8
    vs. SAS drives, D-5
    storage area network history, F-103
Saturating arithmetic, DSP media extensions, E-11
Saturating operations, definition, K-18 to K-19
SAXPY, GPU raw/relative performance, 328
Scalability
    cloud computing, 460
    coherence issues, 378–379
    Fermi GPU, 295
    Java benchmarks, 402
    multicore processors, 400
    multiprocessing, 344, 395
    parallelism, 44
    as server characteristic, 7
    transistor performance and wires, 19–21
    WSCs, 8, 438
    WSCs vs. servers, 434
Scalable GPUs, historical background, L-50 to L-51
Scalar expansion, loop-level parallelism dependences, 321
Scalar Processors, see also Superscalar processors
    definition, 292, 309
    early pipelined CPUs, L-26 to L-27
    lane considerations, 273
    Multimedia SIMD/GPU comparisons, 312
    NVIDIA GPU, 291
    prefetch units, 277
    vs. vector, 311, G-19
    vector performance, 331–332
Scalar registers
    Cray X1, G-21 to G-22
    GPUs vs. vector architectures, 311
    loop-level parallelism dependences, 321–322
    Multimedia SIMD vs. GPUs, 312
    sample renaming code, 251
    vector vs. GPU, 311
    vs. vector performance, 331–332
    VMIPS, 265–266
Scaled addressing, VAX, K-67
Scaled speedup, Amdahl's law and parallel computers, 406–407
Scaling
    Amdahl's law and parallel computers, 406–407
    cloud computing, 456
    computation-to-communication ratios, I-11
    DVFS, 25, 52, 467
    dynamic voltage-frequency, 25, 52, 467
    Intel Core i7, 404
    interconnection network speed, F-88
    multicore vs. single-core, 402
    processor performance trends, 3
    scientific applications on parallel processing, I-34
    shared- vs. switched-media networks, F-25
    transistor performance and wires, 19–21
    VMIPS, 267
Scan Line Interleave (SLI), scalable GPUs, L-51
SCCC, see Intel Single-Chip Cloud Computing (SCCC)
Schorr, Herb, L-28
Scientific applications
    Barnes, I-8 to I-9
    basic characteristics, I-6 to I-7
    cluster history, L-62
    distributed-memory multiprocessors, I-26 to I-32, I-28 to I-32
    FFT kernel, I-7
    LU kernel, I-8
    Ocean, I-9 to I-10
    parallel processors, I-33 to I-34
    parallel program computation/communication, I-10 to I-12, I-11
    parallel programming, I-2
    symmetric shared-memory multiprocessors, I-21 to I-26, I-23 to I-25
Scoreboarding
    ARM Cortex-A8, 233, 234
    components, C-76
    definition, 170
    dynamic scheduling, 171, 175
    and dynamic scheduling, C-71 to C-80
    example calculations, C-77
    MIPS structure, C-73
    NVIDIA GPU, 296
    results tables, C-78 to C-79
    SIMD thread scheduler, 296
Scripting languages, software development impact, 4
SCSI (Small Computer System Interface)
    Berkeley's Tertiary Disk project, D-12
    dependability benchmarks, D-21
    disk storage, D-4
    historical background, L-80 to L-81
    I/O subsystem design, D-59
    RAID reconstruction, D-56
    storage area network history, F-102
SDRAM, see Synchronous dynamic random-access memory (SDRAM)
SDRWAVE, J-62
Second-level caches, see also L2 caches
    ARM Cortex-A8, 114
    ILP, 245
    Intel Core i7, 121
    interconnection network, F-87
    Itanium 2, H-41
    memory hierarchy, B-48 to B-49
    miss penalty calculations, B-33 to B-34
    miss penalty reduction, B-30 to B-35
    miss rate calculations, B-31 to B-35
    and relative execution time, B-34
    speculation, 210
    SRAM, 99
Secure Virtual Machine (SVM), 129
Seek distance
    storage disks, D-46
    system comparison, D-47
Seek time, storage disks, D-46
Segment basics
    Intel 80x86, K-50
    vs. page, B-43
    virtual memory definition, B-42 to B-43
Segment descriptor, IA-32 processor, B-52, B-53
Segmented virtual memory
    bounds checking, B-52
    Intel Pentium protection, B-51 to B-54
    memory mapping, B-52
    vs. paged, B-43
    safe calls, B-54
    sharing and protection, B-52 to B-53
Self-correction, Newton's algorithm, J-28 to J-29
Self-draining pipelines, L-29
Self-routing, MINs, F-48
Semantic clash, high-level instruction set, A-41
Semantic gap, high-level instruction set, A-39
Semiconductors
    DRAM technology, 17
    Flash memory, 18
    GPU vs. MIMD, 325
    manufacturing, 3–4
Sending overhead
    communication latency, I-3 to I-4
    OCNs vs. SANs, F-27
    time of flight, F-14
Sense-reversing barrier
    code example, I-15, I-21
    large-scale multiprocessor synchronization, I-14
Sequence of SIMD Lane Operations, definition, 292, 313
Sequence number, packet header, F-8
Sequential consistency
    latency hiding with speculation, 396–397
    programmer's viewpoint, 394
    relaxed consistency models, 394–395
    requirements and implementation, 392–393
Sequential interleaving, multibanked caches, 86, 86
Sequent Symmetry, L-59
Serial Advanced Technology Attachment disks, see SATA (Serial Advanced Technology Attachment) disks
Serial Attach SCSI (SAS) drive
    historical background, L-81
    power consumption, D-5
    vs. SATA drives, D-5
Serialization
    barrier synchronization, I-16
    coherence enforcement, 354
    directory-based cache coherence, 382
    DSM multiprocessor cache coherence, I-37
    hardware primitives, 387
    multiprocessor cache coherency, 353
    page tables, 408
    snooping coherence protocols, 356
    write invalidate protocol implementation, 356
Serpentine recording, L-77
Serve-longest-queue (SLQ) scheme, arbitration, F-49
ServerNet interconnection network, fault tolerance, F-66 to F-67
Servers, see also Warehouse-scale computers (WSCs)
    as computer class, 5
    cost calculations, 454, 454–455
    definition, D-24
    energy savings, 25
    Google WSC, 440, 467, 468–469
    GPU features, 324
    memory hierarchy design, 72
    vs. mobile GPUs, 323–330
    multiprocessor importance, 344
    outage/anomaly statistics, 435
    performance benchmarks, 40–41
    power calculations, 463
    power distribution example, 490
    power-performance benchmarks, 54, 439–441
    power-performance modes, 477
    real-world examples, 52–55
    RISC systems
        addressing modes and instruction formats, K-5 to K-6
        examples, K-3, K-4
        instruction formats, K-7
        multimedia extensions, K-16 to K-19
    single-server model, D-25
    system characteristics, E-4
    workload demands, 439
    WSC vs. datacenters, 455–456
    WSC data transfer, 446
    WSC energy efficiency, 462–464
    vs. WSC facility costs, 472
    WSC memory hierarchy, 444
    WSC resource allocation case study, 478–479
    vs. WSCs, 432–434
    WSC TCO case study, 476–478
Server side Java operations per second (ssj_ops)
    example calculations, 439
    power-performance, 54
    real-world considerations, 52–55
Server utilization
    calculation, D-28 to D-29
    queuing theory, D-25
Service accomplishment, SLAs, 34
Service Health Dashboard, AWS, 457
Service interruption, SLAs, 34
Service level agreements (SLAs)
    Amazon Web Services, 457
    and dependability, 33
    WSC efficiency, 452
Service level objectives (SLOs)
    and dependability, 33
    WSC efficiency, 452
Session layer, definition, F-82
Set associativity
    and access time, 77
    address parts, B-9
    AMD Opteron data cache, B-12 to B-14
    ARM Cortex-A8, 114
    block placement, B-7 to B-8
    cache block, B-7
    cache misses, 83–84, B-10
    cache optimization, 79–80, B-33 to B-35, B-38 to B-40
    commercial workload, 371
    energy consumption, 81
    memory access times, 77
    memory hierarchy basics, 74, 76
    nonblocking cache, 84
    performance equations, B-22
    pipelined cache access, 82
    way prediction, 81
Set basics
    block replacement, B-9 to B-10
    definition, B-7
Set-on-less-than instructions (SLT)
    MIPS16, K-14 to K-15
    MIPS conditional branches, K-11 to K-12
Settle time, D-46
SFF, see Small form factor (SFF) disk
SFS benchmark, NFS, D-20
SGI, see Silicon Graphics systems (SGI)
Shadow page table, Virtual Machines, 110
Sharding, WSC memory hierarchy, 445
Shared-media networks
    effective bandwidth vs. nodes, F-28
    example, F-22
    latency and effective bandwidth, F-26 to F-28
    multiple device connections, F-22 to F-24
    vs. switched-media networks, F-24 to F-25
Shared Memory
    definition, 292, 314
    directory-based cache coherence, 418–420
    DSM, 347–348, 348, 354–355, 378–380
    invalidate protocols, 356–357
    SMP/DSM definition, 348
    terminology comparison, 315
Shared-memory communication, large-scale multiprocessors, I-5
Shared-memory multiprocessors
    basic considerations, 351–352
    basic structure, 346–347
    cache coherence, 352–353
    cache coherence enforcement, 354–355
    cache coherence example, 357–362
    cache coherence extensions, 362–363
    data caching, 351–352
    definition, L-63
    historical background, L-60 to L-61
    invalidate protocol implementation, 356–357
    limitations, 363–364
    performance, 366–378
    single-chip multicore case study, 412–418
    SMP and snooping limitations, 363–364
    snooping coherence implementation, 365–366
    snooping coherence protocols, 355–356
    WSCs, 435, 441
Shared-memory synchronization, MIPS core extensions, K-21
Shared state
    cache block, 357, 359
    cache coherence, 360
    cache miss calculations, 366–367
    coherence extensions, 362
    directory-based cache coherence protocol basics, 380, 385
    private cache, 358
Sharing addition, segmented virtual memory, B-52 to B-53
Shear algorithms, disk array deconstruction, D-51 to D-52, D-52 to D-54
Shifting over zeros, integer multiplication/division, J-45 to J-47
Short-circuiting, see Forwarding
SI format instructions, IBM 360, K-87
Signals, definition, E-2
Signal-to-noise ratio (SNR), wireless networks, E-21
Signed-digit representation
    example, J-54
    integer multiplication, J-53
Signed number arithmetic, J-7 to J-10
Sign-extended offset, RISC, C-4 to C-5
Significand, J-15
Sign magnitude, J-7
Silicon Graphics 4D/240, L-59
Silicon Graphics Altix, F-76, L-63
Silicon Graphics Challenge, L-60
Silicon Graphics Origin, L-61, L-63
Silicon Graphics systems (SGI)
    economies of scale, 456
    miss statistics, B-59
    multiprocessor software development, 407–409
    vector processor history, G-27
SIMD (Single Instruction Stream, Multiple Data Stream)
    definition, 10
    Fermi GPU architectural innovations, 305–308
    GPU conditional branching, 301
    GPU examples, 325
    GPU programming, 289–290
    GPUs vs. vector architectures, 308–309
    historical overview, L-55 to L-56
    loop-level parallelism, 150
    MapReduce, 438
    memory bandwidth, 332
    multimedia extensions, see Multimedia SIMD Extensions
    multiprocessor architecture, 346
    multithreaded, see Multithreaded SIMD Processor
    NVIDIA GPU computational structures, 291
    NVIDIA GPU ISA, 300
    power/DLP issues, 322
    speedup via parallelism, 263
    supercomputer development, L-43 to L-44
    system area network history, F-100
    Thread Block mapping, 293
    TI 320C6x DSP, E-9
SIMD Instruction
    CUDA Thread, 303
    definition, 292, 313
    DSP media extensions, E-10
    function, 150, 291
    GPU Memory structures, 304
    GPUs, 300, 305
    Grid mapping, 293
    IBM Blue Gene/L, I-42
    Intel AVX, 438
    multimedia architecture programming, 285
    multimedia extensions, 282–285, 312
    multimedia instruction compilers, A-31 to A-32
    Multithreaded SIMD Processor block diagram, 294
    PTX, 301
    Sony PlayStation 2, E-16
    Thread of SIMD Instructions, 295–296
    thread scheduling, 296–297, 297, 305
    vector architectures as superset, 263–264
    vector/GPU comparison, 308
    Vector Registers, 309
SIMD Lane Registers, definition, 309, 314
SIMD Lanes
    definition, 292, 296, 309
    DLP, 322
    Fermi GPU, 305, 307
    GPU, 296–297, 300, 324
    GPU conditional branching, 302–303
    GPUs vs. vector architectures, 308, 310, 311
    instruction scheduling, 297
    multimedia extensions, 285
    Multimedia SIMD vs. GPUs, 312, 315
    multithreaded processor, 294
    NVIDIA GPU Memory, 304
    synchronization marker, 301
    vector vs. GPU, 308, 311
SIMD Processors, see also Multithreaded SIMD Processor
    block diagram, 294
    definition, 292, 309, 313–314
    dependent computation elimination, 321
    design, 333
    Fermi GPU, 296, 305–308
    Fermi GTX 480 GPU floorplan, 295, 295–296
    GPU conditional branching, 302
    GPU vs. MIMD, 329
    GPU programming, 289–290
    GPUs vs. vector architectures, 310, 310–311
    Grid mapping, 293
    Multimedia SIMD vs. GPU, 312
    multiprocessor architecture, 346
    NVIDIA GPU computational structures, 291
    NVIDIA GPU Memory structures, 304–305
    processor comparisons, 324
    Roofline model, 287, 326
    system area network history, F-100
SIMD Thread
    GPU conditional branching, 301–302
    Grid mapping, 293
    Multithreaded SIMD processor, 294
    NVIDIA GPU, 296
    NVIDIA GPU ISA, 298
    NVIDIA GPU Memory structures, 305
    scheduling example, 297
    vector vs. GPU, 308
    vector processor, 310
SIMD Thread Scheduler
    definition, 292, 314
    example, 297
    Fermi GPU, 295, 305–307, 306
    GPU, 296
SIMT (Single Instruction, Multiple Thread)
    GPU programming, 289
    vs. SIMD, 314
    Warp, 313
Simultaneous multithreading (SMT)
    characteristics, 226
    definition, 224–225
    historical background, L-34 to L-35
    IBM eServer p5 575, 399
    ideal processors, 215
    Intel Core i7, 117–118, 239–241
    Java and PARSEC workloads, 403–404
    multicore performance/energy efficiency, 402–405
    multiprocessing/multithreading-based performance, 398–400
    multithreading history, L-35
    superscalar processors, 230–232
Single-extended precision floating-point arithmetic, J-33 to J-34
Single Instruction, Multiple Thread, see SIMT (Single Instruction, Multiple Thread)
Single Instruction Stream, Multiple Data Stream, see SIMD (Single Instruction Stream, Multiple Data Stream)
Single Instruction Stream, Single Data Stream, see SISD (Single Instruction Stream, Single Data Stream)
Single-level cache hierarchy, miss rates vs. cache size, B-33
Single-precision floating pointarithmetic, J-33 to J-34GPU examples, 325GPU vs. MIMD, 328MIPS data types, A-34MIPS operations, A-36Multimedia SIMD Extensions, 283operand sizes/types, 12, A-13as operand type, A-13 to A-14representation, J-15 to J-16
Single-Streaming Processor (SSP)Cray X1, G-21 to G-24Cray X1E, G-24
Single-thread (ST) performanceIBM eServer p5 575, 399, 399Intel Core i7, 239ISA, 242processor comparison, 243
SISD (Single Instruction Stream, Single Data Stream), 10
  SIMD computer history, L-55
Skippy algorithm
  disk deconstruction, D-49
  sample results, D-50
SLAs, see Service level agreements (SLAs)
SLI, see Scan Line Interleave (SLI)
SLOs, see Service level objectives (SLOs)
SLQ, see Serve-longest-queue (SLQ) scheme
SLT, see Set-on-less-than instructions (SLT)
SM, see Distributed shared memory (DSM)
Small Computer System Interface, see SCSI (Small Computer System Interface)
Small form factor (SFF) disk, L-79
Smalltalk, SPARC instructions, K-30
Smart interface cards, vs. smart switches, F-85 to F-86
Smartphones
  ARM Cortex-A8, 114
  mobile vs. server GPUs, 323–324
Smart switches, vs. smart interface cards, F-85 to F-86
SMP, see Symmetric multiprocessors (SMP)
SMT, see Simultaneous multithreading (SMT)
Snooping cache coherence
  basic considerations, 355–356
  controller transitions, 421
  definition, 354–355
  directory-based, 381, 386, 420–421
  example, 357–362
  implementation, 365–366
  large-scale multiprocessor history, L-61
  large-scale multiprocessors, I-34 to I-35
  latencies, 414
  limitations, 363–364
  sample types, L-59
  single-chip multicore processor case study, 412–418
  symmetric shared-memory machines, 366
SNR, see Signal-to-noise ratio (SNR)
SoC, see System-on-chip (SoC)
Soft errors, definition, 104
Soft real-time
  definition, E-3
  PMDs, 6
Software as a Service (SaaS)
  clusters/WSCs, 8
  software development, 4
  WSCs, 438
  WSCs vs. servers, 433–434
Software development
  multiprocessor architecture issues, 407–409
  performance vs. productivity, 4
  WSC efficiency, 450–452
Software pipelining
  example calculations, H-13 to H-14
  loops, execution pattern, H-15
  technique, H-12 to H-15, H-13
Software prefetching, cache optimization, 131–133
Software speculation
  definition, 156
  vs. hardware speculation, 221–222
  VLIW, 196
Software technology
  ILP approaches, 148
  large-scale multiprocessors, I-6
  large-scale multiprocessor synchronization, I-17 to I-18
  network interfaces, F-7
  vs. TCP/IP reliance, F-95
  Virtual Machines protection, 108
  WSC running service, 434–435
Solaris, RAID benchmarks, D-22, D-22 to D-23
Solid-state disks (SSDs)
  processor performance/price/power, 52
  server energy efficiency, 462
  WSC cost-performance, 474–475
Sonic Smart Interconnect, OCNs, F-3
Sony PlayStation 2
  block diagram, E-16
  embedded multiprocessors, E-14
  Emotion Engine case study, E-15 to E-18
  Emotion Engine organization, E-18
Sorting, case study, D-64 to D-67
Sort primitive, GPU vs. MIMD, 329
Sort procedure, VAX
  bubble sort, K-76
  example code, K-77 to K-79
  vs. MIPS32, K-80
  register allocation, K-76
Source routing, basic concept, F-48
SPARCLE processor, L-34
Sparse matrices
  loop-level parallelism dependences, 318–319
  vector architectures, 279–280, G-12 to G-14
  vector execution time, 271
  vector mask registers, 275
Spatial locality
  coining of term, L-11
  definition, 45, B-2
  memory hierarchy design, 72
SPEC benchmarks
  branch predictor correlation, 162–164
  desktop performance, 38–40
  early performance measures, L-7
  evolution, 39
  fallacies, 56
  operands, A-14
  performance, 38
  performance results reporting, 41
  processor performance growth, 3
  static branch prediction, C-26 to C-27
  storage systems, D-20 to D-21
  tournament predictors, 164
  two-bit predictors, 165
  vector processor history, G-28
SPEC89 benchmarks
  branch-prediction buffers, C-28 to C-30, C-30
  MIPS FP pipeline performance, C-61 to C-62
  misprediction rates, 166
  tournament predictors, 165–166
  VAX 8700 vs. MIPS M2000, K-82
SPEC92 benchmarks
  hardware vs. software speculation, 221
  ILP hardware model, 215
  MIPS R4000 performance, C-68 to C-69, C-69
  misprediction rate, C-27
SPEC95 benchmarks
  return address predictors, 206–207, 207
  way prediction, 82
SPEC2000 benchmarks
  ARM Cortex-A8 memory, 115–116
  cache performance prediction, 125–126
  cache size and misses per instruction, 126
  compiler optimizations, A-29
  compulsory miss rate, B-23
  data reference sizes, A-44
  hardware prefetching, 91
  instruction misses, 127
SPEC2006 benchmarks, evolution, 39
SPECCPU2000 benchmarks
  displacement addressing mode, A-12
  Intel Core i7, 122
  server benchmarks, 40
SPECCPU2006 benchmarks
  branch predictors, 167
  Intel Core i7, 123–124, 240, 240–241
  ISA performance and efficiency prediction, 241
  Virtual Machines protection, 108
SPECfp benchmarks
  hardware prefetching, 91
  interconnection network, F-87
  ISA performance and efficiency prediction, 241–242
  Itanium 2, H-43
  MIPS FP pipeline performance, C-60 to C-61
  nonblocking caches, 84
  tournament predictors, 164
SPECfp92 benchmarks
  Intel 80x86 vs. DLX, K-63
  Intel 80x86 instruction lengths, K-60
  Intel 80x86 instruction mix, K-61
  Intel 80x86 operand type distribution, K-59
  nonblocking cache, 83
SPECfp2000 benchmarks
  hardware prefetching, 92
  MIPS dynamic instruction mix, A-42
  Sun Ultra 5 execution times, 43
SPECfp2006 benchmarks
  Intel processor clock rates, 244
  nonblocking cache, 83
SPECfpRate benchmarks
  multicore processor performance, 400
  multiprocessor cost effectiveness, 407
  SMT, 398–400
  SMT on superscalar processors, 230
SPEChpc96 benchmark, vector processor history, G-28
Special-purpose machines
  historical background, L-4 to L-5
  SIMD computer history, L-56
Special-purpose register
  compiler writing-architecture relationship, A-30
  ISA classification, A-3
  VMIPS, 267
Special values
  floating point, J-14 to J-15
  representation, J-16
SPECINT benchmarks
  hardware prefetching, 92
  interconnection network, F-87
  ISA performance and efficiency prediction, 241–242
  Itanium 2, H-43
  nonblocking caches, 84
SPECInt92 benchmarks
  Intel 80x86 vs. DLX, K-63
  Intel 80x86 instruction lengths, K-60
  Intel 80x86 instruction mix, K-62
  Intel 80x86 operand type distribution, K-59
  nonblocking cache, 83
SPECint95 benchmarks, interconnection networks, F-88
SPECINT2000 benchmarks, MIPS dynamic instruction mix, A-41
SPECINT2006 benchmarks
  Intel processor clock rates, 244
  nonblocking cache, 83
SPECintRate benchmark
  multicore processor performance, 400
  multiprocessor cost effectiveness, 407
  SMT, 398–400
  SMT on superscalar processors, 230
SPEC Java Business Benchmark (JBB)
  multicore processor performance, 400
  multicore processors, 402
  multiprocessing/multithreading-based performance, 398
  server, 40
  Sun T1 multithreading unicore performance, 227–229, 229
SPECJVM98 benchmarks, ISA performance and efficiency prediction, 241
SPECMail benchmark, characteristics, D-20
SPEC-optimized processors, vs. density-optimized, F-85
SPECPower benchmarks
  Google server benchmarks, 439–440, 440
  multicore processor performance, 400
  real-world server considerations, 52–55
  WSCs, 463
  WSC server energy efficiency, 462–463
SPECRate benchmarks
  Intel Core i7, 402
  multicore processor performance, 400
  multiprocessor cost effectiveness, 407
  server benchmarks, 40
SPECRate2000 benchmarks, SMT, 398–400
SPECRatios
  execution time examples, 43
  geometric means calculations, 43–44
SPECSFS benchmarks
  example, D-20
  servers, 40
Speculation, see also Hardware-based speculation; Software speculation
  advantages/disadvantages, 210–211
  compilers, see Compiler speculation
  concept origins, L-29 to L-30
  and energy efficiency, 211–212
  FP unit with Tomasulo’s algorithm, 185
  hardware vs. software, 221–222
  IA-64, H-38 to H-40
  ILP studies, L-32 to L-33
  Intel Core i7, 123–124
  latency hiding in consistency models, 396–397
  memory reference, hardware support, H-32
  and memory system, 222–223
  microarchitectural techniques case study, 247–254
  multiple branches, 211
  register renaming vs. ROB, 208–210
SPECvirt_Sc2010 benchmarks, server, 40
SPECWeb benchmarks
  characteristics, D-20
  dependability, D-21
  parallelism, 44
  server benchmarks, 40
SPECWeb99 benchmarks
  multiprocessing/multithreading-based performance, 398
  Sun T1 multithreading unicore performance, 227, 229
Speedup
  Amdahl’s law, 46–47
  floating-point addition, J-25 to J-26
  integer addition
    carry-lookahead, J-37 to J-41
    carry-lookahead circuit, J-38
    carry-lookahead tree, J-40 to J-41
    carry-lookahead tree adder, J-41
    carry-select adder, J-43, J-43 to J-44, J-44
    carry-skip adder, J-41 to J-43, J-42
    overview, J-37
  integer division
    radix-2 division, J-55
    radix-4 division, J-56
    radix-4 SRT division, J-57
    with single adder, J-54 to J-58
  integer multiplication
    array multiplier, J-50
    Booth recoding, J-49
    even/odd array, J-52
    with many adders, J-50 to J-54
    multipass array multiplier, J-51
    signed-digit addition table, J-54
    with single adder, J-47 to J-49, J-48
    Wallace tree, J-53
  integer multiplication/division, shifting over zeros, J-45 to J-47
  integer SRT division, J-45 to J-46, J-46
  linear, 405–407
  via parallelism, 263
  pipeline with stalls, C-12 to C-13
  relative, 406
  scaled, 406–407
  switch buffer organizations, F-58 to F-59
  true, 406
Sperry-Rand, L-4 to L-5
Spin locks
  via coherence, 389–390
  large-scale multiprocessor synchronization
    barrier synchronization, I-16
    exponential back-off, I-17
SPLASH parallel benchmarks, SMT on superscalar processors, 230
Split, GPU vs. MIMD, 329
SPRAM, Sony PlayStation 2 Emotion Engine organization, E-18
Sprowl, Bob, F-99
Squared coefficient of variance, D-27
SRAM, see Static random-access memory (SRAM)
SRT division
  chip comparison, J-60 to J-61
  complications, J-45 to J-46
  early computer arithmetic, J-65
  example, J-46
  historical background, J-63
  integers, with adder, J-55 to J-57
  radix-4, J-56, J-57
SSDs, see Solid-state disks (SSDs)
SSE, see Intel Streaming SIMD Extension (SSE)
SS format instructions, IBM 360, K-85 to K-88
ssj_ops, see Server side Java operations per second (ssj_ops)
SSP, see Single-Streaming Processor (SSP)
Stack architecture
  and compiler technology, A-27
  flaws vs. success, A-44 to A-45
  historical background, L-16 to L-17
  Intel 80x86, K-48, K-52, K-54
  operands, A-3 to A-4
Stack frame, VAX, K-71
Stack pointer, VAX, K-71
Stack or Thread Local Storage, definition, 292
Stale copy, cache coherency, 112
Stall cycles
  advanced directory protocol case study, 424
  average memory access time, B-17
  branch hazards, C-21
  branch scheme performance, C-25
  definition, B-4 to B-5
  example calculation, B-31
  loop unrolling, 161
  MIPS FP pipeline performance, C-60
  miss rate calculation, B-6
  out-of-order processors, B-20 to B-21
  performance equations, B-22
  pipeline performance, C-12 to C-13
  single-chip multicore multiprocessor case study, 414–418
  structural hazards, C-15
Stalls
  AMD Opteron data cache, B-15
  ARM Cortex-A8, 235, 235–236
  branch hazards, C-42
  data hazard minimization, C-16 to C-19, C-18
  data hazards requiring, C-19 to C-21
  delayed branch, C-65
  Intel Core i7, 239–241
  microarchitectural techniques case study, 252
  MIPS FP pipeline performance, C-60 to C-61, C-61 to C-62
  MIPS pipeline multicycle operations, C-51
  MIPS R4000, C-64, C-67, C-67 to C-69, C-69
  miss rate calculations, B-31 to B-32
  necessity, C-21
  nonblocking cache, 84
  pipeline performance, C-12 to C-13
  from RAW hazards, FP code, C-55
  structural hazard, C-15
  VLIW sample code, 252
  VMIPS, 268
Standardization, commercial interconnection networks, F-63 to F-64
Stardent-1500, Livermore Fortran kernels, 331
Start-up overhead, vs. peak performance, 331
Start-up time
  DAXPY on VMIPS, G-20
  memory banks, 276
  page size selection, B-47
  peak performance, 331
  vector architectures, 331, G-4, G-4, G-8
  vector convoys, G-4
  vector execution time, 270–271
  vector performance, G-2
  vector performance measures, G-16
  vector processor, G-7 to G-9, G-25
  VMIPS, G-5
State transition diagram
  directory vs. cache, 385
  directory-based cache coherence, 383
Statically based exploitation, ILP, H-2
Static power
  basic equation, 26
  SMT, 231
Static random-access memory (SRAM)
  characteristics, 97–98
  dependability, 104
  fault detection pitfalls, 58
  power, 26
  vector memory systems, G-9
  vector processor, G-25
  yield, 32
Static scheduling
  definition, C-71
  ILP, 192–196
  and unoptimized code, C-81
Sticky bit, J-18
Stop & Go, see Xon/Xoff
Storage area networks
  dependability benchmarks, D-21 to D-23, D-22
  historical overview, F-102 to F-103
  I/O system as black box, D-23
Storage systems
  asynchronous I/O and OSes, D-35
  Berkeley’s Tertiary Disk project, D-12
  block servers vs. filers, D-34 to D-35
  bus replacement, D-34
  component failure, D-43
  computer system availability, D-43 to D-44, D-44
  dependability benchmarks, D-21 to D-23
  dirty bits, D-61 to D-64
  disk array deconstruction case study, D-51 to D-55, D-52 to D-55
  disk arrays, D-6 to D-10
  disk deconstruction case study, D-48 to D-51, D-50
  disk power, D-5
  disk seeks, D-45 to D-47
  disk storage, D-2 to D-5
  file system benchmarking, D-20, D-20 to D-21
  Internet Archive Cluster, see Internet Archive Cluster
  I/O performance, D-15 to D-16
  I/O subsystem design, D-59 to D-61
  I/O system design/evaluation, D-36 to D-37
  mail server benchmarking, D-20 to D-21
  NetApp FAS6000 filer, D-41 to D-42
  operator dependability, D-13 to D-15
  OS-scheduled disk access, D-44 to D-45, D-45
  point-to-point links, D-34, D-34
  queue I/O request calculations, D-29
  queuing theory, D-23 to D-34
  RAID performance prediction, D-57 to D-59
  RAID reconstruction case study, D-55 to D-57
  real faults and failures, D-6 to D-10
  reliability, D-44
  response time restrictions for benchmarks, D-18
  seek distance comparison, D-47
  seek time vs. distance, D-46
  server utilization calculation, D-28 to D-29
  sorting case study, D-64 to D-67
  Tandem Computers, D-12 to D-13
  throughput vs. response time, D-16, D-16 to D-18, D-17
  TP benchmarks, D-18 to D-19
  transactions components, D-17
  web server benchmarking, D-20 to D-21
  WSC vs. datacenter costs, 455
  WSCs, 442–443
Store conditional
  locks via coherence, 391
  synchronization, 388–389
Store-and-forward packet switching, F-51
Store instructions, see also Load-store instruction set architecture
  definition, C-4
  instruction execution, 186
  ISA, 11, A-3
  MIPS, A-33, A-36
  NVIDIA GPU ISA, 298
  Opteron data cache, B-15
  RISC instruction set, C-4 to C-6, C-10
  vector architectures, 310
Streaming Multiprocessor
  definition, 292, 313–314
  Fermi GPU, 307
Strecker, William, K-65
Strided accesses
  Multimedia SIMD Extensions, 283
  Roofline model, 287
  TLB interaction, 323
Strided addressing, see also Unit stride addressing
multimedia instruction compiler support, A-31 to A-32
Strides
  gather-scatter, 280
  highly parallel memory systems, 133
  multidimensional arrays in vector architectures, 278–279
  NVIDIA GPU ISA, 300
  vector memory systems, G-10 to G-11
  VMIPS, 266
String operations, Intel 80x86, K-51, K-53
Stripe, disk array deconstruction, D-51
Striping
  disk arrays, D-6
  RAID, D-9
Strip-Mined Vector Loop
  convoys, G-5
  DAXPY on VMIPS, G-20
  definition, 292
  multidimensional arrays, 278
  Thread Block comparison, 294
  vector-length registers, 274
Strip mining
  DAXPY on VMIPS, G-20
  GPU conditional branching, 303
  GPUs vs. vector architectures, 311
  NVIDIA GPU, 291
  vector, 275
  VLRs, 274–275
Strong scaling, Amdahl’s law and parallel computers, 407
Structural hazards
  basic considerations, C-13 to C-16
  definition, C-11
  MIPS pipeline, C-71
  MIPS scoreboarding, C-78 to C-79
  pipeline stall, C-15
  vector execution time, 268–269
Structural stalls, MIPS R4000 pipeline, C-68 to C-69
Subset property, and inclusion, 397
Summary overflow condition code, PowerPC, K-10 to K-11
Sun Microsystems
  cache optimization, B-38
  fault detection pitfalls, 58
  memory dependability, 104
Sun Microsystems Enterprise, L-60
Sun Microsystems Niagara (T1/T2) processors
  characteristics, 227
  CPI and IPC, 399
  fine-grained multithreading, 224, 225, 226–229
  manufacturing cost, 62
  multicore processor performance, 400–401
  multiprocessing/multithreading-based performance, 398–400
  multithreading history, L-34
  T1 multithreading unicore performance, 227–229
Sun Microsystems SPARC
  addressing modes, K-5
  ALU operands, A-6
  arithmetic/logical instructions, K-11, K-31
  branch conditions, A-19
  conditional branches, K-10, K-17
  conditional instructions, H-27
  constant extension, K-9
  conventions, K-13
  data transfer instructions, K-10
  fast traps, K-30
  features, K-44
  FP instructions, K-23
  instruction list, K-31 to K-32
  integer arithmetic, J-12
  integer overflow, J-11
  ISA, A-2
  LISP, K-30
  MIPS core extensions, K-22 to K-23
  overlapped integer/FP operations, K-31
  precise exceptions, C-60
  register windows, K-29 to K-30
  RISC history, L-20
  as RISC system, K-4
  Smalltalk, K-30
  synchronization history, L-64
  unique instructions, K-29 to K-32
Sun Microsystems SPARCCenter, L-60
Sun Microsystems SPARCstation-2, F-88
Sun Microsystems SPARCstation-20, F-88
Sun Microsystems SPARC V8, floating-point precisions, J-33
Sun Microsystems SPARC VIS
  characteristics, K-18
  multimedia support, E-11, K-18
Sun Microsystems Ultra 5, SPECfp2000 execution times, 43
Sun Microsystems UltraSPARC, L-62, L-73
Sun Microsystems UltraSPARC T1 processor, characteristics, F-73
Sun Modular Datacenter, L-74 to L-75
Superblock scheduling
  basic process, H-21 to H-23
  compiler history, L-31
  example, H-22
Supercomputers
  commercial interconnection networks, F-63
  direct network topology, F-37
I-74 ■ Index
  low-dimensional topologies, F-100
  SAN characteristics, F-76
  SIMD, development, L-43 to L-44
  vs. WSCs, 8
Superlinear performance, multiprocessors, 406
Superpipelining
  definition, C-61
  performance histories, 20
Superscalar processors
  coining of term, L-29
  ideal processors, 214–215
  ILP, 192–197, 246
    studies, L-32
  microarchitectural techniques case study, 250–251
  multithreading support, 225
  recent advances, L-33 to L-34
  register renaming code, 251
  rename table and register substitution logic, 251
  SMT, 230–232
  VMIPS, 267
Superscalar registers, sample renaming code, 251
Supervisor process, virtual memory protection, 106
Sussenguth, Ed, L-28
Sutherland, Ivan, L-34
SVM, see Secure Virtual Machine (SVM)
Swap procedure, VAX
  code example, K-72, K-74
  full procedure, K-75 to K-76
  overview, K-72 to K-76
  register allocation, K-72
  register preservation, K-74 to K-75
Swim, data cache misses, B-10
Switched-media networks
  basic characteristics, F-24
  vs. buses, F-2
  effective bandwidth vs. nodes, F-28
  example, F-22
  latency and effective bandwidth, F-26 to F-28
  vs. shared-media networks, F-24 to F-25
Switched networks
  centralized, F-30 to F-34
  DOR, F-46
  OCN history, F-104
  topology, F-40
Switches
  array, WSCs, 443–444
  Benes networks, F-33
  context, 307, B-49
  early LANs and WANs, F-29
  Ethernet switches, 16, 20, 53, 441–444, 464–465, 469
  interconnecting node calculations, F-35
  vs. NIC, F-85 to F-86, F-86
  process switch, 224, B-37, B-49 to B-50
  storage systems, D-34
  switched-media networks, F-24
  WSC hierarchy, 441–442, 442
  WSC infrastructure, 446
  WSC network bottleneck, 461
Switch fabric, switched-media networks, F-24
Switching
  commercial interconnection networks, F-56
  interconnection networks, F-22, F-27, F-50 to F-52
  network impact, F-52 to F-55
  performance considerations, F-92 to F-93
  SAN characteristics, F-76
  switched-media networks, F-24
  system area network history, F-100
Switch microarchitecture
  basic microarchitecture, F-55 to F-58
  buffer organizations, F-58 to F-60
  enhancements, F-62
  HOL blocking, F-59
  input-output-buffered switch, F-57
  pipelining, F-60 to F-61, F-61
Switch ports
  centralized switched networks, F-30
  interconnection network topology, F-29
Switch statements
  control flow instruction addressing modes, A-18
  GPU, 301
Syllable, IA-64, H-35
Symbolic loop unrolling, software pipelining, H-12 to H-15, H-13
Symmetric multiprocessors (SMP)
  characteristics, I-45
  communication calculations, 350
  directory-based cache coherence, 354
  first vector computers, L-47, L-49
  limitations, 363–364
  snooping coherence protocols, 354–355
  system area network history, F-101
  TLP, 345
Symmetric shared-memory multiprocessors, see also Centralized shared-memory multiprocessors
  data caching, 351–352
  limitations, 363–364
  performance
    commercial workload, 367–369
    commercial workload measurement, 369–374
    multiprogramming and OS workload, 374–378
    overview, 366–367
    scientific workloads, I-21 to I-26, I-23 to I-25
Synapse N + 1, L-59
Synchronization
  AltaVista search, 369
  basic considerations, 386–387
  basic hardware primitives, 387–389
  consistency models, 395–396
  cost, 403
  Cray X1, G-23
  definition, 375
  GPU comparisons, 329
  GPU conditional branching, 300–303
  historical background, L-64
  large-scale multiprocessors
    barrier synchronization, I-13 to I-16, I-14, I-16
    challenges, I-12 to I-16
    hardware primitives, I-18 to I-21
    sense-reversing barrier, I-21
    software implementations, I-17 to I-18
    tree-based barriers, I-19
  locks via coherence, 389–391
  message-passing communication, I-5
  MIMD, 10
  MIPS core extensions, K-21
  programmer’s viewpoint, 393–394
  PTX instruction set, 298–299
  relaxed consistency models, 394–395
  single-chip multicore processor case study, 412–418
  vector vs. GPU, 311
  VLIW, 196
  WSCs, 434
Synchronous dynamic random-access memory (SDRAM)
  ARM Cortex-A8, 117
  DRAM, 99
  vs. Flash memory, 103
  IBM Blue Gene/L, I-42
  Intel Core i7, 121
  performance, 100
  power consumption, 102, 103
  SDRAM timing diagram, 139
Synchronous event, exception requirements, C-44 to C-45
Synchronous I/O, definition, D-35
Synonyms
  address translation, B-38
  dependability, 34
Synthetic benchmarks
  definition, 37
  typical program fallacy, A-43
System area networks, historical overview, F-100 to F-102
System calls
  CUDA Thread, 297
  multiprogrammed workload, 378
  virtualization/paravirtualization performance, 141
  virtual memory protection, 106
System interface controller (SIF), Intel SCCC, F-70
System-on-chip (SoC)
  cell phone, E-24
  cross-company interoperability, F-64
  embedded systems, E-3
  Sanyo digital cameras, E-20
  Sanyo VPC-SX500 digital camera, E-19
  shared-media networks, F-23
System Performance and Evaluation Cooperative (SPEC), see SPEC benchmarks
System Processor
  definition, 309
  DLP, 262, 322
  Fermi GPU, 306
  GPU issues, 330
  GPU programming, 288–289
  NVIDIA GPU ISA, 298
  NVIDIA GPU Memory, 305
  processor comparisons, 323–324
  synchronization, 329
  vector vs. GPU, 311–312
System response time, transactions, D-16, D-17
Systems on a chip (SOC), cost trends, 28
System/storage area networks (SANs)
  characteristics, F-3 to F-4
  communication protocols, F-8
  congestion management, F-65
  cross-company interoperability, F-64
  effective bandwidth, F-18
  example system, F-72 to F-74
  fat trees, F-34
  fault tolerance, F-67
  InfiniBand example, F-74 to F-77
  interconnection network domain relationship, F-4
  LAN history, F-99
  latency and effective bandwidth, F-26 to F-28
  latency vs. nodes, F-27
  packet latency, F-13, F-14 to F-16
  routing algorithms, F-48
  software overhead, F-91
  TCP/IP reliance, F-95
  time of flight, F-13
  topology, F-30
System Virtual Machines, definition, 107
T
Tag
  AMD Opteron data cache, B-12 to B-14
  ARM Cortex-A8, 115
  cache optimization, 79–80
  dynamic scheduling, 177
  invalidate protocols, 357
  memory hierarchy basics, 74, 77–78
  virtual memory fast address translation, B-46
  write strategy, B-10
Tag check (TC)
  MIPS R4000, C-63
  R4000 pipeline, B-62 to B-63
  R4000 pipeline structure, C-63
  write process, B-10
Tag fields
  block identification, B-8
  dynamic scheduling, 173, 175
Tail duplication, superblock scheduling, H-21
Tailgating, definition, G-20
Tandem Computers
  cluster history, L-62, L-72
  faults, D-14
  overview, D-12 to D-13
Target address
  branch hazards, C-21, C-42
  branch penalty reduction, C-22 to C-23
  branch-target buffer, 206
  control flow instructions, A-17 to A-18
  GPU conditional branching, 301
  Intel Core i7 branch predictor, 166
  MIPS control flow instructions, A-38
  MIPS implementation, C-32
  MIPS pipeline, C-36, C-37
  MIPS R4000, C-25
  pipeline branches, C-39
  RISC instruction set, C-5
Target channel adapters (TCAs), switch vs. NIC, F-86
Target instructions
  branch delay slot scheduling, C-24
  as branch-target buffer variation, 206
  GPU conditional branching, 301
Task-level parallelism (TLP), definition, 9
TB, see Translation buffer (TB)
TB-80 VME rack
  example, D-38
  MTTF calculation, D-40 to D-41
TC, see Tag check (TC)
TCAs, see Target channel adapters (TCAs)
TCO, see Total Cost of Ownership (TCO)
TCP, see Transmission Control Protocol (TCP)
TCP/IP, see Transmission Control Protocol/Internet Protocol (TCP/IP)
TDMA, see Time division multiple access (TDMA)
TDP, see Thermal design power (TDP)
Technology trends
  basic considerations, 17–18
  performance, 18–19
Teleconferencing, multimedia support, K-17
Temporal locality
  blocking, 89–90
  cache optimization, B-26
  coining of term, L-11
  definition, 45, B-2
  memory hierarchy design, 72
TERA processor, L-34
Terminate events
  exceptions, C-45 to C-46
  hardware-based speculation, 188
  loop unrolling, 161
Tertiary Disk project
  failure statistics, D-13
  overview, D-12
  system log, D-43
Test-and-set operation, synchronization, 388
Texas Instruments 8847
  arithmetic functions, J-58 to J-61
  chip comparison, J-58
  chip layout, J-59
Texas Instruments ASC
  first vector computers, L-44
  peak performance vs. start-up overhead, 331
TFLOPS, parallel processing debates, L-57 to L-58
TFT, see Thin-film transistor (TFT)
Thacker, Chuck, F-99
Thermal design power (TDP), power trends, 22
Thin-film transistor (TFT), Sanyo VPC-SX500 digital camera, E-19
Thinking Machines, L-44, L-56
Thinking Machines CM-5, L-60
Think time, transactions, D-16, D-17
Third-level caches, see also L3 caches
  ILP, 245
  interconnection network, F-87
  SRAM, 98–99
Thrash, memory hierarchy, B-25
Thread Block
  CUDA Threads, 297, 300, 303
  definition, 292, 313
  Fermi GTX 480 GPU floorplan, 295
  function, 294
  GPU hardware levels, 296
  GPU Memory performance, 332
  GPU programming, 289–290
  Grid mapping, 293
  mapping example, 293
  multithreaded SIMD Processor, 294
  NVIDIA GPU computational structures, 291
  NVIDIA GPU Memory structures, 304
  PTX Instructions, 298
Thread Block Scheduler
  definition, 292, 309, 313–314
  Fermi GTX 480 GPU floorplan, 295
  function, 294, 311
  GPU, 296
  Grid mapping, 293
  multithreaded SIMD Processor, 294
Thread-level parallelism (TLP)
  advanced directory protocol case study, 420–426
  Amdahl’s law and parallel computers, 406–407
  centralized shared-memory multiprocessors
    basic considerations, 351–352
    cache coherence, 352–353
    cache coherence enforcement, 354–355
    cache coherence example, 357–362
    cache coherence extensions, 362–363
    invalidate protocol implementation, 356–357
    SMP and snooping limitations, 363–364
    snooping coherence implementation, 365–366
    snooping coherence protocols, 355–356
  definition, 9
  directory-based cache coherence
    case study, 418–420
    protocol basics, 380–382
    protocol example, 382–386
  DSM and directory-based coherence, 378–380
  embedded systems, E-15
  IBM Power7, 215
  from ILP, 4–5
  inclusion, 397–398
  Intel Core i7 performance/energy efficiency, 401–405
  memory consistency models
    basic considerations, 392–393
    compiler optimization, 396
    programming viewpoint, 393–394
    relaxed consistency models, 394–395
    speculation to hide latency, 396–397
  MIMDs, 344–345
  multicore processor performance, 400–401
  multicore processors and SMT, 404–405
  multiprocessing/multithreading-based performance, 398–400
  multiprocessor architecture, 346–348
  multiprocessor cost effectiveness, 407
  multiprocessor performance, 405–406
  multiprocessor software development, 407–409
  vs. multithreading, 223–224
  multithreading history, L-34 to L-35
  parallel processing challenges, 349–351
  single-chip multicore processor case study, 412–418
  Sun T1 multithreading, 226–229
  symmetric shared-memory multiprocessor performance
    commercial workload, 367–369
    commercial workload measurement, 369–374
    multiprogramming and OS workload, 374–378
    overview, 366–367
  synchronization
    basic considerations, 386–387
    basic hardware primitives, 387–389
    locks via coherence, 389–391
Thread Processor
  definition, 292, 314
  GPU, 315
Thread Processor Registers, definition, 292
Thread Scheduler in a Multithreaded CPU, definition, 292
Thread of SIMD Instructions
  characteristics, 295–296
  CUDA Thread, 303
  definition, 292, 313
  Grid mapping, 293
  lane recognition, 300
  scheduling example, 297
  terminology comparison, 314
  vector/GPU comparison, 308–309
Thread of Vector Instructions, definition, 292
Three-dimensional space, direct networks, F-38
Three-level cache hierarchy
  commercial workloads, 368
  ILP, 245
  Intel Core i7, 118, 118
Throttling, packets, F-10
Throughput, see also Bandwidth
  definition, C-3, F-13
  disk storage, D-4
  Google WSC, 470
  ILP, 245
  instruction fetch bandwidth, 202
  Intel Core i7, 236–237
  kernel characteristics, 327
  memory banks, 276
  multiple lanes, 271
  parallelism, 44
  performance considerations, 36
  performance trends, 18–19
  pipelining basics, C-10
  precise exceptions, C-60
  producer-server model, D-16
  vs. response time, D-17
  routing comparison, F-54
  server benchmarks, 40–41
  servers, 7
  storage systems, D-16 to D-18
  uniprocessors, TLP
    basic considerations, 223–226
    fine-grained multithreading on Sun T1, 226–229
    superscalar SMT, 230–232
  and virtual channels, F-93
  WSCs, 434
Ticks
  cache coherence, 391
  processor performance equation, 48–49
Tilera TILE-Gx processors, OCNs, F-3
Time-cost relationship, components, 27–28
Time division multiple access (TDMA), cell phones, E-25
Time of flight
  communication latency, I-3 to I-4
  interconnection networks, F-13
Timing independent, L-17 to L-18
TI TMS320C6x DSP
  architecture, E-9
  characteristics, E-8 to E-10
  instruction packet, E-10
TI TMS320C55 DSP
  architecture, E-7
  characteristics, E-7 to E-8
  data operands, E-6
TLB, see Translation lookaside buffer (TLB)
TLP, see Task-level parallelism (TLP); Thread-level parallelism (TLP)
Tomasulo’s algorithm
  advantages, 177–178
  dynamic scheduling, 170–176
  FP unit, 185
  loop-based example, 179, 181–183
  MIPS FP unit, 173
  register renaming vs. ROB, 209
  step details, 178, 180
TOP500, L-58
Top Of Stack (TOS) register, ISA operands, A-4
Topology
  Benes networks, F-33
  centralized switched networks, F-30 to F-34, F-31
  definition, F-29
  direct networks, F-37
  distributed switched networks, F-34 to F-40
  interconnection networks, F-21 to F-22, F-44
    basic considerations, F-29 to F-30
    fault tolerance, F-67
    network performance and cost, F-40
    network performance effects, F-40 to F-44
    rings, F-36
    routing/arbitration/switching impact, F-52
    system area network history, F-100 to F-101
Torus networks
  characteristics, F-36
  commercial interconnection networks, F-63
  direct networks, F-37
  fault tolerance, F-67
  IBM Blue Gene/L, F-72 to F-74
  NEWS communication, F-43
  routing comparison, F-54
  system area network history, F-102
TOS, see Top Of Stack (TOS) register
Total Cost of Ownership (TCO), WSC case study, 476–479
Total store ordering, relaxed consistency models, 395
Tournament predictors
  early schemes, L-27 to L-28
  ILP for realizable processors, 216
  local/global predictor combinations, 164–166
Toy programs, performance benchmarks, 37
TP, see Transaction-processing (TP)
TPC, see Transaction Processing Council (TPC)
Trace compaction, basic process, H-19
Trace scheduling
  basic approach, H-19 to H-21
  overview, H-20
Trace selection, definition, H-19
Tradebeans benchmark, SMT on superscalar processors, 230
Traffic intensity, queuing theory, D-25
Trailer
  messages, F-6
  packet format, F-7
Transaction components, D-16, D-17, I-38 to I-39
Transaction-processing (TP)
  server benchmarks, 41
  storage system benchmarks, D-18 to D-19
Transaction Processing Council (TPC)
  benchmarks overview, D-18 to D-19, D-19
  parallelism, 44
  performance results reporting, 41
  server benchmarks, 41
  TPC-B, shared-memory workloads, 368
  TPC-C
    file system benchmarking, D-20
    IBM eServer p5 processor, 409
    multiprocessing/multithreading-based performance, 398
    multiprocessor cost effectiveness, 407
    single vs. multiple thread executions, 228
    Sun T1 multithreading unicore performance, 227–229, 229
    WSC services, 441
  TPC-D, shared-memory workloads, 368–369
  TPC-E, shared-memory workloads, 368–369
Transfers, see also Data transfers
  as early control flow instruction definition, A-16
Transforms, DSP, E-5
Transient failure, commercial interconnection networks, F-66
Transient faults, storage systems, D-11
Transistors
  clock rate considerations, 244
  dependability, 33–36
  energy and power, 23–26
  ILP, 245
  performance scaling, 19–21
  processor comparisons, 324
  processor trends, 2
  RISC instructions, A-3
  shrinking, 55
  static power, 26
  technology trends, 17–18
Translation buffer (TB)
  virtual memory block identification, B-45
  virtual memory fast address translation, B-46
Translation lookaside buffer (TLB)
  address translation, B-39
  AMD64 paged virtual memory, B-56 to B-57
  ARM Cortex-A8, 114–115
  cache optimization, 80, B-37
  coining of term, L-9
  Intel Core i7, 118, 120–121
  interconnection network protection, F-86
  memory hierarchy, B-48 to B-49
  memory hierarchy basics, 78
  MIPS64 instructions, K-27
  Opteron, B-47
  Opteron memory hierarchy, B-57
  RISC code size, A-23
  shared-memory workloads, 369–370
  speculation advantages/disadvantages, 210–211
  strided access interactions, 323
  Virtual Machines, 110
  virtual memory block identification, B-45
  virtual memory fast address translation, B-46
  virtual memory page size selection, B-47
  virtual memory protection, 106–107
Transmission Control Protocol (TCP), congestion management, F-65
Transmission Control Protocol/Internet Protocol (TCP/IP)
ATM, F-79headers, F-84internetworking, F-81, F-83 to
F-84, F-89reliance on, F-95WAN history, F-98
Transmission speed, interconnection network performance, F-13
Transmission time
  communication latency, I-3 to I-4
  time of flight, F-13 to F-14
Transport latency
  time of flight, F-14
  topology, F-35 to F-36
Transport layer, definition, F-82
Transputer, F-100
Tree-based barrier, large-scale multiprocessor synchronization, I-19
Tree height reduction, definition, H-11
Trees, MINs with nonblocking, F-34
Trellis codes, definition, E-7
TRIPS Edge processor, F-63
  characteristics, F-73
Trojan horses
  definition, B-51
  segmented virtual memory, B-53
True dependence
  finding, H-7 to H-8
  loop-level parallelism calculations, 320
  vs. name dependence, 153
True sharing misses
  commercial workloads, 371, 373
  definition, 366–367
  multiprogramming workloads, 377
True speedup, multiprocessor performance, 406
TSMC, Stratton, F-3
TSS operating system, L-9
Turbo mode
  hardware enhancements, 56
  microprocessors, 26
Turing, Alan, L-4, L-19
Turn Model routing algorithm, example calculations, F-47 to F-48
Two-level branch predictors
  branch costs, 163
  Intel Core i7, 166
  tournament predictors, 165
Two-level cache hierarchy
  cache optimization, B-31
  ILP, 245
Two’s complement, J-7 to J-8
Two-way conflict misses, definition, B-23
Two-way set associativity
  ARM Cortex-A8, 233
  cache block placement, B-7, B-8
  cache miss rates, B-24
  cache miss rates vs. size, B-33
  cache optimization, B-38
  cache organization calculations, B-19 to B-20
  commercial workload, 370–373, 371
  multiprogramming workload, 374–375
  nonblocking cache, 84
  Opteron data cache, B-13 to B-14
  2:1 cache rule of thumb, B-29
  virtual to cache access scenario, B-39
TX-2, L-34, L-49
“Typical” program, instruction set considerations, A-43
U
U, see Rack units (U)
Ultrix, DECstation 5000 reboots, F-69
UMA, see Uniform memory access (UMA)
Unbiased exponent, J-15
Uncached state, directory-based cache coherence protocol basics, 380, 384–386
Unconditional branches
  branch folding, 206
  branch-prediction schemes, C-25 to C-26
  VAX, K-71
Underflow
  floating-point arithmetic, J-36 to J-37, J-62
  gradual, J-15
Unicasting, shared-media networks, F-24
Unicode character
  MIPS data types, A-34
  operand sizes/types, 12
  popularity, A-14
Unified cache
  AMD Opteron example, B-15
  performance, B-16 to B-17
Uniform memory access (UMA)
  multicore single-chip multiprocessor, 364
  SMP, 346–348
Uninterruptible instruction
  hardware primitives, 388
  synchronization, 386
Uninterruptible power supply (UPS)
  Google WSC, 467
  WSC calculations, 435
  WSC infrastructure, 447
Uniprocessors
  cache protocols, 359
  development views, 344
  linear speedups, 407
  memory hierarchy design, 73
  memory system coherency, 353, 358
  misses, 371, 373
  multiprogramming workload, 376–377
  multithreading
    basic considerations, 223–226
    fine-grained on T1, 226–229
    simultaneous, on superscalars, 230–232
  parallel vs. sequential programs, 405–406
  processor performance trends, 3–4, 344
  SISD, 10
  software development, 407–408
Unit stride addressing
  gather-scatter, 280
  GPU vs. MIMD with Multimedia SIMD, 327
  GPUs vs. vector architectures, 310
  multimedia instruction compiler support, A-31
  NVIDIA GPU ISA, 300
  Roofline model, 287
UNIVAC I, L-5
UNIX systems
  architecture costs, 2
  block servers vs. filers, D-35
  cache optimization, B-38
  floating point remainder, J-32
  miss statistics, B-59
  multiprocessor software development, 408
  multiprogramming workload, 374
  seek distance comparison, D-47
  vector processor history, G-26
Unpacked decimal, A-14, J-16
Unshielded twisted pair (UTP), LAN history, F-99
Up*/down* routing
  definition, F-48
  fault tolerance, F-67
UPS, see Uninterruptible power supply (UPS)
USB, Sony PlayStation 2 Emotion Engine case study, E-15
Use bit
  address translation, B-46
  segmented virtual memory, B-52
  virtual memory block replacement, B-45
User-level communication, definition, F-8
User maskable events, definition, C-45 to C-46
User nonmaskable events, definition, C-45
User-requested events, exception requirements, C-45
Utility computing, 455–461, L-73 to L-74
Utilization
  I/O system calculations, D-26
  queuing theory, D-25
UTP, see Unshielded twisted pair (UTP)
V
Valid bit
  address translation, B-46
  block identification, B-7
  Opteron data cache, B-14
  paged virtual memory, B-56
  segmented virtual memory, B-52
  snooping, 357
  symmetric shared-memory multiprocessors, 366
Value prediction
  definition, 202
  hardware-based speculation, 192
  ILP, 212–213, 220
  speculation, 208
VAPI, InfiniBand, F-77
Variable length encoding
  control flow instruction branches, A-18
  instruction sets, A-22
  ISAs, 14
Variables
  and compiler technology, A-27 to A-29
  CUDA, 289
  Fermi GPU, 306
  ISA, A-5, A-12
  locks via coherence, 389
  loop-level parallelism, 316
  memory consistency, 392
  NVIDIA GPU Memory, 304–305
  procedure invocation options, A-19
  random, distribution, D-26 to D-34
  register allocation, A-26 to A-27
  in registers, A-5
  synchronization, 375
  TLP programmer’s viewpoint, 394
VCs, see Virtual channels (VCs)
Vector architectures
  computer development, L-44 to L-49
  definition, 9
  DLP
    basic considerations, 264
    definition terms, 309
    gather/scatter operations, 279–280
    multidimensional arrays, 278–279
    multiple lanes, 271–273
    programming, 280–282
    vector execution time, 268–271
    vector-length registers, 274–275
    vector load/store unit bandwidth, 276–277
    vector-mask registers, 275–276
    vector processor example, 267–268
    VMIPS, 264–267
  GPU conditional branching, 303
  vs. GPUs, 308–312
  mapping examples, 293
  memory systems, G-9 to G-11
  multimedia instruction compiler support, A-31
  vs. Multimedia SIMD Extensions, 282
  peak performance vs. start-up overhead, 331
  power/DLP issues, 322
  vs. scalar performance, 331–332
  start-up latency and dead time, G-8
  strided access-TLB interactions, 323
  vector-register characteristics, G-3
Vector Functional Unit
  vector add instruction, 272–273
  vector execution time, 269
  vector sequence chimes, 270
  VMIPS, 264
Vector Instruction
  definition, 292, 309
  DLP, 322
  Fermi GPU, 305
  gather-scatter, 280
  instruction-level parallelism, 150
  mask registers, 275–276
  Multimedia SIMD Extensions, 282
  multiple lanes, 271–273
  Thread of Vector Instructions, 292
  vector execution time, 269
  vector vs. GPU, 308, 311
  vector processor example, 268
  VMIPS, 265–267, 266
Vectorizable Loop
  characteristics, 268
  definition, 268, 292, 313
  Grid mapping, 293
  Livermore Fortran kernel performance, 331
  mapping example, 293
  NVIDIA GPU computational structures, 291
Vectorized code
  multimedia compiler support, A-31
  vector architecture programming, 280–282
  vector execution time, 271
  VMIPS, 268
Vectorized Loop, see also Body of Vectorized Loop
  definition, 309
  GPU Memory structure, 304
  vs. Grid, 291, 308
  mask registers, 275
  NVIDIA GPU, 295
  vector vs. GPU, 308
Vectorizing compilers
  effectiveness, G-14 to G-15
  FORTRAN test kernels, G-15
  sparse matrices, G-12 to G-13
Vector Lane Registers, definition, 292
Vector Lanes
  control processor, 311
  definition, 292, 309
  SIMD Processor, 296–297, 297
Vector-length register (VLR)
  basic operation, 274–275
  performance, G-5
  VMIPS, 267
Vector load/store unit
  memory banks, 276–277
  VMIPS, 265
Vector loops
  NVIDIA GPU, 294
  processor example, 267
  strip-mining, 303
  vector vs. GPU, 311
  vector-length registers, 274–275
  vector-mask registers, 275–276
Vector-mask control, characteristics, 275–276
Vector-mask registers
  basic operation, 275–276
  Cray X1, G-21 to G-22
  VMIPS, 267
Vector Processor
  caches, 305
  compiler vectorization, 281
  Cray X1
    MSP modules, G-22
    overview, G-21 to G-23
  Cray X1E, G-24
  definition, 292, 309
  DLP processors, 322
  DSP media extensions, E-10
  example, 267–268
  execution time, G-7
  functional units, 272
  gather-scatter, 280
  vs. GPUs, 276
  historical background, G-26
  loop-level parallelism, 150
  loop unrolling, 196
  measures, G-15 to G-16
  memory banks, 277
  and multiple lanes, 273, 310
  multiprocessor architecture, 346
  NVIDIA GPU computational structures, 291
  overview, G-25 to G-26
  peak performance focus, 331
  performance, G-2 to G-7
    start-up and multiple lanes, G-7 to G-9
  performance comparison, 58
  performance enhancement
    chaining, G-11 to G-12
    DAXPY on VMIPS, G-19 to G-21
    sparse matrices, G-12 to G-14
  PTX, 301
  Roofline model, 286–287, 287
  vs. scalar processor, 311, 331, 333, G-19
  vs. SIMD Processor, 294–296
  Sony PlayStation 2 Emotion Engine, E-17 to E-18
  start-up overhead, G-4
  stride, 278
  strip mining, 275
  vector execution time, 269–271
  vector/GPU comparison, 308
  vector kernel implementation, 334–336
  VMIPS, 264–265
  VMIPS on DAXPY, G-17
  VMIPS on Linpack, G-17 to G-19
Vector Registers
  definition, 309
  execution time, 269, 271
  gather-scatter, 280
  multimedia compiler support, A-31
  Multimedia SIMD Extensions, 282
  multiple lanes, 271–273
  NVIDIA GPU, 297
  NVIDIA GPU ISA, 298
  performance/bandwidth trade-offs, 332
  processor example, 267
  strides, 278–279
  vector vs. GPU, 308, 311
  VMIPS, 264–267, 266
Very-large-scale integration (VLSI)
  early computer arithmetic, J-63
  interconnection network topology, F-29
  RISC history, L-20
  Wallace tree, J-53
Very Long Instruction Word (VLIW)
  clock rates, 244
  compiler scheduling, L-31
  EPIC, L-32
  IA-64, H-33 to H-34
  ILP, 193–196
  loop-level parallelism, 315
  M32R, K-39 to K-40
  multiple-issue processors, 194, L-28 to L-30
  multithreading history, L-34
  sample code, 252
  TI 320C6x DSP, E-8 to E-10
VGA controller, L-51
Video
  Amazon Web Services, 460
  application trends, 4
  PMDs, 6
  WSCs, 8, 432, 437, 439
Video games, multimedia support, K-17
VI interface, L-73
Virtual address
  address translation, B-46
  AMD64 paged virtual memory, B-55
  AMD Opteron data cache, B-12 to B-13
  ARM Cortex-A8, 115
  cache optimization, B-36 to B-39
  GPU conditional branching, 303
  Intel Core i7, 120
  mapping to physical, B-45
  memory hierarchy, B-39, B-48, B-48 to B-49
  memory hierarchy basics, 77–78
  miss rate vs. cache size, B-37
  Opteron mapping, B-55
  Opteron memory management, B-55 to B-56
  and page size, B-58
  page table-based mapping, B-45
  translation, B-36 to B-39
  virtual memory, B-42, B-49
Virtual address space
  example, B-41
  main memory block, B-44
Virtual caches
  definition, B-36 to B-37
  issues with, B-38
Virtual channels (VCs), F-47
  HOL blocking, F-59
  Intel SCCC, F-70
  routing comparison, F-54
  switching, F-51 to F-52
  switch microarchitecture pipelining, F-61
  system area network history, F-101
  and throughput, F-93
Virtual cut-through switching, F-51
Virtual functions, control flow instructions, A-18
Virtualizable architecture
  Intel 80x86 issues, 128
  system call performance, 141
  Virtual Machines support, 109
  VMM implementation, 128–129
Virtualizable GPUs, future technology, 333
Virtual machine monitor (VMM)
  characteristics, 108
  nonvirtualizable ISA, 126, 128–129
  requirements, 108–109
  Virtual Machines ISA support, 109–110
  Xen VM, 111
Virtual Machines (VMs)
  Amazon Web Services, 456–457
  cloud computing costs, 471
  early IBM work, L-10
  ISA support, 109–110
  protection, 107–108
  protection and ISA, 112
  server benchmarks, 40
  and virtual memory and I/O, 110–111
  WSCs, 436
  Xen VM, 111
Virtual memory
  basic considerations, B-40 to B-44, B-48 to B-49
  basic questions, B-44 to B-46
  block identification, B-44 to B-45
  block placement, B-44
  block replacement, B-45
  vs. caches, B-42 to B-43
  classes, B-43
  definition, B-3
  fast address translation, B-46
  Multimedia SIMD Extensions, 284
  multithreading, 224
  paged example, B-54 to B-57
  page size selection, B-46 to B-47
  parameter ranges, B-42
  Pentium vs. Opteron protection, B-57
  protection, 105–107
  segmented example, B-51 to B-54
  strided access-TLB interactions, 323
  terminology, B-42
  Virtual Machines impact, 110–111
  writes, B-45 to B-46
Virtual methods, control flow instructions, A-18
Virtual output queues (VOQs), switch microarchitecture, F-60
VLIW, see Very Long Instruction Word (VLIW)
VLR, see Vector-length register (VLR)
VLSI, see Very-large-scale integration (VLSI)
VMCS, see Virtual Machine Control State (VMCS)
VME rack
  example, D-38
  Internet Archive Cluster, D-37
VMIPS
  basic structure, 265
  DAXPY, G-18 to G-20
  DLP, 265–267
  double-precision FP operations, 266
  enhanced, DAXPY performance, G-19 to G-21
  gather/scatter operations, 280
  ISA components, 264–265
  multidimensional arrays, 278–279
  Multimedia SIMD Extensions, 282
  multiple lanes, 271–272
  peak performance on DAXPY, G-17
  performance, G-4
  performance on Linpack, G-17 to G-19
  sparse matrices, G-13
  start-up penalties, G-5
  vector execution time, 269–270, G-6 to G-7
  vector vs. GPU, 308
  vector-length registers, 274
  vector load/store unit bandwidth, 276
  vector performance measures, G-16
  vector processor example, 267–268
  VLR, 274
VMM, see Virtual machine monitor (VMM)
VMs, see Virtual Machines (VMs)
Voltage regulator controller (VRC), Intel SCCC, F-70
Voltage regulator modules (VRMs), WSC server energy efficiency, 462
Volume-cost relationship, components, 27–28
Von Neumann, John, L-2 to L-6
Von Neumann computer, L-3
Voodoo2, L-51
VOQs, see Virtual output queues (VOQs)
VRC, see Voltage regulator controller (VRC)
VRMs, see Voltage regulator modules (VRMs)
W
Wafers
  example, 31
  integrated circuit cost trends, 28–32
Wafer yield
  chip costs, 32
  definition, 30
Waiting line, definition, D-24
Wait time, shared-media networks, F-23
Wallace tree
  example, J-53, J-53
  historical background, J-63
Wall-clock time
  execution time, 36
  scientific applications on parallel processors, I-33
WANs, see Wide area networks (WANs)
WAR, see Write after read (WAR)
Warehouse-scale computers (WSCs)
  Amazon Web Services, 456–461
  basic concept, 432
  characteristics, 8
  cloud computing, 455–461
  cloud computing providers, 471–472
  cluster history, L-72 to L-73
  computer architecture
    array switch, 443
    basic considerations, 441–442
    memory hierarchy, 443, 443–446, 444
    storage, 442–443
  as computer class, 5
  computer cluster forerunners, 435–436
  cost-performance, 472–473
  costs, 452–455, 453–454
  definition, 345
  and ECC memory, 473–474
  efficiency measurement, 450–452
  facility capital costs, 472
  Flash memory, 474–475
  Google
    containers, 464–465
    cooling and power, 465–468
    monitoring and repairing, 469–470
    PUE, 468
    server, 467
    servers, 468–469
  MapReduce, 437–438
  network as bottleneck, 461
  physical infrastructure and costs, 446–450
  power modes, 472
  programming models and workloads, 436–441
  query response-time curve, 482
  relaxed consistency, 439
  resource allocation, 478–479
  server energy efficiency, 462–464
  vs. servers, 432–434
  SPECPower benchmarks, 463
  switch hierarchy, 441–442, 442
  TCO case study, 476–478
Warp, L-31
  definition, 292, 313
  terminology comparison, 314
Warp Scheduler
  definition, 292, 314
  Multithreaded SIMD Processor, 294
Wavelength division multiplexing (WDM), WAN history, F-98
WAW, see Write after write (WAW)
Way prediction, cache optimization, 81–82
Way selection, 82
WB, see Write-back cycle (WB)
WCET, see Worst-case execution time (WCET)
WDM, see Wavelength division multiplexing (WDM)
Weak ordering, relaxed consistency models, 395
Weak scaling, Amdahl’s law and parallel computers, 406–407
Web index search, shared-memory workloads, 369
Web servers
  benchmarking, D-20 to D-21
  dependability benchmarks, D-21
  ILP for realizable processors, 218
  performance benchmarks, 40
  WAN history, F-98
Weighted arithmetic mean time, D-27
Weitek 3364
  arithmetic functions, J-58 to J-61
  chip comparison, J-58
  chip layout, J-60
West-first routing, F-47 to F-48
Wet-bulb temperature
  Google WSC, 466
  WSC cooling systems, 449
Whirlwind project, L-4
Wide area networks (WANs)
  ATM, F-79
  characteristics, F-4
  cross-company interoperability, F-64
  effective bandwidth, F-18
  fault tolerance, F-68
  historical overview, F-97 to F-99
  InfiniBand, F-74
  interconnection network domain relationship, F-4
  latency and effective bandwidth, F-26 to F-28
  offload engines, F-8
  packet latency, F-13, F-14 to F-16
  routers/gateways, F-79
  switches, F-29
  switching, F-51
  time of flight, F-13
  topology, F-30
Wilkes, Maurice, L-3
Winchester, L-78
Window
  latency, B-21
  processor performance calculations, 218
  scoreboarding definition, C-78
  TCP/IP headers, F-84
Windowing, congestion management, F-65
Window size
  ILP limitations, 221
  ILP for realizable processors, 216–217
  vs. parallelism, 217
Windows operating systems, see Microsoft Windows
Wireless networks
  basic challenges, E-21
  and cell phones, E-21 to E-22
Wires
  energy and power, 23
  scaling, 19–21
Within instruction exceptions
  definition, C-45
  instruction set complications, C-50
  stopping/restarting execution, C-46
Word count, definition, B-53
Word displacement addressing, VAX, K-67
Word offset, MIPS, C-32
Words
  aligned/misaligned addresses, A-8
  AMD Opteron data cache, B-15
  DSP, E-6
  Intel 80x86, K-50
  memory address interpretation, A-7 to A-8
  MIPS data transfers, A-34
  MIPS data types, A-34
  MIPS unaligned reads, K-26
  operand sizes/types, 12
  as operand type, A-13 to A-14
  VAX, K-70
Working set effect, definition, I-24
Workloads
  execution time, 37
  Google search, 439
  Java and PARSEC without SMT, 403–404
  RAID performance prediction, D-57 to D-59
  symmetric shared-memory multiprocessor performance, 367–374, I-21 to I-26
  WSC goals/requirements, 433
  WSC resource allocation case study, 478–479
  WSCs, 436–441
Wormhole switching, F-51, F-88
  performance issues, F-92 to F-93
  system area network history, F-101
Worst-case execution time (WCET), definition, E-4
Write after read (WAR)
  data hazards, 153–154, 169
  dynamic scheduling with Tomasulo’s algorithm, 170–171
  hazards and forwarding, C-55
  ILP limitation studies, 220
  MIPS scoreboarding, C-72, C-74 to C-75, C-79
  multiple-issue processors, L-28
  register renaming vs. ROB, 208
  ROB, 192
  TI TMS320C55 DSP, E-8
  Tomasulo’s advantages, 177–178
  Tomasulo’s algorithm, 182–183
Write after write (WAW)
  data hazards, 153, 169
  dynamic scheduling with Tomasulo’s algorithm, 170–171
  execution sequences, C-80
  hazards and forwarding, C-55 to C-58
  ILP limitation studies, 220
  microarchitectural techniques case study, 253
  MIPS FP pipeline performance, C-60 to C-61
  MIPS scoreboarding, C-74, C-79
  multiple-issue processors, L-28
  register renaming vs. ROB, 208
  ROB, 192
  Tomasulo’s advantages, 177–178
Write allocate
  AMD Opteron data cache, B-12
  definition, B-11
  example calculation, B-12
Write-back cache
  AMD Opteron example, B-12, B-14
  coherence maintenance, 381
  coherency, 359
  definition, B-11
  directory-based cache coherence, 383, 386
  Flash memory, 474
  FP register file, C-56
  invalidate protocols, 355–357, 360
  memory hierarchy basics, 75
  snooping coherence, 355, 356–357, 359
Write-back cycle (WB)
  basic MIPS pipeline, C-36
  data hazard stall minimization, C-17
  execution sequences, C-80
  hazards and forwarding, C-55 to C-56
  MIPS exceptions, C-49
  MIPS pipeline, C-52
  MIPS pipeline control, C-39
  MIPS R4000, C-63, C-65
  MIPS scoreboarding, C-74
  pipeline branch issues, C-40
  RISC classic pipeline, C-7 to C-8, C-10
  simple MIPS implementation, C-33
  simple RISC implementation, C-6
Write broadcast protocol, definition, 356
Write buffer
  AMD Opteron data cache, B-14
  Intel Core i7, 118, 121
  invalidate protocol, 356
  memory consistency, 393
  memory hierarchy basics, 75
  miss penalty reduction, 87, B-32, B-35 to B-36
  write merging example, 88
  write strategy, B-11
Write hit
  cache coherence, 358
  directory-based coherence, 424
  single-chip multicore multiprocessor, 414
  snooping coherence, 359
  write process, B-11
Write invalidate protocol
  directory-based cache coherence protocol example, 382–383
  example, 359, 360
  implementation, 356–357
  snooping coherence, 355–356
Write merging
  example, 88
  miss penalty reduction, 87
Write miss
  AMD Opteron data cache, B-12, B-14
  cache coherence, 358, 359, 360, 361
  definition, 385
  directory-based cache coherence, 380–383, 385–386
  example calculation, B-12
  locks via coherence, 390
  memory hierarchy basics, 76–77
  memory stall clock cycles, B-4
  Opteron data cache, B-12, B-14
  snooping cache coherence, 365
  write process, B-11 to B-12
  write speed calculations, 393
Write result stage
  data hazards, 154
  dynamic scheduling, 174–175
  hardware-based speculation, 192
  instruction steps, 175
  ROB instruction, 186
  scoreboarding, C-74 to C-75, C-78 to C-80
  status table examples, C-77
  Tomasulo’s algorithm, 178, 180, 190
Write serialization
  hardware primitives, 387
  multiprocessor cache coherency, 353
  snooping coherence, 356
Write stall, definition, B-11
Write strategy
  memory hierarchy considerations, B-6, B-10 to B-12
  virtual memory, B-45 to B-46
Write-through cache
  average memory access time, B-16
  coherency, 352
  invalidate protocol, 356
  memory hierarchy basics, 74–75
  miss penalties, B-32
  optimization, B-35
  snooping coherence, 359
  write process, B-11 to B-12
Write update protocol, definition, 356
WSCs, see Warehouse-scale computers (WSCs)
X
XBox, L-51
Xen Virtual Machine
  Amazon Web Services, 456–457
  characteristics, 111
Xerox Palo Alto Research Center, LAN history, F-99
XIMD architecture, L-34
Xon/Xoff, interconnection networks, F-10, F-17

Y
Yahoo!, WSCs, 465
Yield
  chip fabrication, 61–62
  cost trends, 27–32
  Fermi GTX 480, 324

Z
Z-80 microcontroller, cell phones, E-24
Zero condition code, MIPS core, K-9 to K-16
Zero-copy protocols
  definition, F-8
  message copying issues, F-91
Zero-load latency, Intel SCCC, F-70
Zuse, Konrad, L-4 to L-5
Zynga, FarmVille, 460