bjpcjp.github.io · Index I-3


Index ■ I-3

AMD Barcelona microprocessor, Google WSC server, 467
AMD Fusion, L-52
AMD K-5, L-30
AMD Opteron
  address translation, B-38
  Amazon Web Services, 457
  architecture, 15
  cache coherence, 361
  data cache example, B-12 to B-15, B-13
  Google WSC servers, 468–469
  inclusion, 398
  manufacturing cost, 62
  misses per instruction, B-15
  MOESI protocol, 362
  multicore processor performance, 400–401
  multilevel exclusion, B-35
  NetApp FAS6000 filer, D-42
  paged virtual memory example, B-54 to B-57
  vs. Pentium protection, B-57
  real-world server considerations, 52–55
  server energy savings, 25
  snooping limitations, 363–364
  SPEC benchmarks, 43
  TLB during address translation, B-47
AMD processors
  architecture flaws vs. success, A-45
  GPU computing history, L-52
  power consumption, F-85
  recent advances, L-33
  RISC history, L-22
  shared-memory multiprogramming workload, 378
  terminology, 313–315
  tournament predictors, 164
  Virtual Machines, 110
  VMMs, 129
Amortization of overhead, sorting case study, D-64 to D-67
AMPS, see Advanced mobile phone service (AMPS)
Andreessen, Marc, F-98
Android OS, 324
Annulling delayed branch, instructions, K-25
Antenna, radio receiver, E-23
Antialiasing, address translation, B-38
Antidependences
  compiler history, L-30 to L-31
  definition, 152
  finding, H-7 to H-8
  loop-level parallelism calculations, 320
  MIPS scoreboarding, C-72, C-79
Apogee Software, A-44
Apollo DN 10000, L-30
Apple iPad
  ARM Cortex-A8, 114
  memory hierarchy basics, 78
Application binary interface (ABI), control flow instructions, A-20
Application layer, definition, F-82
Applied Minds, L-74
Arbitration algorithm
  collision detection, F-23
  commercial interconnection networks, F-56
  examples, F-49
  Intel SCCC, F-70
  interconnection networks, F-21 to F-22, F-27, F-49 to F-50
  network impact, F-52 to F-55
  SAN characteristics, F-76
  switched-media networks, F-24
  switch microarchitecture, F-57 to F-58
  switch microarchitecture pipelining, F-60
  system area network history, F-100
Architect-compiler writer relationship, A-29 to A-30
Architecturally visible registers, register renaming vs. ROB, 208–209
Architectural Support for Compilers and Operating Systems (ASPLOS), L-11
Architecture, see also Computer architecture; CUDA (Compute Unified Device Architecture); Instruction set architecture (ISA); Vector architectures
  compiler writer-architect relationship, A-29 to A-30
  definition, 15
  heterogeneous, 262
  microarchitecture, 15–16, 247–254
  stack, A-3, A-27, A-44 to A-45
Areal density, disk storage, D-2
Argument pointer, VAX, K-71
Arithmetic intensity
  as FP operation, 286, 286–288
  Roofline model, 326, 326–327
Arithmetic/logical instructions
  desktop RISCs, K-11, K-22
  embedded RISCs, K-15, K-24
  Intel 80x86, K-49, K-53
  SPARC, K-31
  VAX, B-73
Arithmetic-logical units (ALUs)
  ARM Cortex-A8, 234, 236
  basic MIPS pipeline, C-36
  branch condition evaluation, A-19
  data forwarding, C-40 to C-41
  data hazards requiring stalls, C-19 to C-20
  data hazard stall minimization, C-17 to C-19
  DSP media extensions, E-10
  effective address cycle, C-6
  hardware-based execution, 185
  hardware-based speculation, 200–201, 201
  IA-64 instructions, H-35
  immediate operands, A-12
  integer division, J-54
  integer multiplication, J-48
  integer shifting over zeros, J-45 to J-46
  Intel Core i7, 238
  ISA operands, A-4 to A-5
  ISA performance and efficiency prediction, 241
  load interlocks, C-39
  microarchitectural techniques case study, 253
  MIPS operations, A-35, A-37
  MIPS pipeline control, C-38 to C-39
  MIPS pipeline FP operations, C-52 to C-53
  MIPS R4000, C-65
  operand forwarding, C-19
  operands per instruction example, A-6
  parallelism, 45


Arithmetic-logical units (continued)
  pipeline branch issues, C-39 to C-41
  pipeline execution rate, C-10 to C-11
  power/DLP issues, 322
  RISC architectures, K-5
  RISC classic pipeline, C-7
  RISC instruction set, C-4
  simple MIPS implementation, C-31 to C-33
  TX-2, L-49
ARM (Advanced RISC Machine)
  addressing modes, K-5, K-6
  arithmetic/logical instructions, K-15, K-24
  characteristics, K-4
  condition codes, K-12 to K-13
  constant extension, K-9
  control flow instructions, 14
  data transfer instructions, K-23
  embedded instruction format, K-8
  GPU computing history, L-52
  ISA class, 11
  memory addressing, 11
  multiply-accumulate, K-20
  operands, 12
  RISC instruction set lineage, K-43
  unique instructions, K-36 to K-37
ARM AMBA, OCNs, F-3
ARM Cortex-A8
  dynamic scheduling, 170
  ILP concepts, 148
  instruction decode, 234
  ISA performance and efficiency prediction, 241–243
  memory access penalty, 117
  memory hierarchy design, 78, 114–117, 115
  memory performance, 115–117
  multibanked caches, 86
  overview, 233
  pipeline performance, 233–236, 235
  pipeline structure, 232
  processor comparison, 242
  way prediction, 81
ARM Cortex-A9
  vs. A8 performance, 236
  Tegra 2, mobile vs. server GPUs, 323–324, 324
ARM Thumb
  addressing modes, K-6
  arithmetic/logical instructions, K-24
  characteristics, K-4
  condition codes, K-14
  constant extension, K-9
  data transfer instructions, K-23
  embedded instruction format, K-8
  ISAs, 14
  multiply-accumulate, K-20
  RISC code size, A-23
  unique instructions, K-37 to K-38
ARPA (Advanced Research Projects Agency)
  LAN history, F-99 to F-100
  WAN history, F-97
ARPANET, WAN history, F-97 to F-98
Array multiplier
  example, J-50
  integers, J-50
  multipass system, J-51
Arrays
  access age, 91
  blocking, 89–90
  bubble sort procedure, K-76
  cluster server outage/anomaly statistics, 435
  examples, 90
  FFT kernel, I-7
  Google WSC servers, 469
  Layer 3 network linkage, 445
  loop interchange, 88–89
  loop-level parallelism dependences, 318–319
  ocean application, I-9 to I-10
  recurrences, H-12
  WSC memory hierarchy, 445
  WSCs, 443
Array switch, WSCs, 443–444
ASC, see Advanced Simulation and Computing (ASC) program
ASCI, see Accelerated Strategic Computing Initiative (ASCI)
ASCII character format, 12, A-14
ASC Purple, F-67, F-100
ASI, see Advanced Switching Interconnect (ASI)
ASPLOS, see Architectural Support for Compilers and Operating Systems (ASPLOS)
Assembly language, 2
Association of Computing Machinery (ACM), L-3
Associativity, see also Set associativity
  cache block, B-9 to B-10, B-10
  cache optimization, B-22 to B-24, B-26, B-28 to B-30
  cloud computing, 460–461
  loop-level parallelism, 322
  multilevel inclusion, 398
  Opteron data cache, B-14
  shared-memory multiprocessors, 368
Astronautics ZS-1, L-29
Asynchronous events, exception requirements, C-44 to C-45
Asynchronous I/O, storage systems, D-35
Asynchronous Transfer Mode (ATM)
  interconnection networks, F-89
  LAN history, F-99
  packet format, F-75
  total time statistics, F-90
  VOQs, F-60
  as WAN, F-79
  WAN history, F-98
  WANs, F-4
ATA (Advanced Technology Attachment) disks
  Berkeley’s Tertiary Disk project, D-12
  disk storage, D-4
  historical background, L-81
  power, D-5
  RAID 6, D-9
  server energy savings, 25
Atanasoff, John, L-5
Atanasoff Berry Computer (ABC), L-5
ATI Radeon 9700, L-51
Atlas computer, L-9
ATM, see Asynchronous Transfer Mode (ATM)
ATM systems
  server benchmarks, 41
  TP benchmarks, D-18


Atomic exchange
  lock implementation, 389–390
  synchronization, 387–388
Atomic instructions
  barrier synchronization, I-14
  Core i7, 329
  Fermi GPU, 308
  T1 multithreading unicore performance, 229
Atomicity-consistency-isolation-durability (ACID), vs. WSC storage, 439
Atomic operations
  cache coherence, 360–361
  snooping cache coherence implementation, 365
“Atomic swap,” definition, K-20
Attributes field, IA-32 descriptor table, B-52
Autoincrement deferred addressing, VAX, K-67
Autonet, F-48
Availability
  commercial interconnection networks, F-66
  computer architecture, 11, 15
  computer systems, D-43 to D-44, D-44
  data on Internet, 344
  fault detection, 57–58
  I/O system design/evaluation, D-36
  loop-level parallelism, 217–218
  mainstream computing classes, 5
  modules, 34
  open-source software, 457
  RAID systems, 60
  as server characteristic, 7
  servers, 16
  source operands, C-74
  WSCs, 8, 433–435, 438–439
Average instruction execution time, L-6
Average Memory Access Time (AMAT)
  block size calculations, B-26 to B-28
  cache optimizations, B-22, B-26 to B-32, B-36
  cache performance, B-16 to B-21
  calculation, B-16 to B-17
  centralized shared-memory architectures, 351–352
  definition, B-30 to B-31
  memory hierarchy basics, 75–76
  miss penalty reduction, B-32
  via miss rates, B-29, B-29 to B-30
  as processor performance predictor, B-17 to B-20
Average reception factor
  centralized switched networks, F-32
  multi-device interconnection networks, F-26
AVX, see Advanced Vector Extensions (AVX)
AWS, see Amazon Web Services (AWS)

B

Back-off time, shared-media networks, F-23
Backpressure, congestion management, F-65
Backside bus, centralized shared-memory multiprocessors, 351
Balanced systems, sorting case study, D-64 to D-67
Balanced tree, MINs with nonblocking, F-34
Bandwidth, see also Throughput
  arbitration, F-49
  and cache miss, B-2 to B-3
  centralized shared-memory multiprocessors, 351–352
  communication mechanism, I-3
  congestion management, F-64 to F-65
  Cray Research T3D, F-87
  DDR DRAMs and DIMMs, 101
  definition, F-13
  DSM architecture, 379
  Ethernet and bridges, F-78
  FP arithmetic, J-62
  GDRAM, 322–323
  GPU computation, 327–328
  GPU Memory, 327
  ILP instruction fetch
    basic considerations, 202–203
    branch-target buffers, 203–206
    integrated units, 207–208
    return address predictors, 206–207
  interconnection networks, F-28
    multi-device networks, F-25 to F-29
    performance considerations, F-89
    two-device networks, F-12 to F-20
  vs. latency, 18–19, 19
  memory, and vector performance, 332
  memory hierarchy, 126
  network performance and topology, F-41
  OCN history, F-103
  performance milestones, 20
  point-to-point links and switches, D-34
  routing, F-50 to F-52
  routing/arbitration/switching impact, F-52
  shared- vs. switched-media networks, F-22
  SMP limitations, 363
  switched-media networks, F-24
  system area network history, F-101
  vs. TCP/IP reliance, F-95
  and topology, F-39
  vector load/store units, 276–277
  WSC memory hierarchy, 443–444, 444
Bandwidth gap, disk storage, D-3
Banerjee, Utpal, L-30 to L-31
Bank busy time, vector memory systems, G-9
Banked memory, see also Memory banks
  and graphics memory, 322–323
  vector architectures, G-10
Banks, Fermi GPUs, 297
Barcelona Supercomputer Center, F-76
Barnes
  characteristics, I-8 to I-9
  distributed-memory multiprocessor, I-32
  symmetric shared-memory multiprocessors, I-22, I-23, I-25


Barnes-Hut n-body algorithm, basic concept, I-8 to I-9
Barriers
  commercial workloads, 370
  Cray X1, G-23
  fetch-and-increment, I-20 to I-21
  hardware primitives, 387
  large-scale multiprocessor synchronization, I-13 to I-16, I-14, I-16, I-19, I-20
  synchronization, 298, 313, 329
BARRNet, see Bay Area Research Network (BARRNet)
Based indexed addressing mode, Intel 80x86, K-49, K-58
Base field, IA-32 descriptor table, B-52 to B-53
Base station
  cell phones, E-23
  wireless networks, E-22
Basic block, ILP, 149
Batch processing workloads
  WSC goals/requirements, 433
  WSC MapReduce and Hadoop, 437–438
Bay Area Research Network (BARRNet), F-80
BBN Butterfly, L-60
BBN Monarch, L-60
Before rounding rule, J-36
Benchmarking, see also specific benchmark suites
  desktop, 38–40
  EEMBC, E-12
  embedded applications
    basic considerations, E-12
    power consumption and efficiency, E-13
  fallacies, 56
  instruction set operations, A-15
  as performance measurement, 37–41
  real-world server considerations, 52–55
  response time restrictions, D-18
  server performance, 40–41
  sorting case study, D-64 to D-67
Benes topology
  centralized switched networks, F-33
  example, F-33
BER, see Bit error rate (BER)
Berkeley’s Tertiary Disk project
  failure statistics, D-13
  overview, D-12
  system log, D-43
Berners-Lee, Tim, F-98
Bertram, Jack, L-28
Best-case lower bounds, multi-device interconnection networks, F-25
Best-case upper bounds
  multi-device interconnection networks, F-26
  network performance and topology, F-41
Between instruction exceptions, definition, C-45
Biased exponent, J-15
Bidirectional multistage interconnection networks
  Benes topology, F-33
  characteristics, F-33 to F-34
  SAN characteristics, F-76
Bidirectional rings, topology, F-35 to F-36
Big Endian
  interconnection networks, F-12
  memory address interpretation, A-7
  MIPS core extensions, K-20 to K-21
  MIPS data transfers, A-34
Bigtable (Google), 438, 441
BINAC, L-5
Binary code compatibility
  embedded systems, E-15
  VLIW processors, 196
Binary-coded decimal, definition, A-14
Binary-to-decimal conversion, FP precisions, J-34
Bing search
  delays and user behavior, 451
  latency effects, 450–452
  WSC processor cost-performance, 473
Bisection bandwidth
  as network cost constraint, F-89
  network performance and topology, F-41
  NEWS communication, F-42
  topology, F-39
Bisection bandwidth, WSC array switch, 443
Bisection traffic fraction, network performance and topology, F-41
Bit error rate (BER), wireless networks, E-21
Bit rot, case study, D-61 to D-64
Bit selection, block placement, B-7
Black box network
  basic concept, F-5 to F-6
  effective bandwidth, F-17
  performance, F-12
  switched-media networks, F-24
  switched network topologies, F-40
Block addressing
  block identification, B-7 to B-8
  interleaved cache banks, 86
  memory hierarchy basics, 74
Blocked floating point arithmetic, DSP, E-6
Block identification
  memory hierarchy considerations, B-7 to B-9
  virtual memory, B-44 to B-45
Blocking
  benchmark fallacies, 56
  centralized switched networks, F-32
  direct networks, F-38
  HOL, see Head-of-line (HOL) blocking
  network performance and topology, F-41
Blocking calls, shared-memory multiprocessor workload, 369
Blocking factor, definition, 90
Block multithreading, definition, L-34
Block offset
  block identification, B-7 to B-8
  cache optimization, B-38
  definition, B-7 to B-8
  direct-mapped cache, B-9
  example, B-9
  main memory, B-44
  Opteron data cache, B-13, B-13 to B-14


Block placement
  memory hierarchy considerations, B-7
  virtual memory, B-44
Block replacement
  memory hierarchy considerations, B-9 to B-10
  virtual memory, B-45
Blocks, see also Cache block; Thread Block
  ARM Cortex-A8, 115
  vs. bytes per reference, 378
  compiler optimizations, 89–90
  definition, B-2
  disk array deconstruction, D-51, D-55
  disk deconstruction case study, D-48 to D-51
  global code scheduling, H-15 to H-16
  L3 cache size, misses per instruction, 371
  LU kernel, I-8
  memory hierarchy basics, 74
  memory in cache, B-61
  placement in main memory, B-44
  RAID performance prediction, D-57 to D-58
  TI TMS320C55 DSP, E-8
  uncached state, 384
Block servers, vs. filers, D-34 to D-35
Block size
  vs. access time, B-28
  memory hierarchy basics, 76
  vs. miss rate, B-27
Block transfer engine (BLT)
  Cray Research T3D, F-87
  interconnection network protection, F-87
BLT, see Block transfer engine (BLT)
Body of Vectorized Loop
  definition, 292, 313
  GPU hardware, 295–296, 311
  GPU Memory structure, 304
  NVIDIA GPU, 296
  SIMD Lane Registers, 314
  Thread Block Scheduler, 314
Boggs, David, F-99
BOMB, L-4
Booth recoding, J-8 to J-9, J-9, J-10 to J-11
  chip comparison, J-60 to J-61
  integer multiplication, J-49
Bose-Einstein formula, definition, 30
Bounds checking, segmented virtual memory, B-52
Branch byte, VAX, K-71
Branch delay slot
  characteristics, C-23 to C-25
  control hazards, C-41
  MIPS R4000, C-64
  scheduling, C-24
Branches
  canceling, C-24 to C-25
  conditional branches, 300–303, A-17, A-19 to A-20, A-21
  control flow instructions, A-16, A-18
  delayed, C-23
  delay slot, C-65
  IBM 360, K-86 to K-87
  instructions, K-25
  MIPS control flow instructions, A-38
  MIPS operations, A-35
  nullifying, C-24 to C-25
  RISC instruction set, C-5
  VAX, K-71 to K-72
  WCET, E-4
Branch folding, definition, 206
Branch hazards
  basic considerations, C-21
  penalty reduction, C-22 to C-25
  pipeline issues, C-39 to C-42
  scheme performance, C-25 to C-26
  stall reduction, C-42
Branch history table, basic scheme, C-27 to C-30
Branch offsets, control flow instructions, A-18
Branch penalty
  examples, 205
  instruction fetch bandwidth, 203–206
  reduction, C-22 to C-25
  simple scheme examples, C-25
Branch prediction
  accuracy, C-30
  branch cost reduction, 162–167
  correlation, 162–164
  cost reduction, C-26
  dynamic, C-27 to C-30
  early schemes, L-27 to L-28
  ideal processor, 214
  ILP exploitation, 201
  instruction fetch bandwidth, 205
  integrated instruction fetch units, 207
  Intel Core i7, 166–167, 239–241
  misprediction rates on SPEC89, 166
  static, C-26 to C-27
  trace scheduling, H-19
  two-bit predictor comparison, 165
Branch-prediction buffers, basic considerations, C-27 to C-30, C-29
Branch registers
  IA-64, H-34
  PowerPC instructions, K-32 to K-33
Branch stalls, MIPS R4000 pipeline, C-67
Branch-target address
  branch hazards, C-42
  MIPS control flow instructions, A-38
  MIPS pipeline, C-36, C-37
  MIPS R4000, C-25
  pipeline branches, C-39
  RISC instruction set, C-5
Branch-target buffers
  ARM Cortex-A8, 233
  branch hazard stalls, C-42
  example, 203
  instruction fetch bandwidth, 203–206
  instruction handling, 204
  MIPS control flow instructions, A-38
Branch-target cache, see Branch-target buffers
Brewer, Eric, L-73
Bridges
  and bandwidth, F-78
  definition, F-78
Bubbles
  and deadlock, F-47
  routing comparison, F-54
  stall as, C-13
Bubble sort, code example, K-76
Buckets, D-26
Buffered crossbar switch, switch microarchitecture, F-62
Buffered wormhole switching, F-51


Buffers
  branch-prediction, C-27 to C-30, C-29
  branch-target, 203–206, 204, 233, A-38, C-42
  DSM multiprocessor cache coherence, I-38 to I-40
  Intel SCCC, F-70
  interconnection networks, F-10 to F-11
  memory, 208
  MIPS scoreboarding, C-74
  network interface functions, F-7
  ROB, 184–192, 188–189, 199, 208–210, 238
  switch microarchitecture, F-58 to F-60
  TLB, see Translation lookaside buffer (TLB)
  translation buffer, B-45 to B-46
  write buffer, B-11, B-14, B-32, B-35 to B-36
Bundles
  IA-64, H-34 to H-35, H-37
  Itanium 2, H-41
Burks, Arthur, L-3
Burroughs B5000, L-16
Bus-based coherent multiprocessors, L-59 to L-60
Buses
  barrier synchronization, I-16
  cache coherence, 391
  centralized shared-memory multiprocessors, 351
  definition, 351
  dynamic scheduling with Tomasulo’s algorithm, 172, 175
  Google WSC servers, 469
  I/O bus replacements, D-34, D-34
  large-scale multiprocessor synchronization, I-12 to I-13
  NEWS communication, F-42
  scientific workloads on symmetric shared-memory multiprocessors, I-25
  Sony PlayStation 2 Emotion Engine, E-18
  vs. switched networks, F-2
  switch microarchitecture, F-55 to F-56
  Tomasulo’s algorithm, 180, 182
Bypassing, see also Forwarding
  data hazards requiring stalls, C-19 to C-20
  dynamically scheduled pipelines, C-70 to C-71
  MIPS R4000, C-65
  SAN example, F-74
Byte displacement addressing, VAX, K-67
Byte offset
  misaligned addresses, A-8
  PTX instructions, 300
Bytes
  aligned/misaligned addresses, A-8
  arithmetic intensity example, 286
  Intel 80x86 integer operations, K-51
  memory address interpretation, A-7 to A-8
  MIPS data transfers, A-34
  MIPS data types, A-34
  operand types/sizes, A-14
  per reference, vs. block size, 378
Byte/word/long displacement deferred addressing, VAX, K-67

C

CAD, see Computer aided design (CAD) tools
Cache bandwidth
  caches, 78
  multibanked caches, 85–86
  nonblocking caches, 83–85
  pipelined cache access, 82
Cache block
  AMD Opteron data cache, B-13, B-13 to B-14
  cache coherence protocol, 357–358
  compiler optimizations, 89–90
  critical word first, 86–87
  definition, B-2
  directory-based cache coherence protocol, 382–386, 383
  false sharing, 366
  GPU comparisons, 329
  inclusion, 397–398
  memory block, B-61
  miss categories, B-26
  miss rate reduction, B-26 to B-28
  scientific workloads on symmetric shared-memory multiprocessors, I-22, I-25, I-25
  shared-memory multiprogramming workload, 375–377, 376
  way prediction, 81
  write invalidate protocol implementation, 356–357
  write strategy, B-10
Cache coherence
  advanced directory protocol case study, 420–426
  basic considerations, 112–113
  Cray X1, G-22
  directory-based, see Directory-based cache coherence
  enforcement, 354–355
  extensions, 362–363
  hardware primitives, 388
  Intel SCCC, F-70
  large-scale multiprocessor history, L-61
  large-scale multiprocessors
    deadlock and buffering, I-38 to I-40
    directory controller, I-40 to I-41
    DSM implementation, I-36 to I-37
    overview, I-34 to I-36
  latency hiding with speculation, 396
  lock implementation, 389–391
  mechanism, 358
  memory hierarchy basics, 75
  multiprocessor-optimized software, 409
  multiprocessors, 352–353
  protocol definitions, 354–355
  single-chip multicore processor case study, 412–418
  single memory location example, 352
  snooping, see Snooping cache coherence
  state diagram, 361
  steps and bus traffic examples, 391
  write-back cache, 360
Cache definition, B-2
Cache hit
  AMD Opteron example, B-14


  definition, B-2
  example calculation, B-5
Cache latency, nonblocking cache, 83–84
Cache miss
  and average memory access time, B-17 to B-20
  block replacement, B-10
  definition, B-2
  distributed-memory multiprocessors, I-32
  example calculations, 83–84
  Intel Core i7, 122
  interconnection network, F-87
  large-scale multiprocessors, I-34 to I-35
  nonblocking cache, 84
  single vs. multiple thread executions, 228
  WCET, E-4
Cache-only memory architecture (COMA), L-61
Cache optimizations
  basic categories, B-22
  basic optimizations, B-40
  case studies, 131–133
  compiler-controlled prefetching, 92–95
  compiler optimizations, 87–90
  critical word first, 86–87
  energy consumption, 81
  hardware instruction prefetching, 91–92, 92
  hit time reduction, B-36 to B-40
  miss categories, B-23 to B-26
  miss penalty reduction
    via multilevel caches, B-30 to B-35
    read misses vs. writes, B-35 to B-36
  miss rate reduction
    via associativity, B-28 to B-30
    via block size, B-26 to B-28
    via cache size, B-28
  multibanked caches, 85–86, 86
  nonblocking caches, 83–85, 84
  overview, 78–79
  pipelined cache access, 82
  simple first-level caches, 79–80
  techniques overview, 96
  way prediction, 81–82
  write buffer merging, 87, 88
Cache organization
  blocks, B-7, B-8
  Opteron data cache, B-12 to B-13, B-13
  optimization, B-19
  performance impact, B-19
Cache performance
  average memory access time, B-16 to B-20
  basic considerations, B-3 to B-6, B-16
  basic equations, B-22
  basic optimizations, B-40
  cache optimization, 96
  case study, 131–133
  example calculation, B-16 to B-17
  out-of-order processors, B-20 to B-22
  prediction, 125–126
Cache prefetch, cache optimization, 92
Caches, see also Memory hierarchy
  access time vs. block size, B-28
  AMD Opteron example, B-12 to B-15, B-13, B-15
  basic considerations, B-48 to B-49
  coining of term, L-11
  definition, B-2
  early work, L-10
  embedded systems, E-4 to E-5
  Fermi GPU architecture, 306
  ideal processor, 214
  ILP for realizable processors, 216–218
  Itanium 2, H-42
  multichip multicore multiprocessor, 419
  parameter ranges, B-42
  Sony PlayStation 2 Emotion Engine, E-18
  vector processors, G-25
  vs. virtual memory, B-42 to B-43
Cache size
  and access time, 77
  AMD Opteron example, B-13 to B-14
  energy consumption, 81
  highly parallel memory systems, 133
  memory hierarchy basics, 76
  misses per instruction, 126, 371
  miss rate, B-24 to B-25
  vs. miss rate, B-27
  miss rate reduction, B-28
  multilevel caches, B-33
  and relative execution time, B-34
  scientific workloads
    distributed-memory multiprocessors, I-29 to I-31
    symmetric shared-memory multiprocessors, I-22 to I-23, I-24
  shared-memory multiprogramming workload, 376
  virtually addressed, B-37
CACTI
  cache optimization, 79–80, 81
  memory access times, 77
Caller saving, control flow instructions, A-19 to A-20
Call gate
  IA-32 segment descriptors, B-53
  segmented virtual memory, B-54
Calls
  compiler structure, A-25 to A-26
  control flow instructions, A-17, A-19 to A-21
  CUDA Thread, 297
  dependence analysis, 321
  high-level instruction set, A-42 to A-43
  Intel 80x86 integer operations, K-51
  invocation options, A-19
  ISAs, 14
  MIPS control flow instructions, A-38
  MIPS registers, 12
  multiprogrammed workload, 378
  NVIDIA GPU Memory structures, 304–305
  return address predictors, 206
  shared-memory multiprocessor workload, 369
  user-to-OS gates, B-54
  VAX, K-71 to K-72
Canceling branch, branch delay slots, C-24 to C-25
Canonical form, AMD64 paged virtual memory, B-55
Capabilities, protection schemes, L-9 to L-10


Capacity misses
  blocking, 89–90
  and cache size, B-24
  definition, B-23
  memory hierarchy basics, 75
  scientific workloads on symmetric shared-memory multiprocessors, I-22, I-23, I-24
  shared-memory workload, 373
CAPEX, see Capital expenditures (CAPEX)
Capital expenditures (CAPEX)
  WSC costs, 452–455, 453
  WSC Flash memory, 475
  WSC TCO case study, 476–478
Carrier sensing, shared-media networks, F-23
Carrier signal, wireless networks, E-21
Carry condition code, MIPS core, K-9 to K-16
Carry-in, carry-skip adder, J-42
Carry-lookahead adder (CLA)
  chip comparison, J-60
  early computer arithmetic, J-63
  example, J-38
  integer addition speedup, J-37 to J-41
  with ripple-carry adder, J-42
  tree, J-40 to J-41
Carry-out
  carry-lookahead circuit, J-38
  floating-point addition speedup, J-25
Carry-propagate adder (CPA)
  integer multiplication, J-48, J-51
  multipass array multiplier, J-51
Carry-save adder (CSA)
  integer division, J-54 to J-55
  integer multiplication, J-47 to J-48, J-48
Carry-select adder
  characteristics, J-43 to J-44
  chip comparison, J-60
  example, J-43
Carry-skip adder (CSA)
  characteristics, J-41 to J-43
  example, J-42, J-44
CAS, see Column access strobe (CAS)
Case statements
  control flow instruction addressing modes, A-18
  return address predictors, 206
Case studies
  advanced directory protocol, 420–426
  cache optimization, 131–133
  cell phones
    block diagram, E-23
    Nokia circuit board, E-24
    overview, E-20
    radio receiver, E-23
    standards and evolution, E-25
    wireless communication challenges, E-21
    wireless networks, E-21 to E-22
  chip fabrication cost, 61–62
  computer system power consumption, 63–64
  directory-based coherence, 418–420
  dirty bits, D-61 to D-64
  disk array deconstruction, D-51 to D-55, D-52 to D-55
  disk deconstruction, D-48 to D-51, D-50
  highly parallel memory systems, 133–136
  instruction set principles, A-47 to A-54
  I/O subsystem design, D-59 to D-61
  memory hierarchy, B-60 to B-67
  microarchitectural techniques, 247–254
  pipelining example, C-82 to C-88
  RAID performance prediction, D-57 to D-59
  RAID reconstruction, D-55 to D-57
  Sanyo VPC-SX500 digital camera, E-19
  single-chip multicore processor, 412–418
  Sony PlayStation 2 Emotion Engine, E-15 to E-18
  sorting, D-64 to D-67
  vector kernel on vector processor and GPU, 334–336
  WSC resource allocation, 478–479
  WSC TCO, 476–478
CCD, see Charge-coupled device (CCD)
C/C++ language
  dependence analysis, H-6
  GPU computing history, L-52
  hardware impact on software development, 4
  integer division/remainder, J-12
  loop-level parallelism dependences, 318, 320–321
  NVIDIA GPU programming, 289
  return address predictors, 206
CDB, see Common data bus (CDB)
CDC, see Control Data Corporation (CDC)
CDF, datacenter, 487
CDMA, see Code division multiple access (CDMA)
Cedar project, L-60
Cell, Barnes-Hut n-body algorithm, I-9
Cell phones
  block diagram, E-23
  embedded system case study
    characteristics, E-22 to E-24
    overview, E-20
    radio receiver, E-23
    standards and evolution, E-25
    wireless network overview, E-21 to E-22
  Flash memory, D-3
  GPU features, 324
  Nokia circuit board, E-24
  wireless communication challenges, E-21
  wireless networks, E-22
Centralized shared-memory multiprocessors
  basic considerations, 351–352
  basic structure, 346–347, 347
  cache coherence, 352–353
  cache coherence enforcement, 354–355
  cache coherence example, 357–362
  cache coherence extensions, 362–363
  invalidate protocol implementation, 356–357


  SMP and snooping limitations, 363–364
  snooping coherence implementation, 365–366
  snooping coherence protocols, 355–356

Centralized switched networks
  example, F-31
  routing algorithms, F-48
  topology, F-30 to F-34, F-31
Centrally buffered switch, microarchitecture, F-57
Central processing unit (CPU)
  Amdahl’s law, 48
  average memory access time, B-17
  cache performance, B-4
  coarse-grained multithreading, 224
  early pipelined versions, L-26 to L-27
  exception stopping/restarting, C-47
  extensive pipelining, C-81
  Google server usage, 440
  GPU computing history, L-52
  vs. GPUs, 288
  instruction set complications, C-50
  MIPS implementation, C-33 to C-34
  MIPS precise exceptions, C-59 to C-60
  MIPS scoreboarding, C-77
  performance measurement history, L-6
  pipeline branch issues, C-41
  pipelining exceptions, C-43 to C-46
  pipelining performance, C-10
  Sony PlayStation 2 Emotion Engine, E-17
  SPEC server benchmarks, 40
  TI TMS320C55 DSP, E-8
  vector memory systems, G-10
Central processing unit (CPU) time
  execution time, 36
  modeling, B-18
  processor performance calculations, B-19 to B-21
  processor performance equation, 49–51
  processor performance time, 49
Cerf, Vint, F-97
CERN, see European Center for Particle Research (CERN)
CFM, see Current frame pointer (CFM)
Chaining
  convoys, DAXPY code, G-16
  vector processor performance, G-11 to G-12, G-12
  VMIPS, 268–269
Channel adapter, see Network interface
Channels, cell phones, E-24
Character
  floating-point performance, A-2
  as operand type, A-13 to A-14
  operand types/sizes, 12
Charge-coupled device (CCD), Sanyo VPC-SX500 digital camera, E-19
Checksum
  dirty bits, D-61 to D-64
  packet format, F-7
Chillers
  Google WSC, 466, 468
  WSC containers, 464
  WSC cooling systems, 448–449
Chime
  definition, 309
  GPUs vs. vector architectures, 308
  multiple lanes, 272
  NVIDIA GPU computational structures, 296
  vector chaining, G-12
  vector execution time, 269, G-4
  vector performance, G-2
  vector sequence calculations, 270
Chip-crossing wire delay, F-70
  OCN history, F-103
Chipkill
  memory dependability, 104–105
  WSCs, 473
Choke packets, congestion management, F-65
Chunk
  disk array deconstruction, D-51
  Shear algorithm, D-53
CIFS, see Common Internet File System (CIFS)
Circuit switching
  congestion management, F-64 to F-65
  interconnected networks, F-50
Circulating water system (CWS)
  cooling system design, 448
  WSCs, 448
CISC, see Complex Instruction Set Computer (CISC)
CLA, see Carry-lookahead adder (CLA)
Clean block, definition, B-11
Climate Savers Computing Initiative, power supply efficiencies, 462
Clock cycles
  basic MIPS pipeline, C-34 to C-35
  and branch penalties, 205
  cache performance, B-4
  FP pipeline, C-66
  and full associativity, B-23
  GPU conditional branching, 303
  ILP exploitation, 197, 200
  ILP exposure, 157
  instruction fetch bandwidth, 202–203
  instruction steps, 173–175
  Intel Core i7 branch predictor, 166
  MIPS exceptions, C-48
  MIPS pipeline, C-52
  MIPS pipeline FP operations, C-52 to C-53
  MIPS scoreboarding, C-77
  miss rate calculations, B-31 to B-32
  multithreading approaches, 225–226
  pipelining performance, C-10
  processor performance equation, 49
  RISC classic pipeline, C-7
  Sun T1 multithreading, 226–227
  switch microarchitecture pipelining, F-61
  vector architectures, G-4
  vector execution time, 269
  vector multiple lanes, 271–273
  VLIW processors, 195
Clock cycles per instruction (CPI)
  addressing modes, A-10
  ARM Cortex-A8, 235
  branch schemes, C-25 to C-26, C-26
  cache behavior impact, B-18 to B-19
  cache hit calculation, B-5
  data hazards requiring stalls, C-20


  extensive pipelining, C-81
  floating-point calculations, 50–52
  ILP concepts, 148–149, 149
  ILP exploitation, 192
  Intel Core i7, 124, 240, 240–241
  microprocessor advances, L-33
  MIPS R4000 performance, C-69
  miss penalty reduction, B-32
  multiprocessing/multithreading-based performance, 398–400
  multiprocessor communication calculations, 350
  pipeline branch issues, C-41
  pipeline with stalls, C-12 to C-13
  pipeline structural hazards, C-15 to C-16
  pipelining concept, C-3
  processor performance calculations, 218–219
  processor performance time, 49–51
  and processor speed, 244
  RISC history, L-21
  shared-memory workloads, 369–370
  simple MIPS implementation, C-33 to C-34
  structural hazards, C-13
  Sun T1 multithreading unicore performance, 229
  Sun T1 processor, 399
  Tomasulo’s algorithm, 181
  VAX 8700 vs. MIPS M2000, K-82
Clock cycle time
  and associativity, B-29
  average memory access time, B-21 to B-22
  cache optimization, B-19 to B-20, B-30
  cache performance, B-4
  CPU time equation, 49–50, B-18
  MIPS implementation, C-34
  miss penalties, 219
  pipeline performance, C-12, C-14 to C-15
  pipelining, C-3
  shared- vs. switched-media networks, F-25
Clock periods, processor performance equation, 48–49
Clock rate
  DDR DRAMS and DIMMS, 101
  ILP for realizable processors, 218
  Intel Core i7, 236–237
  microprocessor advances, L-33
  microprocessors, 24
  MIPS pipeline FP operations, C-53
  multicore processor performance, 400
  and processor speed, 244
Clocks, processor performance equation, 48–49
Clock skew, pipelining performance, C-10
Clock ticks
  cache coherence, 391
  processor performance equation, 48–49
Clos network
  Benes topology, F-33
  as nonblocking, F-33
Cloud computing
  basic considerations, 455–461
  clusters, 345
  provider issues, 471–472
  utility computing history, L-73 to L-74
Clusters
  characteristics, 8, I-45
  cloud computing, 345
  as computer class, 5
  containers, L-74 to L-75
  Cray X1, G-22
  Google WSC servers, 469
  historical background, L-62 to L-64
  IBM Blue Gene/L, I-41 to I-44, I-43 to I-44
  interconnection network domains, F-3 to F-4
  Internet Archive Cluster, see Internet Archive Cluster
  large-scale multiprocessors, I-6
  large-scale multiprocessor trends, L-62 to L-63
  outage/anomaly statistics, 435
  power consumption, F-85
  utility computing, L-73 to L-74
  as WSC forerunners, 435–436, L-72 to L-73
  WSC storage, 442–443
Cm*, L-56
C.mmp, L-56
CMOS
  DRAM, 99
  first vector computers, L-46, L-48
  ripple-carry adder, J-3
  vector processors, G-25 to G-27
Coarse-grained multithreading, definition, 224–226
Cocke, John, L-19, L-28
Code division multiple access (CDMA), cell phones, E-25
Code generation
  compiler structure, A-25 to A-26, A-30
  dependences, 220
  general-purpose register computers, A-6
  ILP limitation studies, 220
  loop unrolling/scheduling, 162
Code scheduling
  example, H-16
  parallelism, H-15 to H-23
  superblock scheduling, H-21 to H-23, H-22
  trace scheduling, H-19 to H-21, H-20
Code size
  architect-compiler considerations, A-30
  benchmark information, A-2
  comparisons, A-44
  flawless architecture design, A-45
  instruction set encoding, A-22 to A-23
  ISA and compiler technology, A-43 to A-44
  loop unrolling, 160–161
  multiprogramming, 375–376
  PMDs, 6
  RISCs, A-23 to A-24
  VAX design, A-45
  VLIW model, 195–196
Coefficient of variance, D-27
Coerced exceptions
  definition, C-45
  exception types, C-46
Coherence, see Cache coherence
Coherence misses
  definition, 366
  multiprogramming, 376–377
  role, 367
  scientific workloads on symmetric shared-memory multiprocessors, I-22
  snooping protocols, 355–356


Cold-start misses, definition, B-23
Collision, shared-media networks, F-23
Collision detection, shared-media networks, F-23
Collision misses, definition, B-23
Collocation sites, interconnection networks, F-85
COLOSSUS, L-4
Column access strobe (CAS), DRAM, 98–99
Column major order
  blocking, 89
  stride, 278
COMA, see Cache-only memory architecture (COMA)
Combining tree, large-scale multiprocessor synchronization, I-18
Command queue depth, vs. disk throughput, D-4
Commercial interconnection networks
  congestion management, F-64 to F-66
  connectivity, F-62 to F-63
  cross-company interoperability, F-63 to F-64
  DECstation 5000 reboots, F-69
  fault tolerance, F-66 to F-69
Commercial workloads
  execution time distribution, 369
  symmetric shared-memory multiprocessors, 367–374
Commit stage, ROB instruction, 186–187, 188
Commodities
  Amazon Web Services, 456–457
  array switch, 443
  cloud computing, 455
  cost vs. price, 32–33
  cost trends, 27–28, 32
  Ethernet rack switch, 442
  HPC hardware, 436
  shared-memory multiprocessor, 441
  WSCs, 441
Commodity cluster, characteristics, I-45
Common data bus (CDB)
  dynamic scheduling with Tomasulo’s algorithm, 172, 175
  FP unit with Tomasulo’s algorithm, 185
  reservation stations/register tags, 177
  Tomasulo’s algorithm, 180, 182
Common Internet File System (CIFS), D-35
  NetApp FAS6000 filer, D-41 to D-42
Communication bandwidth, basic considerations, I-3
Communication latency, basic considerations, I-3 to I-4
Communication latency hiding, basic considerations, I-4
Communication mechanism
  adaptive routing, F-93 to F-94
  internetworking, F-81 to F-82
  large-scale multiprocessors advantages, I-4 to I-6
  metrics, I-3 to I-4
  multiprocessor communication calculations, 350
  network interfaces, F-7 to F-8
  NEWS communication, F-42 to F-43
  SMP limitations, 363
Communication protocol, definition, F-8
Communication subnets, see Interconnection networks
Communication subsystems, see Interconnection networks
Compare instruction, VAX, K-71
Compares, MIPS core, K-9 to K-16
Compare-select-store unit (CSSU), TI TMS320C55 DSP, E-8
Compiler-controlled prefetching, miss penalty/rate reduction, 92–95
Compiler optimizations
  blocking, 89–90
  cache optimization, 131–133
  compiler assumptions, A-25 to A-26
  and consistency model, 396
  loop interchange, 88–89
  miss rate reduction, 87–90
  passes, A-25
  performance impact, A-27
  types and classes, A-28
Compiler scheduling
  data dependences, 151
  definition, C-71
  hardware support, L-30 to L-31
  IBM 360 architecture, 171
Compiler speculation, hardware support
  memory references, H-32
  overview, H-27
  preserving exception behavior, H-28 to H-32
Compiler techniques
  dependence analysis, H-7
  global code scheduling, H-17 to H-18
  ILP exposure, 156–162
  vectorization, G-14
  vector sparse matrices, G-12
Compiler technology
  and architecture decisions, A-27 to A-29
  Cray X1, G-21 to G-22
  ISA and code size, A-43 to A-44
  multimedia instruction support, A-31 to A-32
  register allocation, A-26 to A-27
  structure, A-24 to A-26, A-25
Compiler writer-architect relationship, A-29 to A-30
Complex Instruction Set Computer (CISC)
  RISC history, L-22
  VAX as, K-65
Compulsory misses
  and cache size, B-24
  definition, B-23
  memory hierarchy basics, 75
  shared-memory workload, 373
Computation-to-communication ratios
  parallel programs, I-10 to I-12
  scaling, I-11
Compute-optimized processors, interconnection networks, F-88
Computer aided design (CAD) tools, cache optimization, 79–80
Computer architecture, see also Architecture
  coining of term, K-83 to K-84
  computer design innovations, 4
  defining, 11


  definition, L-17 to L-18
  exceptions, C-44
  factors in improvement, 2
  flawless design, K-81
  flaws and success, K-81
  floating-point addition, rules, J-24
  goals/functions requirements, 15, 15–16, 16
  high-level language, L-18 to L-19
  instruction execution issues, K-81
  ISA, 11–15
  multiprocessor software development, 407–409
  parallel, 9–10
  WSC basics, 432, 441–442
    array switch, 443
    memory hierarchy, 443–446
    storage, 442–443
Computer arithmetic
  chip comparison, J-58, J-58 to J-61, J-59 to J-60
  floating point
    exceptions, J-34 to J-35
    fused multiply-add, J-32 to J-33
    IEEE 754, J-16
    iterative division, J-27 to J-31
    and memory bandwidth, J-62
    overview, J-13 to J-14
    precisions, J-33 to J-34
    remainder, J-31 to J-32
    special values, J-16
    special values and denormals, J-14 to J-15
    underflow, J-36 to J-37, J-62
  floating-point addition
    denormals, J-26 to J-27
    overview, J-21 to J-25
    speedup, J-25 to J-26
  floating-point multiplication
    denormals, J-20 to J-21
    examples, J-19
    overview, J-17 to J-20
    rounding, J-18
  integer addition speedup
    carry-lookahead, J-37 to J-41
    carry-lookahead circuit, J-38
    carry-lookahead tree, J-40
    carry-lookahead tree adder, J-41
    carry-select adder, J-43, J-43 to J-44, J-44
    carry-skip adder, J-41 to J-43, J-42
    overview, J-37
  integer arithmetic
    language comparison, J-12
    overflow, J-11
    Radix-2 multiplication/division, J-4, J-4 to J-7
    restoring/nonrestoring division, J-6
    ripple-carry addition, J-2 to J-3, J-3
    signed numbers, J-7 to J-10
    systems issues, J-10 to J-13
  integer division
    radix-2 division, J-55
    radix-4 division, J-56
    radix-4 SRT division, J-57
    with single adder, J-54 to J-58
    SRT division, J-45 to J-47, J-46
  integer-FP conversions, J-62
  integer multiplication
    array multiplier, J-50
    Booth recoding, J-49
    even/odd array, J-52
    with many adders, J-50 to J-54
    multipass array multiplier, J-51
    signed-digit addition table, J-54
    with single adder, J-47 to J-49, J-48
    Wallace tree, J-53
  integer multiplication/division, shifting over zeros, J-45 to J-47
  overview, J-2
  rounding modes, J-20
Computer chip fabrication
  cost case study, 61–62
  Cray X1E, G-24
Computer classes
  desktops, 6
  embedded computers, 8–9
  example, 5
  overview, 5
  parallelism and parallel architectures, 9–10
  PMDs, 6
  servers, 7
  and system characteristics, E-4
  warehouse-scale computers, 8
Computer design principles
  Amdahl’s law, 46–48
  common case, 45–46
  parallelism, 44–45
  principle of locality, 45
  processor performance equation, 48–52
Computer history, technology and architecture, 2–5
Computer room air-conditioning (CRAC), WSC infrastructure, 448–449
Compute tiles, OCNs, F-3
Compute Unified Device Architecture, see CUDA (Compute Unified Device Architecture)
Conditional branches
  branch folding, 206
  compare frequencies, A-20
  compiler performance, C-24 to C-25
  control flow instructions, 14, A-16, A-17, A-19, A-21
  desktop RISCs, K-17
  embedded RISCs, K-17
  evaluation, A-19
  global code scheduling, H-16, H-16
  GPUs, 300–303
  ideal processor, 214
  ISAs, A-46
  MIPS control flow instructions, A-38, A-40
  MIPS core, K-9 to K-16
  PA-RISC instructions, K-34, K-34
  predictor misprediction rates, 166
  PTX instruction set, 298–299
  static branch prediction, C-26
  types, A-20
  vector-GPU comparison, 311
Conditional instructions
  exposing parallelism, H-23 to H-27
  limitations, H-26 to H-27
Condition codes
  branch conditions, A-19
  control flow instructions, 14
  definition, C-5
  high-level instruction set, A-43
  instruction set complications, C-50
  MIPS core, K-9 to K-16
  pipeline branch penalties, C-23
  VAX, K-71


Conflict misses
  and block size, B-28
  cache coherence mechanism, 358
  and cache size, B-24, B-26
  definition, B-23
  as kernel miss, 376
  L3 caches, 371
  memory hierarchy basics, 75
  OLTP workload, 370
  PIDs, B-37
  shared-memory workload, 373
Congestion control
  commercial interconnection networks, F-64
  system area network history, F-101
Congestion management, commercial interconnection networks, F-64 to F-66
Connectedness
  dimension-order routing, F-47 to F-48
  interconnection network topology, F-29
Connection delay, multi-device interconnection networks, F-25
Connection Machine CM-5, F-91, F-100
Connection Multiprocessor 2, L-44, L-57
Consistency, see Memory consistency
Constant extension
  desktop RISCs, K-9
  embedded RISCs, K-9
Constellation, characteristics, I-45
Containers
  airflow, 466
  cluster history, L-74 to L-75
  Google WSCs, 464–465, 465
Context switching
  definition, 106, B-49
  Fermi GPU, 307
Control bits, messages, F-6
Control Data Corporation (CDC), first vector computers, L-44 to L-45
Control Data Corporation (CDC) 6600
  computer architecture definition, L-18
  dynamically scheduling with scoreboard, C-71 to C-72
  early computer arithmetic, J-64
  first dynamic scheduling, L-27
  MIPS scoreboarding, C-75, C-77
  multiple-issue processor development, L-28
  multithreading history, L-34
  RISC history, L-19
Control Data Corporation (CDC) STAR-100
  first vector computers, L-44
  peak performance vs. start-up overhead, 331
Control Data Corporation (CDC) STAR processor, G-26
Control dependences
  conditional instructions, H-24
  as data dependence, 150
  global code scheduling, H-16
  hardware-based speculation, 183
  ILP, 154–156
  ILP hardware model, 214
  and Tomasulo’s algorithm, 170
  vector mask registers, 275–276
Control flow instructions
  addressing modes, A-17 to A-18
  basic considerations, A-16 to A-17, A-20 to A-21
  classes, A-17
  conditional branch options, A-19
  conditional instructions, H-27
  hardware vs. software speculation, 221
  Intel 80x86 integer operations, K-51
  ISAs, 14
  MIPS, A-37 to A-38, A-38
  procedure invocation options, A-19 to A-20
Control hazards
  ARM Cortex-A8, 235
  definition, C-11
Control instructions
  Intel 80x86, K-53
  RISCs
    desktop systems, K-12, K-22
    embedded systems, K-16
  VAX, B-73
Controllers, historical background, L-80 to L-81
Controller transitions
  directory-based, 422
  snooping cache, 421
Control Processor
  definition, 309
  GPUs, 333
  SIMD, 10
  Thread Block Scheduler, 294
  vector processor, 310, 310–311
  vector unit structure, 273
Conventional datacenters, vs. WSCs, 436
Convex Exemplar, L-61
Convex processors, vector processor history, G-26
Convolution, DSP, E-5
Convoy
  chained, DAXPY code, G-16
  DAXPY on VMIPS, G-20
  strip-mined loop, G-5
  vector execution time, 269–270
  vector starting times, G-4
Conway, Lynn, L-28
Cooling systems
  Google WSC, 465–468
  mechanical design, 448
  WSC infrastructure, 448–449
Copper wiring
  Ethernet, F-78
  interconnection networks, F-9
“Coprocessor operations,” MIPS core extensions, K-21
Copy propagation, definition, H-10 to H-11
Core, definition, 15
Core plus ASIC, embedded systems, E-3
Correlating branch predictors, branch costs, 162–163
Cosmic Cube, F-100, L-60
Cost
  Amazon EC2, 458
  Amazon Web Services, 457
  bisection bandwidth, F-89
  branch predictors, 162–167, C-26
  chip fabrication case study, 61–62
  cloud computing providers, 471–472
  disk storage, D-2
  DRAM/magnetic disk, D-3
  interconnecting node calculations, F-31 to F-32, F-35
  Internet Archive Cluster, D-38 to D-40
  internetworking, F-80


  I/O system design/evaluation, D-36
  magnetic storage history, L-78
  MapReduce calculations, 458–459, 459
  memory hierarchy design, 72
  MINs vs. direct networks, F-92
  multiprocessor cost relationship, 409
  multiprocessor linear speedup, 407
  network topology, F-40
  PMDs, 6
  server calculations, 454, 454–455
  server usage, 7
  SIMD supercomputer development, L-43
  speculation, 210
  torus topology interconnections, F-36 to F-38
  tournament predictors, 164–166
  WSC array switch, 443
  WSC vs. datacenters, 455–456
  WSC efficiency, 450–452
  WSC facilities, 472
  WSC network bottleneck, 461
  WSCs, 446–450, 452–455, 453
  WSCs vs. servers, 434
  WSC TCO case study, 476–478
Cost associativity, cloud computing, 460–461
Cost-performance
  commercial interconnection networks, F-63
  computer trends, 3
  extensive pipelining, C-80 to C-81
  IBM eServer p5 processor, 409
  sorting case study, D-64 to D-67
  WSC Flash memory, 474–475
  WSC goals/requirements, 433
  WSC hardware inactivity, 474
  WSC processors, 472–473
Cost trends
  integrated circuits, 28–32
  manufacturing vs. operation, 33
  overview, 27
  vs. price, 32–33
  time, volume, commoditization, 27–28
Count register, PowerPC instructions, K-32 to K-33
CP-67 program, L-10
CPA, see Carry-propagate adder (CPA)
CPI, see Clock cycles per instruction (CPI)
CPU, see Central processing unit (CPU)
CRAC, see Computer room air-conditioning (CRAC)
Cray, Seymour, G-25, G-27, L-44, L-47
Cray-1
  first vector computers, L-44 to L-45
  peak performance vs. start-up overhead, 331
  pipeline depths, G-4
  RISC history, L-19
  vector performance, 332
  vector performance measures, G-16
  as VMIPS basis, 264, 270–271, 276–277
Cray-2
  DRAM, G-25
  first vector computers, L-47
  tailgating, G-20
Cray-3, G-27
Cray-4, G-27
Cray C90
  first vector computers, L-46, L-48
  vector performance calculations, G-8
Cray J90, L-48
Cray Research T3D, F-86 to F-87, F-87
Cray supercomputers, early computer arithmetic, J-63 to J-64
Cray T3D, F-100, L-60
Cray T3E, F-67, F-94, F-100, L-48, L-60
Cray T90, memory bank calculations, 276
Cray X1
  cluster history, L-63
  first vector computers, L-46, L-48
  MSP module, G-22, G-23 to G-24
  overview, G-21 to G-23
  peak performance, 58
Cray X1E, F-86, F-91
  characteristics, G-24
Cray X2, L-46 to L-47
  first vector computers, L-48 to L-49
Cray X-MP, L-45
  first vector computers, L-47
Cray XT3, L-58, L-63
Cray XT3 SeaStar, F-63
Cray Y-MP
  first vector computers, L-45 to L-47
  parallel processing debates, L-57
  vector architecture programming, 281, 281–282
CRC, see Cyclic redundancy check (CRC)
Create vector index instruction (CVI), sparse matrices, G-13
Credit-based flow control
  InfiniBand, F-74
  interconnection networks, F-10, F-17
CRISP, L-27
Critical path
  global code scheduling, H-16
  trace scheduling, H-19 to H-21, H-20
Critical word first, cache optimization, 86–87
Crossbars
  centralized switched networks, F-30, F-31
  characteristics, F-73
  Convex Exemplar, L-61
  HOL blocking, F-59
  OCN history, F-104
  switch microarchitecture, F-62
  switch microarchitecture pipelining, F-60 to F-61, F-61
  VMIPS, 265
Crossbar switch
  centralized switched networks, F-30
  interconnecting node calculations, F-31 to F-32
Cross-company interoperability, commercial interconnection networks, F-63 to F-64
Crusoe, L-31
Cryptanalysis, L-4
CSA, see Carry-save adder (CSA); Carry-skip adder (CSA)
C# language, hardware impact on software development, 4
CSSU, see Compare-select-store unit (CSSU)


CUDA (Compute Unified Device Architecture)
  GPU computing history, L-52
  GPU conditional branching, 303
  GPUs vs. vector architectures, 310
  NVIDIA GPU programming, 289
  PTX, 298, 300
  sample program, 289–290
  SIMD instructions, 297
  terminology, 313–315
CUDA Thread
  CUDA programming model, 300, 315
  definition, 292, 313
  definitions and terms, 314
  GPU data addresses, 310
  GPU Memory structures, 304
  NVIDIA parallelism, 289–290
  vs. POSIX Threads, 297
  PTX Instructions, 298
  SIMD Instructions, 303
  Thread Block, 313
Current frame pointer (CFM), IA-64 register model, H-33 to H-34
Custom cluster
  characteristics, I-45
  IBM Blue Gene/L, I-41 to I-44, I-43 to I-44
Cut-through packet switching, F-51
  routing comparison, F-54
CVI, see Create vector index instruction (CVI)
CWS, see Circulating water system (CWS)
CYBER 180/990, precise exceptions, C-59
CYBER 205
  peak performance vs. start-up overhead, 331
  vector processor history, G-26 to G-27
CYBER 250, L-45
Cycles, processor performance equation, 49
Cycle time, see also Clock cycle time
  CPI calculations, 350
  pipelining, C-81
  scoreboarding, C-79
  vector processors, 277
Cyclic redundancy check (CRC)
  IBM Blue Gene/L 3D torus network, F-73
  network interface, F-8
Cydrome Cydra 6, L-30, L-32

D

DaCapo benchmarks
  ISA, 242
  SMT, 230–231, 231
DAMQs, see Dynamically allocatable multi-queues (DAMQs)
DASH multiprocessor, L-61
Database program speculation, via multiple branches, 211
Data cache
  ARM Cortex-A8, 236
  cache optimization, B-33, B-38
  cache performance, B-16
  GPU Memory, 306
  ISA, 241
  locality principle, B-60
  MIPS R4000 pipeline, C-62 to C-63
  multiprogramming, 374
  page level write-through, B-56
  RISC processor, C-7
  structural hazards, C-15
  TLB, B-46
Data cache miss
  applications vs. OS, B-59
  cache optimization, B-25
  Intel Core i7, 240
  Opteron, B-12 to B-15
  sizes and associativities, B-10
  writes, B-10
Data cache size, multiprogramming, 376–377
Datacenters
  CDF, 487
  containers, L-74
  cooling systems, 449
  layer 3 network example, 445
  PUE statistics, 451
  tier classifications, 491
  vs. WSC costs, 455–456
  WSC efficiency measurement, 450–452
  vs. WSCs, 436
Data dependences
  conditional instructions, H-24
  data hazards, 167–168
  dynamically scheduling with scoreboard, C-71
  example calculations, H-3 to H-4
  hazards, 153–154
  ILP, 150–152
  ILP hardware model, 214–215
  ILP limitation studies, 220
  vector execution time, 269
Data fetching
  ARM Cortex-A8, 234
  directory-based cache coherence protocol example, 382–383
  dynamically scheduled pipelines, C-70 to C-71
  ILP, instruction bandwidth
    basic considerations, 202–203
    branch-target buffers, 203–206
    return address predictors, 206–207
  MIPS R4000, C-63
  snooping coherence protocols, 355–356
Data flow
  control dependence, 154–156
  dynamic scheduling, 168
  global code scheduling, H-17
  ILP limitation studies, 220
  limit, L-33
Data flow execution, hardware-based speculation, 184
Datagrams, see Packets
Data hazards
  ARM Cortex-A8, 235
  basic considerations, C-16
  definition, C-11
  dependences, 152–154
  dynamic scheduling, 167–176
    basic concept, 168–170
    examples, 176–178
    Tomasulo’s algorithm, 170–176, 178–179
    Tomasulo’s algorithm loop-based example, 179–181
  ILP limitation studies, 220
  instruction set complications, C-50 to C-51
  microarchitectural techniques case study, 247–254
  MIPS pipeline, C-71
  RAW, C-57 to C-58


  stall minimization by forwarding, C-16 to C-19, C-18
  stall requirements, C-19 to C-21
  VMIPS, 264
Data-level parallelism (DLP)
  definition, 9
  GPUs
    basic considerations, 288
    basic PTX thread instructions, 299
    conditional branching, 300–303
    coprocessor relationship, 330–331
    Fermi GPU architecture innovations, 305–308
    Fermi GTX 480 floorplan, 295
    mapping examples, 293
    Multimedia SIMD comparison, 312
    multithreaded SIMD Processor block diagram, 294
    NVIDIA computational structures, 291–297
    NVIDIA/CUDA and AMD terminology, 313–315
    NVIDIA GPU ISA, 298–300
    NVIDIA GPU Memory structures, 304, 304–305
    programming, 288–291
    SIMD thread scheduling, 297
    terminology, 292
    vs. vector architectures, 308–312, 310
  from ILP, 4–5
  Multimedia SIMD Extensions
    basic considerations, 282–285
    programming, 285
    roofline visual performance model, 285–288, 287
  and power, 322
  vector architecture
    basic considerations, 264
    gather/scatter operations, 279–280
    multidimensional arrays, 278–279
    multiple lanes, 271–273
    peak performance vs. start-up overhead, 331
    programming, 280–282
    vector execution time, 268–271
    vector-length registers, 274–275
    vector load-store unit bandwidth, 276–277
    vector-mask registers, 275–276
    vector processor example, 267–268
    VMIPS, 264–267
  vector kernel implementation, 334–336
  vector performance and memory bandwidth, 332
  vector vs. scalar performance, 331–332
  WSCs vs. servers, 433–434
Data link layer
  definition, F-82
  interconnection networks, F-10
Data parallelism, SIMD computer history, L-55
Data-race-free, synchronized programs, 394
Data races, synchronized programs, 394
Data transfers
  cache miss rate calculations, B-16
  computer architecture, 15
  desktop RISC instructions, K-10, K-21
  embedded RISCs, K-14, K-23
  gather-scatter, 281, 291
  instruction operators, A-15
  Intel 80x86, K-49, K-53 to K-54
  ISA, 12–13
  MIPS, addressing modes, A-34
  MIPS64, K-24 to K-26
  MIPS64 instruction subset, A-40
  MIPS64 ISA formats, 14
  MIPS core extensions, K-20
  MIPS operations, A-36 to A-37
  MMX, 283
  multimedia instruction compiler support, A-31
  operands, A-12
  PTX, 305
  SIMD extensions, 284
  “typical” programs, A-43
  VAX, B-73
  vector vs. GPU, 300
Data trunks, MIPS scoreboarding, C-75
Data types
  architect-compiler writer relationship, A-30
  dependence analysis, H-10
  desktop computing, A-2
  Intel 80x86, K-50
  MIPS, A-34, A-36
  MIPS64 architecture, A-34
  multimedia compiler support, A-31
  operand types/sizes, A-14 to A-15
  SIMD Multimedia Extensions, 282–283
  SPARC, K-31
  VAX, K-66, K-70
Dauber, Phil, L-28
DAXPY loop
  chained convoys, G-16
  on enhanced VMIPS, G-19 to G-21
  memory bandwidth, 332
  MIPS/VMIPS calculations, 267–268
  peak performance vs. start-up overhead, 331
  vector performance measures, G-16
  VLRs, 274–275
  on VMIPS, G-19 to G-20
  VMIPS calculations, G-18
  VMIPS on Linpack, G-18
  VMIPS peak performance, G-17
D-caches
  case study examples, B-63
  way prediction, 81–82
DDR, see Double data rate (DDR)
Deadlock
  cache coherence, 361
  dimension-order routing, F-47 to F-48
  directory protocols, 386
  Intel SCCC, F-70
  large-scale multiprocessor cache coherence, I-34 to I-35, I-38 to I-40
  mesh network routing, F-46
  network routing, F-44
  routing comparison, F-54
  synchronization, 388
  system area network history, F-101
Deadlock avoidance
  meshes and hypercubes, F-47
  routing, F-44 to F-45


Deadlock recovery, routing, F-45Dead time

vector pipeline, G-8vector processor, G-8

Decimal operands, formats, A-14Decimal operations, PA-RISC

instructions, K-35Decision support system (DSS),

shared-memory workloads, 368–369, 369, 369–370

Decoder, radio receiver, E-23Decode stage, TI 320C55 DSP, E-7DEC PDP-11, address space, B-57 to

B-58DECstation 5000, reboot

measurements, F-69DEC VAX

addressing modes, A-10 to A-11, A-11, K-66 to K-68

address space, B-58architect-compiler writer

relationship, A-30branch conditions, A-19branches, A-18

jumps, procedure calls, K-71 to K-72

bubble sort, K-76characteristics, K-42cluster history, L-62, L-72compiler writing-architecture

relationship, A-30control flow instruction branches,

A-18data types, K-66early computer arithmetic, J-63 to

J-64early pipelined CPUs, L-26exceptions, C-44extensive pipelining, C-81failures, D-15flawless architecture design, A-45,

K-81high-level instruction set, A-41 to

A-43high-level language computer

architecture, L-18 to L-19history, 2–3immediate value distribution, A-13instruction classes, B-73instruction encoding, K-68 to

K-70, K-69

instruction execution issues, K-81instruction operator categories,

A-15instruction set complications, C-49

to C-50integer overflow, J-11vs. MIPS, K-82vs. MIPS32 sort, K-80vs. MIPS code, K-75miss rate vs. virtual addressing,

B-37operands, K-66 to K-68operand specifiers, K-68operands per ALU, A-6, A-8operand types/sizes, A-14operation count, K-70 to K-71operations, K-70 to K-72operators, A-15overview, K-65 to K-66precise exceptions, C-59replacement by RISC, 2RISC history, L-20 to L-21RISC instruction set lineage, K-43sort, K-76 to K-79sort code, K-77 to K-79sort register allocation, K-76swap, K-72 to K-76swap code, B-74, K-72, K-74swap full procedure, K-75 to K-76swap and register preservation,

B-74 to B-75unique instructions, K-28

DEC VAX-11/780, L-6 to L-7, L-11, L-18

DEC VAX 8700vs. MIPS M2000, K-82, L-21RISC history, L-21

Dedicated link networkblack box network, F-5 to F-6effective bandwidth, F-17example, F-6

Defect tolerance, chip fabrication cost case study, 61–62

Deferred addressing, VAX, K-67Delayed branch

basic scheme, C-23compiler history, L-31instructions, K-25stalls, C-65

Dell Poweredge servers, prices, 53Dell Poweredge Thunderbird, SAN

characteristics, F-76

Dell serverseconomies of scale, 456real-world considerations, 52–55WSC services, 441

Demodulator, radio receiver, E-23

Denormals, J-14 to J-16, J-20 to J-21
  floating-point additions, J-26 to J-27
  floating-point underflow, J-36

Dense matrix multiplication, LU kernel, I-8

Density-optimized processors, vs. SPEC-optimized, F-85

Dependability
  benchmark examples, D-21 to D-23, D-22
  definition, D-10 to D-11
  disk operators, D-13 to D-15
  integrated circuits, 33–36
  Internet Archive Cluster, D-38 to D-40
  memory systems, 104–105
  WSC goals/requirements, 433
  WSC memory, 473–474
  WSC storage, 442–443

Dependence analysis
  basic approach, H-5
  example calculations, H-7
  limitations, H-8 to H-9

Dependence distance, loop-carried dependences, H-6

Dependences
  antidependences, 152, 320, C-72, C-79
  CUDA, 290
  as data dependence, 150
  data hazards, 167–168
  definition, 152–153, 315–316
  dynamically scheduled pipelines, C-70 to C-71
  dynamic scheduling with scoreboard, C-71
  dynamic scheduling with Tomasulo’s algorithm, 172
  hardware-based speculation, 183
  hazards, 153–154
  ILP, 150–156
  ILP hardware model, 214–215
  ILP limitation studies, 220


I-20 ■ Index

Dependences (continued)
  loop-level parallelism, 318–322, H-3
    dependence analysis, H-6 to H-10
  MIPS scoreboarding, C-79
  as program properties, 152
  sparse matrices, G-13
  and Tomasulo’s algorithm, 170
  types, 150
  vector execution time, 269
  vector mask registers, 275–276
  VMIPS, 268

Dependent computations, elimination, H-10 to H-12

Descriptor privilege level (DPL), segmented virtual memory, B-53

Descriptor table, IA-32, B-52

Design faults, storage systems, D-11

Desktop computers
  characteristics, 6
  compiler structure, A-24
  as computer class, 5
  interconnection networks, F-85
  memory hierarchy basics, 78
  multimedia support, E-11
  multiprocessor importance, 344
  performance benchmarks, 38–40
  processor comparison, 242
  RAID history, L-80
  RISC systems
    addressing modes, K-5
    addressing modes and instruction formats, K-5 to K-6
    arithmetic/logical instructions, K-22
    conditional branches, K-17
    constant extension, K-9
    control instructions, K-12
    conventions, K-13
    data transfer instructions, K-10, K-21
    examples, K-3, K-4
    features, K-44
    FP instructions, K-13, K-23
    instruction formats, K-7
    multimedia extensions, K-16 to K-19, K-18
  system characteristics, E-4

Destination offset, IA-32 segment, B-53

Deterministic routing algorithm
  vs. adaptive routing, F-52 to F-55, F-54
  DOR, F-46

Dies
  embedded systems, E-15
  integrated circuits, 28–30, 29
  Nehalem floorplan, 30
  wafer example, 31, 31–32

Die yield, basic equation, 30–31

Digital Alpha
  branches, A-18
  conditional instructions, H-27
  early pipelined CPUs, L-27
  RISC history, L-21
  RISC instruction set lineage, K-43
  synchronization history, L-64

Digital Alpha 21064, L-48

Digital Alpha 21264
  cache hierarchy, 368
  floorplan, 143

Digital Alpha MAX
  characteristics, K-18
  multimedia support, K-18

Digital Alpha processors
  addressing modes, K-5
  arithmetic/logical instructions, K-11
  branches, K-21
  conditional branches, K-12, K-17
  constant extension, K-9
  control flow instruction branches, A-18
  conventions, K-13
  data transfer instructions, K-10
  displacement addressing mode, A-12
  exception stopping/restarting, C-47
  FP instructions, K-23
  immediate value distribution, A-13
  MAX, multimedia support, E-11
  MIPS precise exceptions, C-59
  multimedia support, K-19
  recent advances, L-33
  as RISC systems, K-4
  shared-memory workload, 367–369
  unique instructions, K-27 to K-29

Digital Linear Tape, L-77

Digital signal processor (DSP)
  cell phones, E-23, E-23 to E-24
  definition, E-3
  desktop multimedia support, E-11
  embedded RISC extensions, K-19
  examples and characteristics, E-6
  media extensions, E-10 to E-11
  overview, E-5 to E-7
  saturating operations, K-18 to K-19
  TI TMS320C6x, E-8 to E-10
  TI TMS320C6x instruction packet, E-10
  TI TMS320C55, E-6 to E-7, E-7 to E-8
  TI TMS320C64x, E-9

Dimension-order routing (DOR), definition, F-46

DIMMs, see Dual inline memory modules (DIMMs)

Direct attached disks, definition, D-35

Direct-mapped cache
  address parts, B-9
  address translation, B-38
  block placement, B-7
  early work, L-10
  memory hierarchy basics, 74
  memory hierarchy, B-48
  optimization, 79–80

Direct memory access (DMA)
  historical background, L-81
  InfiniBand, F-76
  network interface functions, F-7
  Sanyo VPC-SX500 digital camera, E-19
  Sony PlayStation 2 Emotion Engine, E-18
  TI TMS320C55 DSP, E-8
  zero-copy protocols, F-91

Direct networks
  commercial system topologies, F-37
  vs. high-dimensional networks, F-92
  vs. MIN costs, F-92
  topology, F-34 to F-40

Directory-based cache coherence
  advanced directory protocol case study, 420–426
  basic considerations, 378–380
  case study, 418–420
  definition, 354
  distributed-memory multiprocessor, 380


Directory-based cache coherence (continued)
  large-scale multiprocessor history, L-61
  latencies, 425
  protocol basics, 380–382
  protocol example, 382–386
  state transition diagram, 383

Directory-based multiprocessor
  characteristics, I-31
  performance, I-26
  scientific workloads, I-29
  synchronization, I-16, I-19 to I-20

Directory controller, cache coherence, I-40 to I-41

Dirty bit
  case study, D-61 to D-64
  definition, B-11
  virtual memory fast address translation, B-46

Dirty block
  definition, B-11
  read misses, B-36

Discrete cosine transform, DSP, E-5

Disk arrays
  deconstruction case study, D-51 to D-55, D-52 to D-55
  RAID 6, D-8 to D-9
  RAID 10, D-8
  RAID levels, D-6 to D-8, D-7

Disk layout, RAID performance prediction, D-57 to D-59

Disk power, basic considerations, D-5

Disk storage
  access time gap, D-3
  areal density, D-2 to D-5
  cylinders, D-5
  deconstruction case study, D-48 to D-51, D-50
  DRAM/magnetic disk cost vs. access time, D-3
  intelligent interfaces, D-4
  internal microprocessors, D-4
  real faults and failures, D-10 to D-11
  throughput vs. command queue depth, D-4

Disk technology
  failure rate calculation, 48
  Google WSC servers, 469
  performance trends, 19–20, 20
  WSC Flash memory, 474–475

Dispatch stage
  instruction steps, 174
  microarchitectural techniques case study, 247–254

Displacement addressing mode
  basic considerations, A-10
  MIPS, 12
  MIPS data transfers, A-34
  MIPS instruction format, A-35
  value distributions, A-12
  VAX, K-67

Display lists, Sony PlayStation 2 Emotion Engine, E-17

Distributed routing, basic concept, F-48

Distributed shared memory (DSM)
  basic considerations, 378–380
  basic structure, 347–348, 348
  characteristics, I-45
  directory-based cache coherence, 354, 380, 418–420
  multichip multicore multiprocessor, 419
  snooping coherence protocols, 355

Distributed shared-memory multiprocessors
  cache coherence implementation, I-36 to I-37
  scientific application performance, I-26 to I-32, I-28 to I-32

Distributed switched networks, topology, F-34 to F-40

Divide operations
  chip comparison, J-60 to J-61
  floating-point, stall, C-68
  floating-point iterative, J-27 to J-31
  integers, speedup
    radix-2 division, J-55
    radix-4 division, J-56
    radix-4 SRT division, J-57
    with single adder, J-54 to J-58
  integer shifting over zeros, J-45 to J-47
  language comparison, J-12
  n-bit unsigned integers, J-4
  PA-RISC instructions, K-34 to K-35
  radix-2, J-4 to J-7
  restoring/nonrestoring, J-6
  SRT division, J-45 to J-47, J-46
  unfinished instructions, 179

DLP, see Data-level parallelism (DLP)

DLX
  integer arithmetic, J-12
  vs. Intel 80x86 operations, K-62, K-63 to K-64

DMA, see Direct memory access (DMA)

DOR, see Dimension-order routing (DOR)

Double data rate (DDR)
  ARM Cortex-A8, 117
  DRAM performance, 100
  DRAMs and DIMMs, 101
  Google WSC servers, 468–469
  IBM Blue Gene/L, I-43
  InfiniBand, F-77
  Intel Core i7, 121
  SDRAMs, 101

Double data rate 2 (DDR2), SDRAM timing diagram, 139

Double data rate 3 (DDR3)
  DRAM internal organization, 98
  GDRAM, 102
  Intel Core i7, 118
  SDRAM power consumption, 102, 103

Double data rate 4 (DDR4), DRAM, 99

Double data rate 5 (DDR5), GDRAM, 102

Double-extended floating-point arithmetic, J-33 to J-34

Double failures, RAID reconstruction, D-55 to D-57

Double-precision floating point
  add-divide, C-68
  AVX for x86, 284
  chip comparison, J-58
  data access benchmarks, A-15
  DSP media extensions, E-10 to E-11
  Fermi GPU architecture, 306
  floating-point pipeline, C-65
  GTX 280, 325, 328–330
  IBM 360, 171
  MIPS, 285, A-38 to A-39
  MIPS data transfers, A-34
  MIPS registers, 12, A-34
  Multimedia SIMD vs. GPUs, 312
  operand sizes/types, 12
  as operand type, A-13 to A-14
  operand usage, 297
  pipeline timing, C-54


Double-precision (continued)
  Roofline model, 287, 326
  SIMD Extensions, 283
  VMIPS, 266, 266–267

Double rounding
  FP precisions, J-34
  FP underflow, J-37

Double words
  aligned/misaligned addresses, A-8
  data access benchmarks, A-15
  Intel 80x86, K-50
  memory address interpretation, A-7 to A-8
  MIPS data types, A-34
  operand types/sizes, 12, A-14
  stride, 278

DPL, see Descriptor privilege level (DPL)

DRAM, see Dynamic random-access memory (DRAM)

DRDRAM, Sony PlayStation 2, E-16 to E-17

Driver domains, Xen VM, 111

DSM, see Distributed shared memory (DSM)

DSP, see Digital signal processor (DSP)

DSS, see Decision support system (DSS)

Dual inline memory modules (DIMMs)
  clock rates, bandwidth, names, 101
  DRAM basics, 99
  Google WSC server, 467
  Google WSC servers, 468–469
  graphics memory, 322–323
  Intel Core i7, 118, 121
  Intel SCCC, F-70
  SDRAMs, 101
  WSC memory, 473–474

Dual SIMD Thread Scheduler, example, 305–306

DVFS, see Dynamic voltage-frequency scaling (DVFS)

Dynamically allocatable multi-queues (DAMQs), switch microarchitecture, F-56 to F-57

Dynamically scheduled pipelines
  basic considerations, C-70 to C-71
  with scoreboard, C-71 to C-80

Dynamically shared libraries, control flow instruction addressing modes, A-18

Dynamic energy, definition, 23

Dynamic network reconfiguration, fault tolerance, F-67 to F-68

Dynamic power
  energy efficiency, 211
  microprocessors, 23
  vs. static power, 26

Dynamic random-access memory (DRAM)
  bandwidth issues, 322–323
  characteristics, 98–100
  clock rates, bandwidth, names, 101
  cost vs. access time, D-3
  cost trends, 27
  Cray X1, G-22
  CUDA, 290
  dependability, 104
  disk storage, D-3 to D-4
  embedded benchmarks, E-13
  errors and faults, D-11
  first vector computers, L-45, L-47
  Flash memory, 103–104
  Google WSC servers, 468–469
  GPU SIMD instructions, 296
  IBM Blue Gene/L, I-43 to I-44
  improvement over time, 17
  integrated circuit costs, 28
  Intel Core i7, 121
  internal organization, 98
  magnetic storage history, L-78
  memory hierarchy design, 73, 73
  memory performance, 100–102
  multibanked caches, 86
  NVIDIA GPU Memory structures, 305
  performance milestones, 20
  power consumption, 63
  real-world server considerations, 52–55
  Roofline model, 286
  server energy savings, 25
  Sony PlayStation 2, E-16, E-17
  speed trends, 99
  technology trends, 17
  vector memory systems, G-9
  vector processor, G-25
  WSC efficiency measurement, 450
  WSC memory costs, 473–474
  WSC memory hierarchy, 444–445
  WSC power modes, 472
  yield, 32

Dynamic scheduling
  first use, L-27
  ILP
    basic concept, 168–169
    definition, 168
    example and algorithms, 176–178
    with multiple issue and speculation, 197–202
    overcoming data hazards, 167–176
    Tomasulo’s algorithm, 170–176, 178–179, 181–183
  MIPS scoreboarding, C-79
  SMT on superscalar processors, 230
  and unoptimized code, C-81

Dynamic voltage-frequency scaling (DVFS)
  energy efficiency, 25
  Google WSC, 467
  processor performance equation, 52

Dynamo (Amazon), 438, 452

E

Early restart, miss penalty reduction, 86

Earth Simulator, L-46, L-48, L-63

EBS, see Elastic Block Storage (EBS)

EC2, see Amazon Elastic Compute Cloud (EC2)

ECC, see Error-Correcting Code (ECC)

Eckert, J. Presper, L-2 to L-3, L-5, L-19

Eckert-Mauchly Computer Corporation, L-4 to L-5, L-56

ECL minicomputer, L-19

Economies of scale
  WSC vs. datacenter costs, 455–456
  WSCs, 434

EDSAC (Electronic Delay Storage Automatic Calculator), L-3

EDVAC (Electronic Discrete Variable Automatic Computer), L-2 to L-3


EEMBC, see Electronic Design News Embedded Microprocessor Benchmark Consortium (EEMBC)

EEPROM (Electronically Erasable Programmable Read-Only Memory)
  compiler-code size considerations, A-44
  Flash Memory, 102–104
  memory hierarchy design, 72

Effective address
  ALU, C-7, C-33
  data dependences, 152
  definition, A-9
  execution/effective address cycle, C-6, C-31 to C-32, C-63
  hardware-based speculation, 186, 190, 192
  load interlocks, C-39
  load-store, 174, 176, C-4
  RISC instruction set, C-4 to C-5
  simple MIPS implementation, C-31 to C-32
  simple RISC implementation, C-6
  TLB, B-49
  Tomasulo’s algorithm, 173, 178, 182

Effective bandwidth
  definition, F-13
  example calculations, F-18
  vs. interconnected nodes, F-28
  interconnection networks
    multi-device networks, F-25 to F-29
    two-device networks, F-12 to F-20
  vs. packet size, F-19

Efficiency factor, F-52

Eight-way set associativity
  ARM Cortex-A8, 114
  cache optimization, B-29
  conflict misses, B-23
  data cache misses, B-10

Elapsed time, execution time, 36

Elastic Block Storage (EBS), MapReduce cost calculations, 458–460, 459

Electronically Erasable Programmable Read-Only Memory, see EEPROM (Electronically Erasable Programmable Read-Only Memory)

Electronic Delay Storage Automatic Calculator (EDSAC), L-3

Electronic Design News Embedded Microprocessor Benchmark Consortium (EEMBC)
  benchmark classes, E-12
  ISA code size, A-44
  kernel suites, E-12
  performance benchmarks, 38
  power consumption and efficiency metrics, E-13

Electronic Discrete Variable Automatic Computer (EDVAC), L-2 to L-3

Electronic Numerical Integrator and Calculator (ENIAC), L-2 to L-3, L-5 to L-6, L-77

Element group, definition, 272

Embedded multiprocessors, characteristics, E-14 to E-15

Embedded systems
  benchmarks
    basic considerations, E-12
    power consumption and efficiency, E-13
  cell phone case study
    Nokia circuit board, E-24
    overview, E-20
    phone block diagram, E-23
    phone characteristics, E-22 to E-24
    radio receiver, E-23
    standards and evolution, E-25
    wireless networks, E-21 to E-22
  characteristics, 8–9, E-4
  as computer class, 5
  digital signal processors
    definition, E-3
    desktop multimedia support, E-11
    examples and characteristics, E-6
    media extensions, E-10 to E-11
    overview, E-5 to E-7
    TI TMS320C6x, E-8 to E-10
    TI TMS320C6x instruction packet, E-10
    TI TMS320C55, E-6 to E-7, E-7 to E-8
    TI TMS320C64x, E-9
  EEMBC benchmark suite, E-12
  overview, E-2
  performance, E-13 to E-14
  real-time processing, E-3 to E-5
  RISC systems
    addressing modes, K-6
    addressing modes and instruction formats, K-5 to K-6
    arithmetic/logical instructions, K-24
    conditional branches, K-17
    constant extension, K-9
    control instructions, K-16
    conventions, K-16
    data transfer instructions, K-14, K-23
    DSP extensions, K-19
    examples, K-3, K-4
    instruction formats, K-8
    multiply-accumulate, K-20
  Sanyo digital camera SOC, E-20
  Sanyo VPC-SX500 digital camera case study, E-19
  Sony PlayStation 2 block diagram, E-16
  Sony PlayStation 2 Emotion Engine case study, E-15 to E-18
  Sony PlayStation 2 Emotion Engine organization, E-18

EMC, L-80

Emotion Engine
  organization modes, E-18
  Sony PlayStation 2 case study, E-15 to E-18

empowerTel Networks, MXP processor, E-14

Encoding
  control flow instructions, A-18
  erasure encoding, 439
  instruction set, A-21 to A-24, A-22
  Intel 80x86 instructions, K-55, K-58


Encoding (continued)
  ISAs, 14, A-5 to A-6
  MIPS ISA, A-33
  MIPS pipeline, C-36
  opcode, A-13
  VAX instructions, K-68 to K-70, K-69
  VLIW model, 195–196

Encore Multimax, L-59

End-to-end flow control
  congestion management, F-65
  vs. network-only features, F-94 to F-95

Energy efficiency, see also Power consumption
  Climate Savers Computing Initiative, 462
  embedded benchmarks, E-13
  hardware fallacies, 56
  ILP exploitation, 201
  Intel Core i7, 401–405
  ISA, 241–243
  microprocessor, 23–26
  PMDs, 6
  processor performance equation, 52
  servers, 25
  and speculation, 211–212
  system trends, 21–23
  WSC, measurement, 450–452
  WSC goals/requirements, 433
  WSC infrastructure, 447–449
  WSC servers, 462–464

Energy proportionality, WSC servers, 462

Engineering Research Associates (ERA), L-4 to L-5

ENIAC (Electronic Numerical Integrator and Calculator), L-2 to L-3, L-5 to L-6, L-77

Enigma coding machine, L-4

Entry time, transactions, D-16, D-17

Environmental faults, storage systems, D-11

EPIC approach
  historical background, L-32
  IA-64, H-33
  VLIW processors, 194, 196

Equal condition code, PowerPC, K-10 to K-11

ERA, see Engineering Research Associates (ERA)

Erasure encoding, WSCs, 439

Error-Correcting Code (ECC)
  disk storage, D-11
  fault detection pitfalls, 58
  Fermi GPU architecture, 307
  hardware dependability, D-15
  memory dependability, 104
  RAID 2, D-6
  and WSCs, 473–474

Error handling, interconnection networks, F-12

Errors, definition, D-10 to D-11

Escape resource set, F-47

ETA processor, vector processor history, G-26 to G-27

Ethernet
  and bandwidth, F-78
  commercial interconnection networks, F-63
  cross-company interoperability, F-64
  interconnection networks, F-89
  as LAN, F-77 to F-79
  LAN history, F-99
  LANs, F-4
  packet format, F-75
  shared-media networks, F-23
  shared- vs. switched-media networks, F-22
  storage area network history, F-102
  switch vs. NIC, F-86
  system area networks, F-100
  total time statistics, F-90
  WAN history, F-98

Ethernet switches
  architecture considerations, 16
  Dell servers, 53
  Google WSC, 464–465, 469
  historical performance milestones, 20
  WSCs, 441–444

European Center for Particle Research (CERN), F-98

Even/odd array
  example, J-52
  integer multiplication, J-52

EVEN-ODD scheme, development, D-10

EX, see Execution address cycle (EX)

Example calculations
  average memory access time, B-16 to B-17
  barrier synchronization, I-15
  block size and average memory access time, B-26 to B-28
  branch predictors, 164
  branch schemes, C-25 to C-26
  branch-target buffer branch penalty, 205–206
  bundles, H-35 to H-36
  cache behavior impact, B-18, B-21
  cache hits, B-5
  cache misses, 83–84, 93–95
  cache organization impact, B-19 to B-20
  carry-lookahead adder, J-39
  chime approximation, G-2
  compiler-based speculation, H-29 to H-31
  conditional instructions, H-23 to H-24
  CPI and FP, 50–51
  credit-based control flow, F-10 to F-11
  crossbar switch interconnections, F-31 to F-32
  data dependences, H-3 to H-4
  DAXPY on VMIPS, G-18 to G-20
  dependence analysis, H-7 to H-8
  deterministic vs. adaptive routing, F-52 to F-55
  dies, 29
  die yield, 31
  dimension-order routing, F-47 to F-48
  disk subsystem failure rates, 48
  fault tolerance, F-68
  fetch-and-increment barrier, I-20 to I-21
  FFT, I-27 to I-29
  fixed-point arithmetic, E-5 to E-6
  floating-point addition, J-24 to J-25
  floating-point square root, 47–48
  GCD test, 319, H-7
  geometric means, 43–44
  hardware-based speculation, 200–201
  inclusion, 397
  information tables, 176–177
  integer multiplication, J-9
  interconnecting node costs, F-35
  interconnection network latency and effective bandwidth, F-26 to F-28


Example calculations (continued)
  I/O system utilization, D-26
  L1 cache speed, 80
  large-scale multiprocessor locks, I-20
  large-scale multiprocessor synchronization, I-12 to I-13
  loop-carried dependences, 316, H-4 to H-5
  loop-level parallelism, 317
  loop-level parallelism dependences, 320
  loop unrolling, 158–160
  MapReduce cost on EC2, 458–460
  memory banks, 276
  microprocessor dynamic energy/power, 23
  MIPS/VMIPS for DAXPY loop, 267–268
  miss penalty, B-33 to B-34
  miss rates, B-6, B-31 to B-32
  miss rates and cache sizes, B-29 to B-30
  miss support, 85
  M/M/1 model, D-33
  MTTF, 34–35
  multimedia instruction compiler support, A-31 to A-32
  multiplication algorithm, J-19
  network effective bandwidth, F-18
  network topologies, F-41 to F-43
  Ocean application, I-11 to I-12
  packet latency, F-14 to F-15
  parallel processing, 349–350, I-33 to I-34
  pipeline execution rate, C-10 to C-11
  pipeline structural hazards, C-14 to C-15
  power-performance benchmarks, 439–440
  predicated instructions, H-25
  processor performance comparison, 218–219
  queue I/O requests, D-29
  queue waiting time, D-28 to D-29
  queuing, D-31
  radix-4 SRT division, J-56
  redundant power supply reliability, 35
  ROB commit, 187
  ROB instructions, 189
  scoreboarding, C-77
  sequential consistency, 393
  server costs, 454–455
  server power, 463
  signed-digit numbers, J-53
  signed numbers, J-7
  SIMD multimedia instructions, 284–285
  single-precision numbers, J-15, J-17
  software pipelining, H-13 to H-14
  speedup, 47
  status tables, 178
  strides, 279
  TB-80 cluster MTTF, D-41
  TB-80 IOPS, D-39 to D-40
  torus topology interconnections, F-36 to F-38
  true sharing misses and false sharing, 366–367
  VAX instructions, K-67
  vector memory systems, G-9
  vector performance, G-8
  vector vs. scalar operation, G-19
  vector sequence chimes, 270
  VLIW processors, 195
  VMIPS vector operation, G-6 to G-7
  way selection, 82
  write buffer and read misses, B-35 to B-36
  write vs. no-write allocate, B-12
  WSC memory latency, 445
  WSC running service availability, 434–435
  WSC server data transfer, 446

Exceptions
  ALU instructions, C-4
  architecture-specific examples, C-44
  categories, C-46
  control dependence, 154–155
  floating-point arithmetic, J-34 to J-35
  hardware-based speculation, 190
  imprecise, 169–170, 188
  long latency pipelines, C-55
  MIPS, C-48, C-48 to C-49
  out-of-order completion, 169–170
  precise, C-47, C-58 to C-60
  preservation via hardware support, H-28 to H-32
  return address buffer, 207
  ROB instructions, 190
  speculative execution, 222
  stopping/restarting, C-46 to C-47
  types and requirements, C-43 to C-46

Execute step
  instruction steps, 174
  Itanium 2, H-42
  ROB instruction, 186
  TI 320C55 DSP, E-7

Execution address cycle (EX)
  basic MIPS pipeline, C-36
  data hazards requiring stalls, C-21
  data hazard stall minimization, C-17
  exception stopping/restarting, C-46 to C-47
  hazards and forwarding, C-56 to C-57
  MIPS FP operations, basic considerations, C-51 to C-53
  MIPS pipeline, C-52
  MIPS pipeline control, C-36 to C-39
  MIPS R4000, C-63 to C-64, C-64
  MIPS scoreboarding, C-72, C-74, C-77
  out-of-order execution, C-71
  pipeline branch issues, C-40, C-42
  RISC classic pipeline, C-10
  simple MIPS implementation, C-31 to C-32
  simple RISC implementation, C-6

Execution time
  Amdahl’s law, 46–47, 406
  application/OS misses, B-59
  cache performance, B-3 to B-4, B-16
  calculation, 36
  commercial workloads, 369–370, 370
  energy efficiency, 211
  integrated circuits, 22
  loop unrolling, 160
  multilevel caches, B-32 to B-34
  multiprocessor performance, 405–406
  multiprogrammed parallel “make” workload, 375
  multithreading, 232


Execution time (continued)
  performance equations, B-22
  pipelining performance, C-3, C-10 to C-11
  PMDs, 6
  principle of locality, 45
  processor comparisons, 243
  processor performance equation, 49, 51
  reduction, B-19
  second-level cache size, B-34
  SPEC benchmarks, 42–44, 43, 56
  and stall time, B-21
  vector length, G-7
  vector mask registers, 276
  vector operations, 268–271

Expand-down field, B-53

Explicit operands, ISA classifications, A-3 to A-4

Explicit parallelism, IA-64, H-34 to H-35

Explicit unit stride, GPUs vs. vector architectures, 310

Exponential back-off
  large-scale multiprocessor synchronization, I-17
  spin lock, I-17

Exponential distribution, definition, D-27

Extended accumulator
  flawed architectures, A-44
  ISA classification, A-3

F

Facebook, 460

Failures, see also Mean time between failures (MTBF); Mean time to failure (MTTF)
  Amdahl’s law, 56
  Berkeley’s Tertiary Disk project, D-12
  cloud computing, 455
  definition, D-10
  dependability, 33–35
  dirty bits, D-61 to D-64
  DRAM, 473
  example calculation, 48
  Google WSC networking, 469–470
  power failure, C-43 to C-44, C-46
  power utilities, 435
  RAID reconstruction, D-55 to D-57
  RAID row-diagonal parity, D-9
  rate calculations, 48
  servers, 7, 434
  SLA states, 34
  storage system components, D-43
  storage systems, D-6 to D-10
  TDP, 22
  Tertiary Disk, D-13
  WSC running service, 434–435
  WSCs, 8, 438–439
  WSC storage, 442–443

False sharing
  definition, 366–367
  shared-memory workload, 373

FarmVille, 460

Fast Fourier transformation (FFT)
  characteristics, I-7
  distributed-memory multiprocessor, I-32
  example calculations, I-27 to I-29
  symmetric shared-memory multiprocessors, I-22, I-23, I-25

Fast traps, SPARC instructions, K-30

Fat trees
  definition, F-34
  NEWS communication, F-43
  routing algorithms, F-48
  SAN characteristics, F-76
  topology, F-38 to F-39
  torus topology interconnections, F-36 to F-38

Fault detection, pitfalls, 57–58

Fault-induced deadlock, routing, F-44

Faulting prefetches, cache optimization, 92

Faults, see also Exceptions; Page faults
  address fault, B-42
  definition, D-10
  and dependability, 33
  dependability benchmarks, D-21
  programming mistakes, D-11
  storage systems, D-6 to D-10
  Tandem Computers, D-12 to D-13
  VAX systems, C-44

Fault tolerance
  and adaptive routing, F-94
  commercial interconnection networks, F-66 to F-69
  DECstation 5000 reboots, F-69
  dependability benchmarks, D-21
  RAID, D-7
  SAN example, F-74
  WSC memory, 473–474
  WSC network, 461

Fault-tolerant routing, commercial interconnection networks, F-66 to F-67

FC, see Fibre Channel (FC)

FC-AL, see Fibre Channel Arbitrated Loop (FC-AL)

FC-SW, see Fibre Channel Switched (FC-SW)

Feature size
  dependability, 33
  integrated circuits, 19–21

FEC, see Forward error correction (FEC)

Federal Communications Commission (FCC), telephone company outages, D-15

Fermi GPU
  architectural innovations, 305–308
  future features, 333
  Grid mapping, 293
  multithreaded SIMD Processor, 307
  NVIDIA, 291, 305
  SIMD, 296–297
  SIMD Thread Scheduler, 306

Fermi Tesla, GPU computing history, L-52

Fermi Tesla GTX 280
  GPU comparison, 324–325, 325
  memory bandwidth, 328
  raw/relative GPU performance, 328
  synchronization, 329
  weaknesses, 330

Fermi Tesla GTX 480
  floorplan, 295
  GPU comparisons, 323–330, 325

Fetch-and-increment
  large-scale multiprocessor synchronization, I-20 to I-21
  sense-reversing barrier, I-21
  synchronization, 388

Fetching, see Data fetching

Fetch stage, TI 320C55 DSP, E-7

FFT, see Fast Fourier transformation (FFT)

Fibre Channel (FC), F-64, F-67, F-102


  file system benchmarking, D-20
  NetApp FAS6000 filer, D-42

Fibre Channel Arbitrated Loop (FC-AL), F-102
  block servers vs. filers, D-35
  SCSI history, L-81

Fibre Channel Switched (FC-SW), F-102

Field-programmable gate arrays (FPGAs), WSC array switch, 443

FIFO, see First-in first-out (FIFO)

Filers
  vs. block servers, D-34 to D-35
  NetApp FAS6000 filer, D-41 to D-42

File servers, SPEC benchmarking, D-20 to D-21

Filters, radio receiver, E-23

Fine-grained multithreading
  definition, 224–226
  Sun T1 effectiveness, 226–229

Fingerprint, storage system, D-49

Finite-state machine, routing implementation, F-57

Firmware, network interfaces, F-7

First-in first-out (FIFO)
  block replacement, B-9
  cache misses, B-10
  definition, D-26
  Tomasulo’s algorithm, 173

First-level caches, see also L1 caches
  ARM Cortex-A8, 114
  cache optimization, B-30 to B-32
  hit time/power reduction, 79–80
  inclusion, B-35
  interconnection network, F-87
  Itanium 2, H-41
  memory hierarchy, B-48 to B-49
  miss rate calculations, B-31 to B-35
  parameter ranges, B-42
  technology trends, 18
  virtual memory, B-42

First-reference misses, definition, B-23

FIT rates, WSC memory, 473–474

Fixed-field decoding, simple RISC implementation, C-6

Fixed-length encoding
  general-purpose registers, A-6
  instruction sets, A-22
  ISAs, 14

Fixed-length vector
  SIMD, 284
  vector registers, 264

Fixed-point arithmetic, DSP, E-5 to E-6

Flags
  performance benchmarks, 37
  performance reporting, 41
  scoreboarding, C-75

Flash memory
  characteristics, 102–104
  dependability, 104
  disk storage, D-3 to D-4
  embedded benchmarks, E-13
  memory hierarchy design, 72
  technology trends, 18
  WSC cost-performance, 474–475

FLASH multiprocessor, L-61

Flexible chaining
  vector execution time, 269
  vector processor, G-11

Floating-point (FP) operations
  addition
    denormals, J-26 to J-27
    overview, J-21 to J-25
    rules, J-24
    speedup, J-25 to J-26
  arithmetic intensity, 285–288, 286
  branch condition evaluation, A-19
  branches, A-20
  cache misses, 83–84
  chip comparison, J-58
  control flow instructions, A-21
  CPI calculations, 50–51
  data access benchmarks, A-15
  data dependences, 151
  data hazards, 169
  denormal multiplication, J-20 to J-21
  denormals, J-14 to J-15
  desktop RISCs, K-13, K-17, K-23
  DSP media extensions, E-10 to E-11
  dynamic scheduling with Tomasulo’s algorithm, 171–172, 173
  early computer arithmetic, J-64 to J-65
  exceptions, J-34 to J-35
  exception stopping/restarting, C-47
  fused multiply-add, J-32 to J-33
  IBM 360, K-85
  IEEE 754 FP standard, J-16
  ILP exploitation, 197–199
  ILP exposure, 157–158
  ILP in perfect processor, 215
  ILP for realizable processors, 216–218
  independent, C-54
  instruction operator categories, A-15
  integer conversions, J-62
  Intel Core i7, 240, 241
  Intel 80x86, K-52 to K-55, K-54, K-61
  Intel 80x86 registers, K-48
  ISA performance and efficiency prediction, 241
  Itanium 2, H-41
  iterative division, J-27 to J-31
  latencies, 157
  and memory bandwidth, J-62
  MIPS, A-38 to A-39
    Tomasulo’s algorithm, 173
  MIPS exceptions, C-49
  MIPS operations, A-35
  MIPS pipeline, C-52
    basic considerations, C-51 to C-54
    execution, C-71
    performance, C-60 to C-61, C-61
    scoreboarding, C-72
    stalls, C-62
  MIPS precise exceptions, C-58 to C-60
  MIPS R4000, C-65 to C-67, C-66 to C-67
  MIPS scoreboarding, C-77
  MIPS with scoreboard, C-73
  misspeculation instructions, 212
  Multimedia SIMD Extensions, 285
  multimedia support, K-19
  multiple lane vector unit, 273
  multiple outstanding, C-54
  multiplication
    examples, J-19
    overview, J-17 to J-20
  multiplication precision, J-21
  number representation, J-15 to J-16
  operand sizes/types, 12
  overflow, J-11
  overview, J-13 to J-14
  parallelism vs. window size, 217


Floating-point operations (continued)
  pipeline hazards and forwarding, C-55 to C-57
  pipeline structural hazards, C-16
  precisions, J-33 to J-34
  remainder, J-31 to J-32
  ROB commit, 187
  SMT, 398–400
  SPARC, K-31
  SPEC benchmarks, 39
  special values, J-14 to J-15
  stalls from RAW hazards, C-55
  static branch prediction, C-26 to C-27
  Tomasulo’s algorithm, 185
  underflow, J-36 to J-37, J-62
  VAX, B-73
  vector chaining, G-11
  vector sequence chimes, 270
  VLIW processors, 195
  VMIPS, 264

Floating-point registers (FPRs)
  IA-64, H-34
  IBM Blue Gene/L, I-42
  MIPS data transfers, A-34
  MIPS operations, A-36
  MIPS64 architecture, A-34
  write-back, C-56

Floating-point square root (FPSQR)
  calculation, 47–48
  CPI calculations, 50–51

Floating Point Systems AP-120B, L-28

Floppy disks, L-78

Flow-balanced state, D-23

Flow control
  and arbitration, F-21
  congestion management, F-65
  direct networks, F-38 to F-39
  format, F-58
  interconnection networks, F-10 to F-11
  system area network history, F-100 to F-101

Fluent, F-76, F-77

Flush, branch penalty reduction, C-22

FM, see Frequency modulation (FM)

Form factor, interconnection networks, F-9 to F-12

FORTRAN
  compiler types and classes, A-28
  compiler vectorization, G-14, G-15
  dependence analysis, H-6
  integer division/remainder, J-12
  loop-level parallelism dependences, 320–321
  MIPS scoreboarding, C-77
  performance measurement history, L-6
  return address predictors, 206

Forward error correction (FEC), DSP, E-5 to E-7

Forwarding, see also Bypassing
  ALUs, C-40 to C-41
  data hazard stall minimization, C-16 to C-19, C-18
  dynamically scheduled pipelines, C-70 to C-71
  load instruction, C-20
  longer latency pipelines, C-54 to C-58
  operand, C-19

Forwarding table
  routing implementation, F-57
  switch microarchitecture pipelining, F-60
Forward path, cell phones, E-24
Fourier-Motzkin algorithm, L-31
Fourier transform, DSP, E-5
Four-way conflict misses, definition, B-23
FP, see Floating-point (FP) operations
FPGAs, see Field-programmable gate arrays (FPGAs)
FPRs, see Floating-point registers (FPRs)
FPSQR, see Floating-point square root (FPSQR)
Frame pointer, VAX, K-71
Freeze, branch penalty reduction, C-22
Frequency modulation (FM), wireless networks, E-21
Front-end stage, Itanium 2, H-42
FU, see Functional unit (FU)
Fujitsu Primergy BX3000 blade server, F-85
Fujitsu VP100, L-45, L-47
Fujitsu VP200, L-45, L-47
Full access
  dimension-order routing, F-47 to F-48
  interconnection network topology, F-29

Full adders, J-2, J-3
Fully associative cache
  block placement, B-7
  conflict misses, B-23
  direct-mapped cache, B-9
  memory hierarchy basics, 74
Fully connected topology
  distributed switched networks, F-34
  NEWS communication, F-43

Functional hazards
  ARM Cortex-A8, 233
  microarchitectural techniques case study, 247–254
Functional unit (FU)
  FP operations, C-66
  instruction execution example, C-80
  Intel Core i7, 237
  Itanium 2, H-41 to H-43
  latencies, C-53
  MIPS pipeline, C-52
  MIPS scoreboarding, C-75 to C-80
  OCNs, F-3
  vector add instruction, 272, 272–273
  VMIPS, 264

Function calls
  GPU programming, 289
  NVIDIA GPU Memory structures, 304–305
  PTX assembler, 301

Function pointers, control flow instruction addressing modes, A-18

Fused multiply-add, floating point, J-32 to J-33

Future file, precise exceptions, C-59

G

Gateways, Ethernet, F-79
Gather-Scatter
  definition, 309
  GPU comparisons, 329
  multimedia instruction compiler support, A-31
  sparse matrices, G-13 to G-14
  vector architectures, 279–280

GCD, see Greatest common divisor (GCD) test

GDDR, see Graphics double data rate (GDDR)


Index ■ I-29

GDRAM, see Graphics dynamic random-access memory (GDRAM)

GE 645, L-9
General-Purpose Computing on GPUs (GPGPU), L-51 to L-52
General-purpose electronic computers, historical background, L-2 to L-4
General-purpose registers (GPRs)
  advantages/disadvantages, A-6
  IA-64, H-38
  Intel 80x86, K-48
  ISA classification, A-3 to A-5
  MIPS data transfers, A-34
  MIPS operations, A-36
  MIPS64, A-34
  VMIPS, 265

GENI, see Global Environment for Network Innovation (GENI)

Geometric means, example calculations, 43–44

GFS, see Google File System (GFS)
Gibson mix, L-6
Giga Thread Engine, definition, 292, 314
Global address space, segmented virtual memory, B-52
Global code scheduling
  example, H-16
  parallelism, H-15 to H-23
  superblock scheduling, H-21 to H-23, H-22
  trace scheduling, H-19 to H-21, H-20
Global common subexpression elimination, compiler structure, A-26

Global data area, and compiler technology, A-27

Global Environment for Network Innovation (GENI), F-98

Global load/store, definition, 309
Global Memory
  definition, 292, 314
  GPU programming, 290
  locks via coherence, 390

Global miss rate
  definition, B-31
  multilevel caches, B-33
Global optimizations
  compilers, A-26, A-29
  optimization types, A-28
Global Positioning System, CDMA, E-25
Global predictors
  Intel Core i7, 166
  tournament predictors, 164–166

Global scheduling, ILP, VLIW processor, 194

Global system for mobile communication (GSM), cell phones, E-25

Goldschmidt’s division algorithm, J-29, J-61

Goldstine, Herman, L-2 to L-3
Google
  Bigtable, 438, 441
  cloud computing, 455
  cluster history, L-62
  containers, L-74
  MapReduce, 437, 458–459, 459
  server CPUs, 440
  server power-performance benchmarks, 439–441
  WSCs, 432, 449
    containers, 464–465, 465
    cooling and power, 465–468
    monitoring and repairing, 469–470
    PUE, 468
    servers, 467, 468–469

Google App Engine, L-74
Google Clusters
  memory dependability, 104
  power consumption, F-85
Google File System (GFS)
  MapReduce, 438
  WSC storage, 442–443
Google Goggles
  PMDs, 6
  user experience, 4
Google search
  shared-memory workloads, 369
  workload demands, 439

Gordon Bell Prize, L-57
GPGPU (General-Purpose Computing on GPUs), L-51 to L-52
GPRs, see General-purpose registers (GPRs)
GPU (Graphics Processing Unit)
  banked and graphics memory, 322–323
  computing history, L-52
  definition, 9
  DLP
    basic considerations, 288
    basic PTX thread instructions, 299
    conditional branching, 300–303
    coprocessor relationship, 330–331
    definitions, 309
    Fermi GPU architecture innovations, 305–308
    Fermi GTX 480 floorplan, 295
    GPUs vs. vector architectures, 308–312, 310
    mapping examples, 293
    Multimedia SIMD comparison, 312
    multithreaded SIMD Processor block diagram, 294
    NVIDIA computational structures, 291–297
    NVIDIA/CUDA and AMD terminology, 313–315
    NVIDIA GPU ISA, 298–300
    NVIDIA GPU Memory structures, 304, 304–305
    programming, 288–291
    SIMD thread scheduling, 297
    terminology, 292
  fine-grained multithreading, 224
  future features, 332
  gather/scatter operations, 280
  historical background, L-50
  loop-level parallelism, 150
  vs. MIMD with Multimedia SIMD, 324–330
  mobile client/server features, 324, 324
  power/DLP issues, 322
  raw/relative performance, 328
  Roofline model, 326
  scalable, L-50 to L-51
  strided access-TLB interactions, 323
  thread count and memory performance, 332
  TLP, 346
  vector kernel implementation, 334–336
  vs. vector processor operation, 276


I-30 ■ Index

GPU Memory
  caches, 306
  CUDA program, 289
  definition, 292, 309, 314
  future architectures, 333
  GPU programming, 288
  NVIDIA, 304, 304–305
  splitting from main memory, 330
Gradual underflow, J-15, J-36
Grain size
  MIMD, 10
  TLP, 346
Grant phase, arbitration, F-49
Graph coloring, register allocation, A-26 to A-27
Graphics double data rate (GDDR)
  characteristics, 102
  Fermi GTX 480 GPU, 295, 324
Graphics dynamic random-access memory (GDRAM)
  bandwidth issues, 322–323
  characteristics, 102

Graphics-intensive benchmarks, desktop performance, 38

Graphics pipelines, historical background, L-51

Graphics Processing Unit, see GPU (Graphics Processing Unit)

Graphics synchronous dynamic random-access memory (GSDRAM), characteristics, 102

Graphics Synthesizer, Sony PlayStation 2, E-16, E-16 to E-17

Greater than condition code, PowerPC, K-10 to K-11

Greatest common divisor (GCD) test, loop-level parallelism dependences, 319, H-7

Grid
  arithmetic intensity, 286
  CUDA parallelism, 290
  definition, 292, 309, 313
  and GPU, 291
  GPU Memory structures, 304
  GPU terms, 308
  mapping example, 293
  NVIDIA GPU computational structures, 291
  SIMD Processors, 295
  Thread Blocks, 295

Grid computing, L-73 to L-74
Grid topology
  characteristics, F-36
  direct networks, F-37

GSDRAM, see Graphics synchronous dynamic random-access memory (GSDRAM)

GSM, see Global system for mobile communication (GSM)

Guest definition, 108
Guest domains, Xen VM, 111

H

Hadoop, WSC batch processing, 437
Half adders, J-2
Half words
  aligned/misaligned addresses, A-8
  memory address interpretation, A-7 to A-8
  MIPS data types, A-34
  operand sizes/types, 12
  as operand type, A-13 to A-14

Handshaking, interconnection networks, F-10

Hard drive, power consumption, 63
Hard real-time systems, definition, E-3 to E-4
Hardware
  as architecture component, 15
  cache optimization, 96
  compiler scheduling support, L-30 to L-31
  compiler speculation support
    memory references, H-32
    overview, H-27
    preserving exception behavior, H-28 to H-32
  description notation, K-25
  energy/performance fallacies, 56
  for exposing parallelism, H-23 to H-27
  ILP approaches, 148, 214–215
  interconnection networks, F-9
  pipeline hazard detection, C-38
  Virtual Machines protection, 108
  WSC cost-performance, 474
  WSC running service, 434–435

Hardware-based speculation
  basic algorithm, 191
  data flow execution, 184
  FP unit using Tomasulo’s algorithm, 185
  ILP
    data flow execution, 184
    with dynamic scheduling and multiple issue, 197–202
    FP unit using Tomasulo’s algorithm, 185
    key ideas, 183–184
    multiple-issue processors, 198
    reorder buffer, 184–192
    vs. software speculation, 221–222
  key ideas, 183–184

Hardware faults, storage systems, D-11

Hardware prefetching
  cache optimization, 131–133
  miss penalty/rate reduction, 91–92
  NVIDIA GPU Memory structures, 305
  SPEC benchmarks, 92

Hardware primitives
  basic types, 387–389
  large-scale multiprocessor synchronization, I-18 to I-21
  synchronization mechanisms, 387–389

Harvard architecture, L-4
Hazards, see also Data hazards
  branch hazards, C-21 to C-26, C-39 to C-42, C-42
  control hazards, 235, C-11
  detection, hardware, C-38
  dynamically scheduled pipelines, C-70 to C-71
  execution sequences, C-80
  functional hazards, 233, 247–254
  instruction set complications, C-50
  longer latency pipelines, C-54 to C-58
  structural hazards, 268–269, C-11, C-13 to C-16, C-71, C-78 to C-79

HCAs, see Host channel adapters (HCAs)

Header
  messages, F-6
  packet format, F-7


Index ■ I-31

  switch microarchitecture pipelining, F-60
  TCP/IP, F-84
Head-of-line (HOL) blocking
  congestion management, F-64
  switch microarchitecture, F-58 to F-59, F-59, F-60, F-62
  system area network history, F-101
  virtual channels and throughput, F-93
Heap, and compiler technology, A-27 to A-28
HEP processor, L-34
Heterogeneous architecture, definition, 262
Hewlett-Packard AlphaServer, F-100
Hewlett-Packard PA-RISC
  addressing modes, K-5
  arithmetic/logical instructions, K-11
  characteristics, K-4
  conditional branches, K-12, K-17, K-34
  constant extension, K-9
  conventions, K-13
  data transfer instructions, K-10
  EPIC, L-32
  features, K-44
  floating-point precisions, J-33
  FP instructions, K-23
  MIPS core extensions, K-23
  multimedia support, K-18, K-18, K-19
  unique instructions, K-33 to K-36

Hewlett-Packard PA-RISC MAX2, multimedia support, E-11

Hewlett-Packard Precision Architecture, integer arithmetic, J-12

Hewlett-Packard ProLiant BL10e G2 Blade server, F-85

Hewlett-Packard ProLiant SL2x170z G6, SPECPower benchmarks, 463

Hewlett-Packard RISC microprocessors, vector processor history, G-26

Higher-radix division, J-54 to J-55
Higher-radix multiplication, integer, J-48

High-level language computer architecture (HLLCA), L-18 to L-19

High-level optimizations, compilers, A-26

Highly parallel memory systems, case studies, 133–136

High-order functions, control flow instruction addressing modes, A-18

High-performance computing (HPC)
  InfiniBand, F-74
  interconnection network characteristics, F-20
  interconnection network topology, F-44
  storage area network history, F-102
  switch microarchitecture, F-56
  vector processor history, G-27
  write strategy, B-10
  vs. WSCs, 432, 435–436

Hillis, Danny, L-58, L-74
Histogram, D-26 to D-27
History file, precise exceptions, C-59
Hitachi S810, L-45, L-47
Hitachi SuperH
  addressing modes, K-5, K-6
  arithmetic/logical instructions, K-24
  branches, K-21
  characteristics, K-4
  condition codes, K-14
  data transfer instructions, K-23
  embedded instruction format, K-8
  multiply-accumulate, K-20
  unique instructions, K-38 to K-39

Hit time
  average memory access time, B-16 to B-17
  first-level caches, 79–80
  memory hierarchy basics, 77–78
  reduction, 78, B-36 to B-40
  way prediction, 81–82

HLLCA, see High-level language computer architecture (HLLCA)

HOL, see Head-of-line blocking (HOL)

Home node, directory-based cache coherence protocol basics, 382

Hop count, definition, F-30

Hops
  direct network topologies, F-38
  routing, F-44
  switched network topologies, F-40
  switching, F-50

Host channel adapters (HCAs)
  historical background, L-81
  switch vs. NIC, F-86

Host definition, 108, 305
Hot swapping, fault tolerance, F-67
HPC, see High-performance computing (HPC)
HPC Challenge, vector processor history, G-28
HP-Compaq servers
  price-performance differences, 441
  SMT, 230

HPSm, L-29
Hypercube networks
  characteristics, F-36
  deadlock, F-47
  direct networks, F-37
  vs. direct networks, F-92
  NEWS communication, F-43

HyperTransport, F-63
  AMD Opteron cache coherence, 361
  NetApp FAS6000 filer, D-42

Hypervisor, characteristics, 108

I

IAS machine, L-3, L-5 to L-6
IBM
  Chipkill, 104
  cluster history, L-62, L-72
  computer history, L-5 to L-6
  early VM work, L-10
  magnetic storage, L-77 to L-78
  multiple-issue processor development, L-28
  RAID history, L-79 to L-80

IBM 360
  address space, B-58
  architecture, K-83 to K-84
  architecture flaws and success, K-81
  branch instructions, K-86
  characteristics, K-42
  computer architecture definition, L-17 to L-18
  instruction execution frequencies, K-89


I-32 ■ Index

IBM 360 (continued)
  instruction operator categories, A-15
  instruction set, K-85 to K-88
  instruction set complications, C-49 to C-50
  integer/FP R-R operations, K-85
  I/O bus history, L-81
  memory hierarchy development, L-9 to L-10
  parallel processing debates, L-57
  protection and ISA, 112
  R-R instructions, K-86
  RS and SI format instructions, K-87
  RX format instructions, K-86 to K-87
  SS format instructions, K-85 to K-88
IBM 360/85, L-10 to L-11, L-27
IBM 360/91
  dynamic scheduling with Tomasulo’s algorithm, 170–171
  early computer arithmetic, J-63
  history, L-27
  speculation concept origins, L-29

IBM 370
  architecture, K-83 to K-84
  characteristics, K-42
  early computer arithmetic, J-63
  integer overflow, J-11
  protection and ISA, 112
  vector processor history, G-27
  Virtual Machines, 110

IBM 370/158, L-7
IBM 650, L-6
IBM 701, L-5 to L-6
IBM 702, L-5 to L-6
IBM 704, L-6, L-26
IBM 705, L-6
IBM 801, L-19
IBM 3081, L-61
IBM 3090 Vector Facility, vector processor history, G-27
IBM 3840 cartridge, L-77
IBM 7030, L-26
IBM 9840 cartridge, L-77
IBM AS/400, L-79
IBM Blue Gene/L, F-4
  adaptive routing, F-93
  cluster history, L-63
  commercial interconnection networks, F-63
  computing node, I-42 to I-44, I-43
  as custom cluster, I-41 to I-42
  deterministic vs. adaptive routing, F-52 to F-55
  fault tolerance, F-66 to F-67
  link bandwidth, F-89
  low-dimensional topologies, F-100
  parallel processing debates, L-58
  software overhead, F-91
  switch microarchitecture, F-62
  system, I-44
  system area network history, F-101 to F-102
  3D torus network, F-72 to F-74
  topology, F-30, F-39

IBM CodePack, RISC code size, A-23
IBM CoreConnect
  cross-company interoperability, F-64
  OCNs, F-3
IBM eServer p5 processor
  performance/cost benchmarks, 409
  SMT and ST performance, 399
  speedup benchmarks, 408, 408–409
IBM Federation network interfaces, F-17 to F-18
IBM J9 JVM
  real-world server considerations, 52–55
  WSC performance, 463
IBM PCs, architecture flaws vs. success, A-45
IBM Power processors
  branch-prediction buffers, C-29
  characteristics, 247
  exception stopping/restarting, C-47
  MIPS precise exceptions, C-59
  shared-memory multiprogramming workload, 378
IBM Power 1, L-29
IBM Power 2, L-29
IBM Power 4
  multithreading history, L-35
  peak performance, 58
  recent advances, L-33 to L-34

IBM Power 5
  characteristics, F-73
  Itanium 2 comparison, H-43
  manufacturing cost, 62
  multiprocessing/multithreading-based performance, 398–400
  multithreading history, L-35
IBM Power 7
  vs. Google WSC, 436
  ideal processors, 214–215
  multicore processor performance, 400–401
  multithreading, 225

IBM Pulsar processor, L-34
IBM RP3, L-60
IBM RS/6000, L-57
IBM RT-PC, L-20
IBM SAGE, L-81
IBM servers, economies of scale, 456
IBM Stretch, L-6
IBM zSeries, vector processor history, G-27
IC, see Instruction count (IC)
I-caches
  case study examples, B-63
  way prediction, 81–82

ICR, see Idle Control Register (ICR)
ID, see Instruction decode (ID)
Ideal pipeline cycles per instruction, ILP concepts, 149
Ideal processors, ILP hardware model, 214–215, 219–220
IDE disks, Berkeley’s Tertiary Disk project, D-12
Idle Control Register (ICR), TI TMS320C55 DSP, E-8
Idle domains, TI TMS320C55 DSP, E-8
IEEE 754 floating-point standard, J-16
IEEE 1394, Sony PlayStation 2 Emotion Engine case study, E-15

IEEE arithmetic
  floating point, J-13 to J-14
    addition, J-21 to J-25
    exceptions, J-34 to J-35
    remainder, J-31 to J-32
    underflow, J-36
  historical background, J-63 to J-64
  iterative division, J-30
  –x vs. 0 – x, J-62
  NaN, J-14
  rounding modes, J-20
  single-precision numbers, J-15 to J-16


Index ■ I-33

IEEE standard 802.3 (Ethernet), F-77 to F-79
  LAN history, F-99
IF, see Instruction fetch (IF) cycle
IF statement handling
  control dependences, 154
  GPU conditional branching, 300, 302–303
  memory consistency, 392
  vectorization in code, 271
  vector-mask registers, 267, 275–276

Illiac IV, F-100, L-43, L-55
ILP, see Instruction-level parallelism (ILP)
Immediate addressing mode
  ALU operations, A-12
  basic considerations, A-10 to A-11
  MIPS, 12
  MIPS instruction format, A-35
  MIPS operations, A-37
  value distribution, A-13

IMPACT, L-31
Implicit operands, ISA classifications, A-3
Implicit unit stride, GPUs vs. vector architectures, 310
Imprecise exceptions
  data hazards, 169–170
  floating-point, 188

IMT-2000, see International Mobile Telephony 2000 (IMT-2000)

Inactive power modes, WSCs, 472
Inclusion
  cache hierarchy, 397–398
  implementation, 397–398
  invalidate protocols, 357
  memory hierarchy history, L-11

Indexed addressing
  Intel 80x86, K-49, K-58
  VAX, K-67

Indexes
  address translation during, B-36 to B-40
  AMD Opteron data cache, B-13 to B-14
  ARM Cortex-A8, 115
  recurrences, H-12
  size equations, B-22
Index field, block identification, B-8
Index vector, gather/scatter operations, 279–280

Indirect addressing, VAX, K-67
Indirect networks, definition, F-31
Inexact exception
  floating-point arithmetic, J-35
  floating-point underflow, J-36

InfiniBand, F-64, F-67, F-74 to F-77
  cluster history, L-63
  packet format, F-75
  storage area network history, F-102
  switch vs. NIC, F-86
  system area network history, F-101

Infinite population model, queuing model, D-30

In flight instructions, ILP hardware model, 214

Information tables, examples, 176–177

Infrastructure costs
  WSC, 446–450, 452–455, 453
  WSC efficiency, 450–452

Initiation interval, MIPS pipeline FP operations, C-52 to C-53

Initiation rate
  floating-point pipeline, C-65 to C-66
  memory banks, 276–277
  vector execution time, 269

Inktomi, L-62, L-73
In-order commit
  hardware-based speculation, 188–189
  speculation concept origins, L-29
In-order execution
  average memory access time, B-17 to B-18
  cache behavior calculations, B-18
  cache miss, B-2 to B-3
  dynamic scheduling, 168–169
  IBM Power processors, 247
  ILP exploitation, 193–194
  multiple-issue processors, 194
  superscalar processors, 193

In-order floating-point pipeline, dynamic scheduling, 169

In-order issue
  ARM Cortex-A8, 233
  dynamic scheduling, 168–170, C-71
  ISA, 241
In-order scalar processors, VMIPS, 267

Input buffered switch
  HOL blocking, F-59, F-60
  microarchitecture, F-57, F-57
  pipelined version, F-61

Input-output buffered switch, microarchitecture, F-57

Instruction cache
  AMD Opteron example, B-15
  antialiasing, B-38
  application/OS misses, B-59
  branch prediction, C-28
  commercial workload, 373
  GPU Memory, 306
  instruction fetch, 202–203, 237
  ISA, 241
  MIPS R4000 pipeline, C-63
  miss rates, 161
  multiprogramming workload, 374–375
  prefetch, 236
  RISCs, A-23
  TI TMS320C55 DSP, E-8

Instruction commit
  hardware-based speculation, 184–185, 187–188, 188, 190
  instruction set complications, C-49
  Intel Core i7, 237
  speculation support, 208–209

Instruction count (IC)
  addressing modes, A-10
  cache performance, B-4, B-16
  compiler optimization, A-29, A-29 to A-30
  processor performance time, 49–51
  RISC history, L-22

Instruction decode (ID)
  basic MIPS pipeline, C-36
  branch hazards, C-21
  data hazards, 169
  hazards and forwarding, C-55 to C-57
  MIPS pipeline, C-71
  MIPS pipeline control, C-36 to C-39
  MIPS pipeline FP operations, C-53
  MIPS scoreboarding, C-72 to C-74
  out-of-order execution, 170
  pipeline branch issues, C-39 to C-41, C-42
  RISC classic pipeline, C-7 to C-8, C-10


I-34 ■ Index

Instruction decode (continued)
  simple MIPS implementation, C-31
  simple RISC implementation, C-5 to C-6
Instruction delivery stage, Itanium 2, H-42
Instruction fetch (IF) cycle
  basic MIPS pipeline, C-35 to C-36
  branch hazards, C-21
  branch-prediction buffers, C-28
  exception stopping/restarting, C-46 to C-47
  MIPS exceptions, C-48
  MIPS R4000, C-63
  pipeline branch issues, C-42
  RISC classic pipeline, C-7, C-10
  simple MIPS implementation, C-31
  simple RISC implementation, C-5

Instruction fetch units
  integrated, 207–208
  Intel Core i7, 237

Instruction formats
  ARM-unique, K-36 to K-37
  high-level language computer architecture, L-18
  IA-64 ISA, H-34 to H-35, H-38, H-39
  IBM 360, K-85 to K-88
  Intel 80x86, K-49, K-52, K-56 to K-57
  M32R-unique, K-39 to K-40
  MIPS16-unique, K-40 to K-42
  PA-RISC unique, K-33 to K-36
  PowerPC-unique, K-32 to K-33
  RISCs, K-43
    Alpha-unique, K-27 to K-29
    arithmetic/logical, K-11, K-15
    branches, K-25
    control instructions, K-12, K-16
    data transfers, K-10, K-14, K-21
    desktop/server, K-7
    desktop/server systems, K-7
    embedded DSP extensions, K-19
    embedded systems, K-8
    FP instructions, K-13
    hardware description notation, K-25
    MIPS64-unique, K-24 to K-27
    MIPS core, K-6 to K-9
    MIPS core extensions, K-19 to K-24
    MIPS unaligned word reads, K-26
    multimedia extensions, K-16 to K-19
    overview, K-5 to K-6
  SPARC-unique, K-29 to K-32
  SuperH-unique, K-38 to K-39
  Thumb-unique, K-37 to K-38

Instruction groups, IA-64, H-34
Instruction issue
  definition, C-36
  DLP, 322
  dynamic scheduling, 168–169, C-71 to C-72
  ILP, 197, 216–217
  instruction-level parallelism, 2
  Intel Core i7, 238
  Itanium 2, H-41 to H-43
  MIPS pipeline, C-52
  multiple issue processor, 198
  multithreading, 223, 226
  parallelism measurement, 215
  precise exceptions, C-58, C-60
  processor comparison, 323
  ROB, 186
  speculation support, 208, 210
  Tomasulo’s scheme, 175, 182

Instruction-level parallelism (ILP)
  ARM Cortex-A8, 233–236, 235–236
  basic concepts/challenges, 148–149, 149
  “big and dumb” processors, 245
  branch-prediction buffers, C-29, C-29 to C-30
  compiler scheduling, L-31
  compiler techniques for exposure, 156–162
  control dependence, 154–156
  data dependences, 150–152
  data flow limit, L-33
  definition, 9, 149–150
  dynamic scheduling
    basic concept, 168–169
    definition, 168
    example and algorithms, 176–178
    multiple issue, speculation, 197–202
    overcoming data hazards, 167–176
    Tomasulo’s algorithm, 170–176, 178–179, 181–183
  early studies, L-32 to L-33
  exploitation methods, H-22 to H-23
  exploitation statically, H-2
  exposing with hardware support, H-23
  GPU programming, 289
  hardware-based speculation, 183–192
  hardware vs. software speculation, 221–222
  IA-64, H-32
  instruction fetch bandwidth
    basic considerations, 202–203
    branch-target buffers, 203–206, 204
    integrated units, 207–208
    return address predictors, 206–207
  Intel Core i7, 236–241
  limitation studies, 213–221
  microarchitectural techniques case study, 247–254
  MIPS scoreboarding, C-77 to C-79
  multicore performance/energy efficiency, 404
  multicore processor performance, 400
  multiple-issue processors, L-30
  multiple issue/static scheduling, 192–196
  multiprocessor importance, 344
  multithreading, basic considerations, 223–226
  multithreading history, L-34 to L-35
  name dependences, 152–153
  perfect processor, 215
  pipeline scheduling/loop unrolling, 157–162
  processor clock rates, 244
  realizable processor limitations, 216–218
  RISC development, 2
  SMT on superscalar processors, 230–232
  speculation advantages/disadvantages, 210–211


Index ■ I-35

  speculation and energy efficiency, 211–212
  speculation support, 208–210
  speculation through multiple branches, 211
  speculative execution, 222–223
  Sun T1 fine-grained multithreading effectiveness, 226–229
  switch to DLP/TLP/RLP, 4–5
  TI 320C6x DSP, E-8
  value prediction, 212–213

Instruction path length, processor performance time, 49

Instruction prefetch
  integrated instruction fetch units, 208
  miss penalty/rate reduction, 91–92
  SPEC benchmarks, 92

Instruction register (IR)
  basic MIPS pipeline, C-35
  dynamic scheduling, 170
  MIPS implementation, C-31

Instruction set architecture (ISA), see also Intel 80x86 processors; Reduced Instruction Set Computer (RISC)

  addressing modes, A-9 to A-10
  architect-compiler writer relationship, A-29 to A-30
  ARM Cortex-A8, 114
  case studies, A-47 to A-54
  class code sequence example, A-4
  classification, A-3 to A-7
  code size-compiler considerations, A-43 to A-44
  compiler optimization and performance, A-27
  compiler register allocation, A-26 to A-27
  compiler structure, A-24 to A-26
  compiler technology and architecture decisions, A-27 to A-29
  compiler types and classes, A-28
  complications, C-49 to C-51
  computer architecture definition, L-17 to L-18
  control flow instructions
    addressing modes, A-17 to A-18
    basic considerations, A-16 to A-17, A-20 to A-21
    conditional branch options, A-19
    procedure invocation options, A-19 to A-20
  Cray X1, G-21 to G-22
  data access distribution example, A-15
  definition and types, 11–15
  displacement addressing mode, A-10
  encoding considerations, A-21 to A-24, A-22, A-24
  first vector computers, L-48
  flawless design, A-45
  flaws vs. success, A-44 to A-45
  GPR advantages/disadvantages, A-6
  high-level considerations, A-39, A-41 to A-43
  high-level language computer architecture, L-18 to L-19
  IA-64
    instruction formats, H-39
    instructions, H-35 to H-37
    instruction set basics, H-38
    overview, H-32 to H-33
    predication and speculation, H-38 to H-40
  IBM 360, K-85 to K-88
  immediate addressing mode, A-10 to A-11
  literal addressing mode, A-10 to A-11
  memory addressing, A-11 to A-13
  memory address interpretation, A-7 to A-8
  MIPS
    addressing modes for data transfer, A-34
    basic considerations, A-32 to A-33
    control flow instructions, A-37 to A-38
    data types, A-34
    dynamic instruction mix, A-41 to A-42, A-42
    FP operations, A-38 to A-39
    instruction format, A-35
    MIPS operations, A-35 to A-37
    registers, A-34
    usage, A-39
  MIPS64, 14, A-40
  multimedia instruction compiler support, A-31 to A-32
  NVIDIA GPU, 298–300
  operand locations, A-4
  operands per ALU instruction, A-6
  operand type and size, A-13 to A-14
  operations, A-14 to A-16
  operator categories, A-15
  overview, K-2
  performance and efficiency prediction, 241–243
  and protection, 112
  RISC code size, A-23 to A-24
  RISC history, L-19 to L-22, L-21
  stack architectures, L-16 to L-17
  top 80x86 instructions, A-16
  “typical” program fallacy, A-43
  Virtual Machines protection, 107–108
  Virtual Machines support, 109–110
  VMIPS, 264–265
  VMM implementation, 128–129

Instructions per clock (IPC)
  ARM Cortex-A8, 236
  flawless architecture design, A-45
  ILP for realizable processors, 216–218
  MIPS scoreboarding, C-72
  multiprocessing/multithreading-based performance, 398–400
  processor performance time, 49
  Sun T1 multithreading unicore performance, 229
  Sun T1 processor, 399

Instruction status
  dynamic scheduling, 177
  MIPS scoreboarding, C-75

Integer arithmetic
  addition speedup
    carry-lookahead, J-37 to J-41
    carry-lookahead circuit, J-38
    carry-lookahead tree, J-40
    carry-lookahead tree adder, J-41
    carry-select adder, J-43, J-43 to J-44, J-44


I-36 ■ Index

Integer arithmetic (continued)
    carry-skip adder, J-41 to J-43, J-42
    overview, J-37
  division
    radix-2 division, J-55
    radix-4 division, J-56
    radix-4 SRT division, J-57
    with single adder, J-54 to J-58
  FP conversions, J-62
  language comparison, J-12
  multiplication
    array multiplier, J-50
    Booth recoding, J-49
    even/odd array, J-52
    with many adders, J-50 to J-54
    multipass array multiplier, J-51
    signed-digit addition table, J-54
    with single adder, J-47 to J-49, J-48
    Wallace tree, J-53
  multiplication/division, shifting over zeros, J-45 to J-47
  overflow, J-11
  radix-2 multiplication/division, J-4, J-4 to J-7
  restoring/nonrestoring division, J-6
  ripple-carry addition, J-2 to J-3, J-3
  signed numbers, J-7 to J-10
  SRT division, J-45 to J-47, J-46
  systems issues, J-10 to J-13

Integer operand
  flawed architecture, A-44
  GCD, 319
  graph coloring, A-27
  instruction set encoding, A-23
  MIPS data types, A-34
  as operand type, 12, A-13 to A-14

Integer operations
  addressing modes, A-11
  ALUs, A-12, C-54
  ARM Cortex-A8, 116, 232, 235, 236
  benchmarks, 167, C-69
  branches, A-18 to A-20, A-20
  cache misses, 83–84
  data access distribution, A-15
  data dependences, 151
  dependences, 322
  desktop benchmarks, 38–39
  displacement values, A-12
  exceptions, C-43, C-45
  hardware ILP model, 215
  hardware vs. software speculation, 221
  hazards, C-57
  IBM 360, K-85
  ILP, 197–200
  instruction set operations, A-16
  Intel Core i7, 238, 240
  Intel 80x86, K-50 to K-51
  ISA, 242, A-2
  Itanium 2, H-41
  longer latency pipelines, C-55
  MIPS, C-31 to C-32, C-36, C-49, C-51 to C-53
  MIPS64 ISA, 14
  MIPS FP pipeline, C-60
  MIPS R4000 pipeline, C-61, C-63, C-70
  misspeculation, 212
  MVL, 274
  pipeline scheduling, 157
  precise exceptions, C-47, C-58, C-60
  processor clock rate, 244
  R4000 pipeline, C-63
  realizable processor ILP, 216–218
  RISC, C-5, C-11
  scoreboarding, C-72 to C-73, C-76
  SIMD processor, 307
  SPARC, K-31
  SPEC benchmarks, 39
  speculation through multiple branches, 211
  static branch prediction, C-26 to C-27
  T1 multithreading unicore performance, 227–229
  Tomasulo’s algorithm, 181
  tournament predictors, 164
  VMIPS, 265

Integer registers
  hardware-based speculation, 192
  IA-64, H-33 to H-34
  MIPS dynamic instructions, A-41 to A-42
  MIPS floating-point operations, A-39
  MIPS64 architecture, A-34
  VLIW, 194

Integrated circuit basics
  cell phones, E-24, E-24
  cost trends, 28–32
  dependability, 33–36
  logic technology, 17
  microprocessor developments, 2
  power and energy, 21–23
  scaling, 19–21

Intel 80286, L-9
Intel Atom 230
  processor comparison, 242
  single-threaded benchmarks, 243

Intel Atom processors
  ISA performance and efficiency prediction, 241–243
  performance measurement, 405–406
  SMT, 231
  WSC memory, 474
  WSC processor cost-performance, 473
Intel Core i7

  vs. Alpha processors, 368
  architecture, 15
  basic function, 236–238
  “big and dumb” processors, 245
  branch predictor, 166–167
  clock rate, 244
  dynamic scheduling, 170
  GPU comparisons, 324–330, 325
  hardware prefetching, 91
  ISA performance and efficiency prediction, 241–243
  L2/L3 miss rates, 125
  memory hierarchy basics, 78, 117–124, 119
  memory hierarchy design, 73
  memory performance, 122–124
  MESIF protocol, 362
  microprocessor die example, 29
  miss rate benchmarks, 123
  multibanked caches, 86
  multithreading, 225
  nonblocking cache, 83
  performance, 239, 239–241, 240
  performance/energy efficiency, 401–405
  pipelined cache access, 82
  pipeline structure, 237
  processor comparison, 242
  raw/relative GPU performance, 328
  Roofline model, 286–288, 287


Index ■ I-37

Intel Core i7 (continued)
  single-threaded benchmarks, 243
  SMP limitations, 363
  SMT, 230–231
  snooping cache coherence implementation, 365
  three-level cache hierarchy, 118
  TLB structure, 118
  write invalidate protocol, 356

Intel 80x86 processors
  address encoding, K-58
  addressing modes, K-58
  address space, B-58
  architecture flaws and success, K-81
  architecture flaws vs. success, A-44 to A-45
  Atom, 231
  cache performance, B-6
  characteristics, K-42
  common exceptions, C-44
  comparative operation measurements, K-62 to K-64
  floating-point operations, K-52 to K-55, K-54, K-61
  instruction formats, K-56 to K-57
  instruction lengths, K-60
  instruction mix, K-61 to K-62
  instructions vs. DLX, K-63 to K-64
  instruction set encoding, A-23, K-55
  instruction set usage measurements, K-56 to K-64
  instructions and functions, K-52
  instruction types, K-49
  integer operations, K-50 to K-51
  integer overflow, J-11
  Intel Core i7, 117
  ISA, 11–12, 14–15, A-2
  memory accesses, B-6
  memory addressing, A-8
  memory hierarchy development, L-9
  multimedia support, K-17
  operand addressing mode, K-59, K-59 to K-60
  operand type distribution, K-59
  overview, K-45 to K-47
  process protection, B-50
  vs. RISC, 2, A-3
  segmented scheme, K-50
  system evolution, K-48
  top instructions, A-16
  typical operations, K-53
  variable encoding, A-22 to A-23
  virtualization issues, 128
  Virtual Machines ISA support, 109
  Virtual Machines and virtual memory and I/O, 110
Intel 8087, floating point remainder, J-31
Intel i860, K-16 to K-17, L-49, L-60
Intel IA-32 architecture

call gate, B-54descriptor table, B-52instruction set complications, C-49

to C-51OCNs, F-3, F-70segment descriptors, B-53segmented virtual memory, B-51

to B-54Intel IA-64 architecture

compiler scheduling history, L-31conditional instructions, H-27explicit parallelism, H-34 to H-35historical background, L-32ISA

instruction formats, H-39instructions, H-35 to H-37instruction set basics, H-38overview, H-32 to H-33predication and speculation,

H-38 to H-40Itanium 2 processor

instruction latency, H-41overview, H-40 to H-41performance, H-43, H-43

multiple issue processor approaches, 194

parallelism exploitation statically, H-2

register model, H-33 to H-34RISC history, L-22software pipelining, H-15synchronization history, L-64

Intel iPSC 860, L-60Intel Itanium, sparse matrices, G-13Intel Itanium 2

“big and dumb” processors, 245clock rate, 244

IA-64functional units and instruction

issue, H-41 to H-43instruction latency, H-41overview, H-40 to H-41performance, H-43

peak performance, 58SPEC benchmarks, 43

Intelligent devices, historical background, L-80

Intel MMX, multimedia instruction compiler support, A-31 to A-32

Intel Nehalemcharacteristics, 411floorplan, 30WSC processor cost-performance,

473Intel Paragon, F-100, L-60Intel Pentium 4

hardware prefetching, 92Itanium 2 comparison, H-43multithreading history, L-35

Intel Pentium 4 Extreme, L-33 to L-34Intel Pentium II, L-33Intel Pentium III

pipelined cache access, 82power consumption, F-85

Intel Pentium M, power consumption, F-85

Intel Pentium MMX, multimedia support, E-11

Intel Pentium Pro, 82, L-33Intel Pentium processors

“big and dumb” processors, 245clock rate, 244early computer arithmetic, J-64 to

J-65vs. Opteron memory protection, B-57pipelining performance, C-10segmented virtual memory

example, B-51 to B-54SMT, 230

Intel processorsearly RISC designs, 2power consumption, F-85

Intel Single-Chip Cloud Computing (SCCC)

as interconnection example, F-70 to F-72

OCNs, F-3



Intel Streaming SIMD Extension (SSE)
    basic function, 283
    Multimedia SIMD Extensions, A-31
    vs. vector architectures, 282
Intel Teraflops processors, OCNs, F-3
Intel Thunder Tiger 4 QsNetII, F-63, F-76
Intel VT-x, 129
Intel x86
    Amazon Web Services, 456
    AVX instructions, 284
    clock rates, 244
    computer architecture, 15
    conditional instructions, H-27
    GPUs as coprocessors, 330–331
    Intel Core i7, 237–238
    Multimedia SIMD Extensions, 282–283
    NVIDIA GPU ISA, 298
    parallelism, 262–263
    performance and energy efficiency, 241
    vs. PTX, 298
    RISC, 2
    speedup via parallelism, 263
Intel Xeon
    Amazon Web Services, 457
    cache coherence, 361
    file system benchmarking, D-20
    InfiniBand, F-76
    multicore processor performance, 400–401
    performance, 400
    performance measurement, 405–406
    SMP limitations, 363
    SPECPower benchmarks, 463
    WSC processor cost-performance, 473
Interactive workloads, WSC goals/requirements, 433
Interarrival times, queuing model, D-30
Interconnection networks
    adaptive routing, F-93 to F-94
    adaptive routing and fault tolerance, F-94
    arbitration, F-49, F-49 to F-50
    basic characteristics, F-2, F-20
    bisection bandwidth, F-89
    commercial
        congestion management, F-64 to F-66
        connectivity, F-62 to F-63
        cross-company interoperability, F-63 to F-64
        DECstation 5000 reboots, F-69
        fault tolerance, F-66 to F-69
    commercial routing/arbitration/switching, F-56
    communication bandwidth, I-3
    compute-optimized processors vs. receiver overhead, F-88
    density- vs. SPEC-optimized processors, F-85
    device example, F-3
    direct vs. high-dimensional, F-92
    domains, F-3 to F-5, F-4
    Ethernet, F-77 to F-79, F-78
    Ethernet/ATM total time statistics, F-90
    examples, F-70
    HOL blocking, F-59
    IBM Blue Gene/L, I-43
    InfiniBand, F-75
    LAN history, F-99 to F-100
    link bandwidth, F-89
    memory hierarchy interface, F-87 to F-88
    mesh network routing, F-46
    MIN vs. direct network costs, F-92
    multicore single-chip multiprocessor, 364
    multi-device connections
        basic considerations, F-20 to F-21
        effective bandwidth vs. nodes, F-28
        latency vs. nodes, F-27
        performance characterization, F-25 to F-29
        shared-media networks, F-22 to F-24
        shared- vs. switched-media networks, F-22
        switched-media networks, F-24
        topology, routing, arbitration, switching, F-21 to F-22
    multi-device interconnections, shared- vs. switched-media networks, F-24 to F-25
    network-only features, F-94 to F-95
    NIC vs. I/O subsystems, F-90 to F-91
    OCN characteristics, F-73
    OCN example, F-70 to F-72
    OCN history, F-103 to F-104
    protection, F-86 to F-87
    routing, F-44 to F-48, F-54
    routing/arbitration/switching impact, F-52 to F-55
    SAN characteristics, F-76
    software overhead, F-91 to F-92
    speed considerations, F-88
    storage area networks, F-102 to F-103
    switching, F-50 to F-52
    switch microarchitecture, F-57
        basic microarchitecture, F-55 to F-58
        buffer organizations, F-58 to F-60
        pipelining, F-60 to F-61, F-61
    switch vs. NIC, F-85 to F-86, F-86
    system area networks, F-72 to F-74, F-100 to F-102
    system/storage area network, F-74 to F-77
    TCP/IP reliance, F-95
    top-level architecture, F-71
    topology, F-44
        basic considerations, F-29 to F-30
        Benes networks, F-33
        centralized switched networks, F-30 to F-34, F-31
        direct networks, F-37
        distributed switched networks, F-34 to F-40
        performance and costs, F-40
        performance effects, F-40 to F-44
        ring network, F-36
    two-device interconnections
        basic considerations, F-5 to F-6
        effective bandwidth vs. packet size, F-19
        example, F-6
        interface functions, F-6 to F-9
        performance, F-12 to F-20
        structure and functions, F-9 to F-12



    virtual channels and throughput, F-93
    WAN example, F-79
    WANs, F-97 to F-99
    wormhole switching performance, F-92 to F-93
    zero-copy protocols, F-91
Intermittent faults, storage systems, D-11
Internal fragmentation, virtual memory page size selection, B-47
Internal Mask Registers, definition, 309
International Computer Architecture Symposium (ISCA), L-11 to L-12
International Mobile Telephony 2000 (IMT-2000), cell phone standards, E-25
Internet
    Amazon Web Services, 457
    array switch, 443
    cloud computing, 455–456, 461
    data-intensive applications, 344
    dependability, 33
    Google WSC, 464
    Layer 3 network linkage, 445
    Netflix traffic, 460
    SaaS, 4
    WSC efficiency, 452
    WSC memory hierarchy, 445
    WSCs, 432–433, 435, 437, 439, 446, 453–455
Internet Archive Cluster
    container history, L-74 to L-75
    overview, D-37
    performance, dependability, cost, D-38 to D-40
    TB-80 cluster MTTF, D-40 to D-41
    TB-80 VME rack, D-38
Internet Protocol (IP)
    internetworking, F-83
    storage area network history, F-102
    WAN history, F-98
Internet Protocol (IP) cores, OCNs, F-3
Internet Protocol (IP) routers, VOQs, F-60
Internetworking
    connection example, F-80
    cost, F-80
    definition, F-2
    enabling technologies, F-80 to F-81
    OSI model layers, F-81, F-82
    protocol-level communication, F-81 to F-82
    protocol stack, F-83, F-83
    role, F-81
    TCP/IP, F-81, F-83 to F-84
    TCP/IP headers, F-84
Interprocedural analysis, basic approach, H-10
Interprocessor communication, large-scale multiprocessors, I-3 to I-6
Interrupt, see Exceptions
Invalidate protocol
    directory-based cache coherence protocol example, 382–383
    example, 359, 360
    implementation, 356–357
    snooping coherence, 355, 355–356
Invalid exception, floating-point arithmetic, J-35
Inverted page table, virtual memory block identification, B-44 to B-45
I/O bandwidth, definition, D-15
I/O benchmarks, response time restrictions, D-18
I/O bound workload, Virtual Machines protection, 108
I/O bus
    historical background, L-80 to L-81
    interconnection networks, F-88
    point-to-point replacement, D-34
    Sony PlayStation 2 Emotion Engine case study, E-15
I/O cache coherency, basic considerations, 113
I/O devices
    address translation, B-38
    average memory access time, B-17
    cache coherence enforcement, 354
    centralized shared-memory multiprocessors, 351
    future GPU features, 332
    historical background, L-80 to L-81
    inclusion, B-34
    Multimedia SIMD vs. GPUs, 312
    multiprocessor cost effectiveness, 407
    performance, D-15 to D-16
    SANs, F-3 to F-4
    shared-media networks, F-23
    switched networks, F-2
    switch vs. NIC, F-86
    Virtual Machines impact, 110–111
    write strategy, B-11
    Xen VM, 111
I/O interfaces
    disk storage, D-4
    storage area network history, F-102
I/O latency, shared-memory workloads, 368–369, 371
I/O network, commercial interconnection network connectivity, F-63
IOP, see I/O processor (IOP)
I/O processor (IOP)
    first dynamic scheduling, L-27
    Sony PlayStation 2 Emotion Engine case study, E-15
I/O registers, write buffer merging, 87
I/O subsystems
    design, D-59 to D-61
    interconnection network speed, F-88
    vs. NIC, F-90 to F-91
    zero-copy protocols, F-91
I/O systems
    asynchronous, D-35
    as black box, D-23
    dirty bits, D-61 to D-64
    Internet Archive Cluster, see Internet Archive Cluster
    multithreading history, L-34
    queuing theory, D-23
    queue calculations, D-29
    random variable distribution, D-26
    utilization calculations, D-26
IP, see Intellectual Property (IP) cores; Internet Protocol (IP)
IPC, see Instructions per clock (IPC)
IPoIB, F-77
IR, see Instruction register (IR)
ISA, see Instruction set architecture (ISA)



ISCA, see International Computer Architecture Symposium (ISCA)
iSCSI
    NetApp FAS6000 filer, D-42
    storage area network history, F-102
Issue logic
    ARM Cortex-A8, 233
    ILP, 197
    longer latency pipelines, C-57
    multiple issue processor, 198
    register renaming vs. ROB, 210
    speculation support, 210
Issue stage
    ID pipe stage, 170
    instruction steps, 174
    MIPS with scoreboard, C-73 to C-74
    out-of-order execution, C-71
    ROB instruction, 186
Iterative division, floating point, J-27 to J-31

J
Java benchmarks
    Intel Core i7, 401–405
    SMT on superscalar processors, 230–232
    without SMT, 403–404
Java language
    dependence analysis, H-10
    hardware impact on software development, 4
    return address predictors, 206
    SMT, 230–232, 402–405
    SPECjbb, 40
    SPECpower, 52
    virtual functions/methods, A-18
Java Virtual Machine (JVM)
    early stack architectures, L-17
    IBM, 463
    multicore processor performance, 400
    multithreading-based speedup, 232
    SPECjbb, 53
JBOD, see RAID 0
Johnson, Reynold B., L-77
Jump prediction
    hardware model, 214
    ideal processor, 214
Jumps
    control flow instructions, 14, A-16, A-17, A-21
    GPU conditional branching, 301–302
    MIPS control flow instructions, A-37 to A-38
    MIPS operations, A-35
    return address predictors, 206
    RISC instruction set, C-5
    VAX, K-71 to K-72
Just-in-time (JIT), L-17
JVM, see Java Virtual Machine (JVM)

K
Kahle, Brewster, L-74
Kahn, Robert, F-97
k-ary n-cubes, definition, F-38
Kendall Square Research KSR-1, L-61
Kernels
    arithmetic intensity, 286, 286–287, 327
    benchmarks, 56
    bytes per reference vs. block size, 378
    caches, 329
    commercial workload, 369–370
    compilers, A-24
    compute bandwidth, 328
    via computing, 327
    EEMBC benchmarks, 38, E-12
    FFT, I-7
    FORTRAN, compiler vectorization, G-15
    FP benchmarks, C-29
    Livermore Fortran kernels, 331
    LU, I-8
    multimedia instructions, A-31
    multiprocessor architecture, 408
    multiprogramming workload, 375–378, 377
    performance benchmarks, 37, 331
    primitives, A-30
    protecting processes, B-50
    segmented virtual memory, B-51
    SIMD exploitation, 330
    vector, on vector processor and GPU, 334–336
    virtual memory protection, 106
    WSCs, 438

L
L1 caches, see also First-level caches
    address translation, B-46
    Alpha 21164 hierarchy, 368
    ARM Cortex-A8, 116, 116, 235
    ARM Cortex-A8 vs. A9, 236
    ARM Cortex-A8 example, 117
    cache optimization, B-31 to B-33
    case study examples, B-60, B-63 to B-64
    directory-based coherence, 418
    Fermi GPU, 306
    hardware prefetching, 91
    hit time/power reduction, 79–80
    inclusion, 397–398, B-34 to B-35
    Intel Core i7, 118–119, 121–122, 123, 124, 124, 239, 241
    invalidate protocol, 355, 356–357
    memory consistency, 392
    memory hierarchy, B-39
    miss rates, 376–377
    multiprocessor cache coherence, 352
    multiprogramming workload, 374
    nonblocking cache, 85
    NVIDIA GPU Memory, 304
    Opteron memory, B-57
    processor comparison, 242
    speculative execution, 223
    T1 multithreading unicore performance, 228
    virtual memory, B-48 to B-49
L2 caches, see also Second-level caches
    ARM Cortex-A8, 114, 115–116, 235–236
    ARM Cortex-A8 example, 117
    cache optimization, B-31 to B-33, B-34
    case study example, B-63 to B-64
    coherency, 352
    commercial workloads, 373
    directory-based coherence, 379, 418–420, 422, 424
    fault detection, 58
    Fermi GPU, 296, 306, 308
    hardware prefetching, 91
    IBM Blue Gene/L, I-42
    inclusion, 397–398, B-35
    Intel Core i7, 118, 120–122, 124, 124–125, 239, 241
    invalidation protocol, 355, 356–357
    and ISA, 241
    memory consistency, 392
    memory hierarchy, B-39, B-48, B-57



L2 caches (continued)
    multithreading, 225, 228
    nonblocking cache, 85
    NVIDIA GPU Memory, 304
    processor comparison, 242
    snooping coherence, 359–361
    speculation, 223
L3 caches, see also Third-level caches
    Alpha 21164 hierarchy, 368
    coherence, 352
    commercial workloads, 370, 371, 374
    directory-based coherence, 379, 384
    IBM Blue Gene/L, I-42
    IBM Power processors, 247
    inclusion, 398
    Intel Core i7, 118, 121, 124, 124–125, 239, 241, 403–404
    invalidation protocol, 355, 356–357, 360
    memory access cycle shift, 372
    miss rates, 373
    multicore processors, 400–401
    multithreading, 225
    nonblocking cache, 83
    performance/price/power considerations, 52
    snooping coherence, 359, 361, 363
LabVIEW, embedded benchmarks, E-13
Lampson, Butler, F-99
Lanes
    GPUs vs. vector architectures, 310
    Sequence of SIMD Lane Operations, 292, 313
    SIMD Lane Registers, 309, 314
    SIMD Lanes, 296–297, 297, 302–303, 308, 309, 311–312, 314
    vector execution time, 269
    vector instruction set, 271–273
    Vector Lane Registers, 292
    Vector Lanes, 292, 296–297, 309, 311
LANs, see Local area networks (LANs)
Large-scale multiprocessors
    cache coherence implementation
        deadlock and buffering, I-38 to I-40
        directory controller, I-40 to I-41
        DSM multiprocessor, I-36 to I-37
        overview, I-34 to I-36
    classification, I-45
    cluster history, L-62 to L-63
    historical background, L-60 to L-61
    IBM Blue Gene/L, I-41 to I-44, I-43 to I-44
    interprocessor communication, I-3 to I-6
    for parallel programming, I-2
    scientific application performance
        distributed-memory multiprocessors, I-26 to I-32, I-28 to I-32
        parallel processors, I-33 to I-34
        symmetric shared-memory multiprocessor, I-21 to I-26, I-23 to I-25
    scientific applications, I-6 to I-12
    space and relation of classes, I-46
    synchronization mechanisms, I-17 to I-21
    synchronization performance, I-12 to I-16
Latency, see also Response time
    advanced directory protocol case study, 425
    vs. bandwidth, 18–19, 19
    barrier synchronization, I-16
    and cache miss, B-2 to B-3
    cluster history, L-73
    communication mechanism, I-3 to I-4
    definition, D-15
    deterministic vs. adaptive routing, F-52 to F-55
    directory coherence, 425
    distributed-memory multiprocessors, I-30, I-32
    dynamically scheduled pipelines, C-70 to C-71
    Flash memory, D-3
    FP operations, 157
    FP pipeline, C-66
    functional units, C-53
    GPU SIMD instructions, 296
    GPUs vs. vector architectures, 311
    hazards and forwarding, C-54 to C-58
    hiding with speculation, 396–397
    ILP exposure, 157
    ILP without multithreading, 225
    ILP for realizable processors, 216–218
    Intel SCCC, F-70
    interconnection networks, F-12 to F-20
        multi-device networks, F-25 to F-29
    Itanium 2 instructions, H-41
    microarchitectural techniques case study, 247–254
    MIPS pipeline FP operations, C-52 to C-53
    misses, single vs. multiple thread executions, 228
    multimedia instruction compiler support, A-31
    NVIDIA GPU Memory structures, 305
    OCNs vs. SANs, F-27
    out-of-order processors, B-20 to B-21
    packets, F-13, F-14
    parallel processing, 350
    performance milestones, 20
    pipeline, C-87
    ROB commit, 187
    routing, F-50
    routing/arbitration/switching impact, F-52
    routing comparison, F-54
    SAN example, F-73
    shared-memory workloads, 368
    snooping coherence, 414
    Sony PlayStation 2 Emotion Engine, E-17
    Sun T1 multithreading, 226–229
    switched network topology, F-40 to F-41
    system area network history, F-101
    vs. TCP/IP reliance, F-95
    throughput vs. response time, D-17
    utility computing, L-74
    vector memory systems, G-9
    vector start-up, G-8
    WSC efficiency, 450–452
    WSC memory hierarchy, 443, 443–444, 444, 445
    WSC processor cost-performance, 472–473
    WSCs vs. datacenters, 456



Layer 3 network, array and Internet linkage, 445
Layer 3 network, WSC memory hierarchy, 445
LCA, see Least common ancestor (LCA)
LCD, see Liquid crystal display (LCD)
Learning curve, cost trends, 27
Least common ancestor (LCA), routing algorithms, F-48
Least recently used (LRU)
    AMD Opteron data cache, B-12, B-14
    block replacement, B-9
    memory hierarchy history, L-11
    virtual memory block replacement, B-45
Less than condition code, PowerPC, K-10 to K-11
Level 3, as Content Delivery Network, 460
Limit field, IA-32 descriptor table, B-52
Line, memory hierarchy basics, 74
Linear speedup
    cost effectiveness, 407
    IBM eServer p5 multiprocessor, 408
    multicore processors, 400, 402
    performance, 405–406
Line locking, embedded systems, E-4 to E-5
Link injection bandwidth
    calculation, F-17
    interconnection networks, F-89
Link pipelining, definition, F-16
Link reception bandwidth, calculation, F-17
Link register
    MIPS control flow instructions, A-37 to A-38
    PowerPC instructions, K-32 to K-33
    procedure invocation options, A-19
    synchronization, 389
Linpack benchmark
    cluster history, L-63
    parallel processing debates, L-58
    vector processor example, 267–268
    VMIPS performance, G-17 to G-19
Linux operating systems
    Amazon Web Services, 456–457
    architecture costs, 2
    protection and ISA, 112
    RAID benchmarks, D-22, D-22 to D-23
    WSC services, 441
Liquid crystal display (LCD), Sanyo VPC-SX500 digital camera, E-19
LISP
    RISC history, L-20
    SPARC instructions, K-30
Lisp
    ILP, 215
    as MapReduce inspiration, 437
Literal addressing mode, basic considerations, A-10 to A-11
Little Endian
    Intel 80x86, K-49
    interconnection networks, F-12
    memory address interpretation, A-7
    MIPS core extensions, K-20 to K-21
    MIPS data transfers, A-34
Little's law
    definition, D-24 to D-25
    server utilization calculation, D-29
Livelock, network routing, F-44
Liveness, control dependence, 156
Livermore Fortran kernels, performance, 331, L-6
LMD, see Load memory data (LMD)
Load instructions
    control dependences, 155
    data hazards requiring stalls, C-20
    dynamic scheduling, 177
    ILP, 199, 201
    loop-level parallelism, 318
    memory port conflict, C-14
    pipelined cache access, 82
    RISC instruction set, C-4 to C-5
    Tomasulo's algorithm, 182
    VLIW sample code, 252
Load interlocks
    definition, C-37 to C-39
    detection logic, C-39
Load linked
    locks via coherence, 391
    synchronization, 388–389
Load locked, synchronization, 388–389
Load memory data (LMD), simple MIPS implementation, C-32 to C-33
Load stalls, MIPS R4000 pipeline, C-67
Load-store instruction set architecture
    basic concept, C-4 to C-5
    IBM 360, K-87
    Intel Core i7, 124
    Intel 80x86 operations, K-62
    as ISA, 11
    ISA classification, A-5
    MIPS nonaligned data transfers, K-24, K-26
    MIPS operations, A-35 to A-36, A-36
    PowerPC, K-33
    RISC history, L-19
    simple MIPS implementation, C-32
    VMIPS, 265
Load/store unit
    Fermi GPU, 305
    ILP hardware model, 215
    multiple lanes, 273
    Tomasulo's algorithm, 171–173, 182, 197
    vector units, 265, 276–277
Load upper immediate (LUI), MIPS operations, A-37
Local address space, segmented virtual memory, B-52
Local area networks (LANs)
    characteristics, F-4
    cross-company interoperability, F-64
    effective bandwidth, F-18
    Ethernet as, F-77 to F-79
    fault tolerance calculations, F-68
    historical overview, F-99 to F-100
    InfiniBand, F-74
    interconnection network domain relationship, F-4
    latency and effective bandwidth, F-26 to F-28
    offload engines, F-8
    packet latency, F-13, F-14 to F-16
    routers/gateways, F-79
    shared-media networks, F-23
    storage area network history, F-102 to F-103



    switches, F-29
    TCP/IP reliance, F-95
    time of flight, F-13
    topology, F-30
Locality, see Principle of locality
Local Memory
    centralized shared-memory architectures, 351
    definition, 292, 314
    distributed shared-memory, 379
    Fermi GPU, 306
    Grid mapping, 293
    multiprocessor architecture, 348
    NVIDIA GPU Memory structures, 304, 304–305
    SIMD, 315
    symmetric shared-memory multiprocessors, 363–364
Local miss rate, definition, B-31
Local node, directory-based cache coherence protocol basics, 382
Local optimizations, compilers, A-26
Local predictors, tournament predictors, 164–166
Local scheduling, ILP, VLIW processor, 194–195
Locks
    via coherence, 389–391
    hardware primitives, 387
    large-scale multiprocessor synchronization, I-18 to I-21
    multiprocessor software development, 409
Lock-up free cache, 83
Logical units, D-34
    storage systems, D-34 to D-35
Logical volumes, D-34
Long displacement addressing, VAX, K-67
Long-haul networks, see Wide area networks (WANs)
Long Instruction Word (LIW)
    EPIC, L-32
    multiple-issue processors, L-28, L-30
Long integer
    operand sizes/types, 12
    SPEC benchmarks, A-14
Loop-carried dependences
    CUDA, 290
    definition, 315–316
    dependence distance, H-6
    dependent computation elimination, 321
    example calculations, H-4 to H-5
    GCD, 319
    loop-level parallelism, H-3
    as recurrence, 318
    recurrence form, H-5
    VMIPS, 268
Loop exit predictor, Intel Core i7, 166
Loop interchange, compiler optimizations, 88–89
Loop-level parallelism
    definition, 149–150
    detection and enhancement
        basic approach, 315–318
        dependence analysis, H-6 to H-10
        dependence computation elimination, 321–322
        dependences, locating, 318–321
        dependent computation elimination, H-10 to H-12
        overview, H-2 to H-6
    history, L-30 to L-31
    ILP in perfect processor, 215
    ILP for realizable processors, 217–218
Loop stream detection, Intel Core i7 micro-op buffer, 238
Loop unrolling
    basic considerations, 161–162
    ILP exposure, 157–161
    ILP limitation studies, 220
    recurrences, H-12
    software pipelining, H-12 to H-15, H-13, H-15
    Tomasulo's algorithm, 179, 181–183
    VLIW processors, 195
Lossless networks
    definition, F-11 to F-12
    switch buffer organizations, F-59
Lossy networks, definition, F-11 to F-12
LRU, see Least recently used (LRU)
Lucas
    compiler optimizations, A-29
    data cache misses, B-10
LUI, see Load upper immediate (LUI)
LU kernel
    characteristics, I-8
    distributed-memory multiprocessor, I-32
    symmetric shared-memory multiprocessors, I-22, I-23, I-25

M
MAC, see Multiply-accumulate (MAC)
Machine language programmer, L-17 to L-18
Machine memory, Virtual Machines, 110
Macro-op fusion, Intel Core i7, 237–238
Magnetic storage
    access time, D-3
    cost vs. access time, D-3
    historical background, L-77 to L-79
Mail servers, benchmarking, D-20
Main Memory
    addressing modes, A-10
    address translation, B-46
    arithmetic intensity example, 286, 286–288
    block placement, B-44
    cache function, B-2
    cache optimization, B-30, B-36
    coherence protocol, 362
    definition, 292, 309
    DRAM, 17
    gather-scatter, 329
    GPU vs. MIMD, 327
    GPUs and coprocessors, 330
    GPU threads, 332
    ILP considerations, 245
    interlane wiring, 273
    linear speedups, 407
    memory hierarchy basics, 76
    memory hierarchy design, 72
    memory mapping, B-42
    MIPS operations, A-36
    Multimedia SIMD vs. GPUs, 312
    multiprocessor cache coherence, 352
    paging vs. segmentation, B-43
    partitioning, B-50



Main Memory (continued)
    processor performance calculations, 218–219
    RISC code size, A-23
    server energy efficiency, 462
    symmetric shared-memory multiprocessors, 363
    vector processor, G-25
    vs. virtual memory, B-3, B-41
    virtual memory block identification, B-44 to B-45
    virtual memory writes, B-45 to B-46
    VLIW, 196
    write-back, B-11
    write process, B-45
Manufacturing cost
    chip fabrication case study, 61–62
    cost trends, 27
    modern processors, 62
    vs. operation cost, 33
MapReduce
    cloud computing, 455
    cost calculations, 458–460, 459
    Google usage, 437
    reductions, 321
    WSC batch processing, 437–438
    WSC cost-performance, 474
Mark-I, L-3 to L-4, L-6
Mark-II, L-4
Mark-III, L-4
Mark-IV, L-4
Mask Registers
    basic operation, 275–276
    definition, 309
    Multimedia SIMD, 283
    NVIDIA GPU computational structures, 291
    vector compilers, 303
    vector vs. GPU, 311
    VMIPS, 267
MasPar, L-44
Massively parallel processors (MPPs)
    characteristics, I-45
    cluster history, L-62, L-72 to L-73
    system area network history, F-100 to F-101
Matrix300 kernel
    definition, 56
    prediction buffer, C-29
Matrix multiplication
    benchmarks, 56
    LU kernel, I-8
    multidimensional arrays in vector architectures, 278
Mauchly, John, L-2 to L-3, L-5, L-19
Maximum transfer unit, network interfaces, F-7 to F-8
Maximum vector length (MVL)
    Multimedia SIMD extensions, 282
    vector vs. GPU, 311
    VLRs, 274–275
M-bus, see Memory bus (M-bus)
McCreight, Ed, F-99
MCF
    compiler optimizations, A-29
    data cache misses, B-10
    Intel Core i7, 240–241
MCP operating system, L-16
Mean time between failures (MTBF)
    fallacies, 56–57
    RAID, L-79
    SLA states, 34
Mean time to failure (MTTF)
    computer system power consumption case study, 63–64
    dependability benchmarks, D-21
    disk arrays, D-6
    example calculations, 34–35
    I/O subsystem design, D-59 to D-61
    RAID reconstruction, D-55 to D-57
    SLA states, 34
    TB-80 cluster, D-40 to D-41
    WSCs vs. servers, 434
Mean time to repair (MTTR)
    dependability benchmarks, D-21
    disk arrays, D-6
    RAID 6, D-8 to D-9
    RAID reconstruction, D-56
Mean time until data loss (MTDL), RAID reconstruction, D-55 to D-57
Media, interconnection networks, F-9 to F-12
Media extensions, DSPs, E-10 to E-11
Mellanox MHEA28-XT, F-76
Memory access
    ARM Cortex-A8 example, 117
    basic MIPS pipeline, C-36
    vs. block size, B-28
    cache hit calculation, B-5 to B-6
    Cray Research T3D, F-87
    data hazards requiring stalls, C-19 to C-21
    data hazard stall minimization, C-17, C-19
    distributed-memory multiprocessor, I-32
    exception stopping/restarting, C-46
    hazards and forwarding, C-56 to C-57
    instruction set complications, C-49
    integrated instruction fetch units, 208
    MIPS data transfers, A-34
    MIPS exceptions, C-48 to C-49
    MIPS pipeline control, C-37 to C-39
    MIPS R4000, C-65
    multimedia instruction compiler support, A-31
    pipeline branch issues, C-40, C-42
    RISC classic pipeline, C-7, C-10
    shared-memory workloads, 372
    simple MIPS implementation, C-32 to C-33
    simple RISC implementation, C-6
    structural hazards, C-13 to C-14
    vector architectures, G-10
Memory addressing
    ALU immediate operands, A-12
    basic considerations, A-11 to A-13
    compiler-based speculation, H-32
    displacement values, A-12
    immediate value distribution, A-13
    interpretation, A-7 to A-8
    ISA, 11
    vector architectures, G-10
Memory banks, see also Banked memory
    gather-scatter, 280
    multiprocessor architecture, 347
    parallelism, 45
    shared-memory multiprocessors, 363
    strides, 279
    vector load/store unit bandwidth, 276–277
    vector systems, G-9 to G-11
Memory bus (M-bus)
    definition, 351



    Google WSC servers, 469
    interconnection networks, F-88
Memory consistency
    basic considerations, 392–393
    cache coherence, 352
    compiler optimization, 396
    development of models, L-64
    directory-based cache coherence protocol basics, 382
    multiprocessor cache coherency, 353
    relaxed consistency models, 394–395
    single-chip multicore processor case study, 412–418
    speculation to hide latency, 396–397
Memory-constrained scaling, scientific applications on parallel processors, I-33
Memory hierarchy
    address space, B-57 to B-58
    basic questions, B-6 to B-12
    block identification, B-7 to B-9
    block placement issues, B-7
    block replacement, B-9 to B-10
    cache optimization
        basic categories, B-22
        basic optimizations, B-40
        hit time reduction, B-36 to B-40
        miss categories, B-23 to B-26
        miss penalty reduction via multilevel caches, B-30 to B-35
        read misses vs. writes, B-35 to B-36
        miss rate reduction
            via associativity, B-28 to B-30
            via block size, B-26 to B-28
            via cache size, B-28
        pipelined cache access, 82
    cache performance, B-3 to B-6
        average memory access time, B-17 to B-20
        basic considerations, B-16
        basic equations, B-22
        example calculation, B-16
        out-of-order processors, B-20 to B-22
    case studies, B-60 to B-67
    development, L-9 to L-12
    inclusion, 397–398
    interconnection network protection, F-87 to F-88
    levels in slow down, B-3
    Opteron data cache example, B-12 to B-15, B-13
    Opteron L1/L2, B-57
    OS and page size, B-58
    overview, B-39
    Pentium vs. Opteron protection, B-57
    processor examples, B-3
    process protection, B-50
    terminology, B-2 to B-3
    virtual memory
        basic considerations, B-40 to B-44, B-48 to B-49
        basic questions, B-44 to B-46
        fast address translation, B-46
        overview, B-48
        paged example, B-54 to B-57
        page size selection, B-46 to B-47
        segmented example, B-51 to B-54
    write strategy, B-10 to B-12
    WSCs, 443, 443–446, 444
Memory hierarchy design
    access times, 77
    Alpha 21264 floorplan, 143
    ARM Cortex-A8 example, 114–117, 115–117
    cache coherency, 112–113
    cache optimization
        case study, 131–133
        compiler-controlled prefetching, 92–95
        compiler optimizations, 87–90
        critical word first, 86–87
        energy consumption, 81
        hardware instruction prefetching, 91–92, 92
        multibanked caches, 85–86, 86
        nonblocking caches, 83–85, 84
        overview, 78–79
        pipelined cache access, 82
        techniques overview, 96
        way prediction, 81–82
        write buffer merging, 87, 88
    cache performance prediction, 125–126
    cache size and misses per instruction, 126
    DDR2 SDRAM timing diagram, 139
    highly parallel memory systems, 133–136
    high memory bandwidth, 126
    instruction miss benchmarks, 127
    instruction simulation, 126
    Intel Core i7, 117–124, 119, 123–125
    Intel Core i7 three-level cache hierarchy, 118
    Intel Core i7 TLB structure, 118
    Intel 80x86 virtualization issues, 128
    memory basics, 74–78
    overview, 72–74
    protection and ISA, 112
    server vs. PMD, 72
    system call virtualization/paravirtualization performance, 141
    virtual machine monitor, 108–109
    Virtual Machines ISA support, 109–110
    Virtual Machines protection, 107–108
    Virtual Machines and virtual memory and I/O, 110–111
    virtual memory protection, 105–107
    VMM on nonvirtualizable ISA, 128–129
    Xen VM example, 111
Memory Interface Unit
    NVIDIA GPU ISA, 300
    vector processor example, 310
Memoryless, definition, D-28
Memory mapping
    memory hierarchy, B-48 to B-49
    segmented virtual memory, B-52
    TLBs, 323
    virtual memory definition, B-42
Memory-memory instruction set architecture, ISA classification, A-3, A-5
Memory protection
    control dependence, 155
    Pentium vs. Opteron, B-57
    processes, B-50

Page 44: bjpcjp.github.io · Index I-3 AMD Barcelona microprocessor, Google WSC server, 467 AMD Fusion, L-52 AMD K-5, L-30 AMD Opteron address translation, B-38 Amazon Web Services, 457 architecture,

I-46 ■ Index

Memory protection (continued )safe calls, B-54segmented virtual memory

example, B-51 to B-54virtual memory, B-41

Memory stall cycles
  average memory access time, B-17
  definition, B-4 to B-5
  miss rate calculation, B-6
  out-of-order processors, B-20 to B-21
  performance equations, B-22

Memory system
  cache optimization, B-36
  coherency, 352–353
  commercial workloads, 367, 369–371
  computer architecture, 15
  C program evaluation, 134–135
  dependability enhancement, 104–105
  distributed shared-memory, 379, 418
  gather-scatter, 280
  GDRAMs, 323
  GPUs, 332
  ILP, 245
    hardware vs. software speculation, 221–222
    speculative execution, 222–223
  Intel Core i7, 237, 242
  latency, B-21
  MIPS, C-33
  multiprocessor architecture, 347
  multiprocessor cache coherence, 352
  multiprogramming workload, 377–378
  page size changes, B-58
  price/performance/power considerations, 53
  RISC, C-7
  Roofline model, 286
  shared-memory multiprocessors, 363
  SMT, 399–400
  stride handling, 279
  T1 multithreading unicore performance, 227
  vector architectures, G-9 to G-11
  vector chaining, G-11
  vector processors, 271, 277
  virtual, B-43, B-46
Memory technology basics
  DRAM, 98, 98–100, 99
  DRAM and DIMM characteristics, 101
  DRAM performance, 100–102
  Flash memory, 102–104
  overview, 96–97
  performance trends, 20
  SDRAM power consumption, 102, 103
  SRAM, 97–98

Mesh interface unit (MIU), Intel SCCC, F-70
Mesh network
  characteristics, F-73
  deadlock, F-47
  dimension-order routing, F-47 to F-48
  OCN history, F-104
  routing example, F-46
Mesh topology
  characteristics, F-36
  direct networks, F-37
  NEWS communication, F-42 to F-43
MESI, see Modified-Exclusive-Shared-Invalid (MESI) protocol

Message ID, packet header, F-8, F-16
Message-passing communication
  historical background, L-60 to L-61
  large-scale multiprocessors, I-5 to I-6
Message Passing Interface (MPI)
  function, F-8
  InfiniBand, F-77
  lack in shared-memory multiprocessors, I-5
Messages
  adaptive routing, F-93 to F-94
  coherence maintenance, 381
  InfiniBand, F-76
  interconnection networks, F-6 to F-9
  zero-copy protocols, F-91

MFLOPS, see Millions of floating-point operations per second (MFLOPS)

Microarchitecture
  as architecture component, 15–16
  ARM Cortex-A8, 241
  Cray X1, G-21 to G-22
  data hazards, 168
  ILP exploitation, 197
  Intel Core i7, 236–237
  Nehalem, 411
  OCNs, F-3
  out-of-order example, 253
  PTX vs. x86, 298
  switches, see Switch microarchitecture
  techniques case study, 247–254
Microbenchmarks
  disk array deconstruction, D-51 to D-55
  disk deconstruction, D-48 to D-51
Microfusion, Intel Core i7 micro-op buffer, 238
Microinstructions
  complications, C-50 to C-51
  x86, 298
Micro-ops
  Intel Core i7, 237, 238–240, 239
  processor clock rates, 244

Microprocessor overview
  clock rate trends, 24
  cost trends, 27–28
  desktop computers, 6
  embedded computers, 8–9
  energy and power, 23–26
  inside disks, D-4
  integrated circuit improvements, 2
  and Moore's law, 3–4
  performance trends, 19–20, 20
  power and energy system trends, 21–23
  recent advances, L-33 to L-34
  technology trends, 18

Microprocessor without Interlocked Pipeline Stages, see MIPS (Microprocessor without Interlocked Pipeline Stages)

Microsoft
  cloud computing, 455
  containers, L-74
  Intel support, 245
  WSCs, 464–465
Microsoft Azure, 456, L-74
Microsoft DirectX, L-51 to L-52
Microsoft Windows
  benchmarks, 38
  multithreading, 223
  RAID benchmarks, D-22, D-22 to D-23
  time/volume/commoditization impact, 28
  WSC workloads, 441

Microsoft Windows 2008 Server
  real-world considerations, 52–55
  SPECpower benchmark, 463
Microsoft XBox, L-51
Migration, cache coherent multiprocessors, 354
Millions of floating-point operations per second (MFLOPS)
  early performance measures, L-7
  parallel processing debates, L-57 to L-58
  SIMD computer history, L-55
  SIMD supercomputer development, L-43
  vector performance measures, G-15 to G-16
MIMD (Multiple Instruction Streams, Multiple Data Streams)
  and Amdahl's law, 406–407
  definition, 10
  early computers, L-56
  first vector computers, L-46, L-48
  GPU programming, 289
  GPUs vs. vector architectures, 310
  with Multimedia SIMD, vs. GPU, 324–330
  multiprocessor architecture, 346–348
  speedup via parallelism, 263
  TLP, basic considerations, 344–345

Minicomputers, replacement by microprocessors, 3–4

Minniespec benchmarks
  ARM Cortex-A8, 116, 235
  ARM Cortex-A8 memory, 115–116
MINs, see Multistage interconnection networks (MINs)
MIPS (Microprocessor without Interlocked Pipeline Stages)

  addressing modes, 11–12
  basic pipeline, C-34 to C-36
  branch predictor correlation, 163
  cache performance, B-6
  conditional branches, K-11
  conditional instructions, H-27
  control flow instructions, 14
  data dependences, 151
  data hazards, 169
  dynamic scheduling with Tomasulo's algorithm, 171, 173
  early pipelined CPUs, L-26
  embedded systems, E-15
  encoding, 14
  exceptions, C-48, C-48 to C-49
  exception stopping/restarting, C-46 to C-47
  features, K-44
  FP pipeline performance, C-60 to C-61, C-62
  FP unit with Tomasulo's algorithm, 173
  hazard checks, C-71
  ILP, 149
  ILP exposure, 157–158
  ILP hardware model, 215
  instruction execution issues, K-81
  instruction formats, core instructions, K-6
  instruction set complications, C-49 to C-51
  ISA class, 11
  ISA example
    addressing modes for data transfer, A-34
    arithmetic/logical instructions, A-37
    basic considerations, A-32 to A-33
    control flow instructions, A-37 to A-38, A-38
    data types, A-34
    dynamic instruction mix, A-41, A-41 to A-42, A-42
    FP operations, A-38 to A-39
    instruction format, A-35
    load-store instructions, A-36
    MIPS operations, A-35 to A-37
    registers, A-34
    usage, A-39
  Livermore Fortran kernel performance, 331
  memory addressing, 11
  multicycle operations
    basic considerations, C-51 to C-54
    hazards and forwarding, C-54 to C-58
    precise exceptions, C-58 to C-60
  multimedia support, K-19
  multiple-issue processor history, L-29
  operands, 12
  performance measurement history, L-6 to L-7
  pipeline branch issues, C-39 to C-42
  pipeline control, C-36 to C-39
  pipe stage, C-37
  processor performance calculations, 218–219
  registers and usage conventions, 12
  RISC code size, A-23
  RISC history, L-19
  RISC instruction set lineage, K-43
  as RISC systems, K-4
  scoreboard components, C-76
  scoreboarding, C-72
  scoreboarding steps, C-73, C-73 to C-74
  simple implementation, C-31 to C-34, C-34
  Sony PlayStation 2 Emotion Engine, E-17
  unaligned word read instructions, K-26
  unpipelined functional units, C-52
  vs. VAX, K-65 to K-66, K-75, K-82
  write strategy, B-10

MIPS16
  addressing modes, K-6
  arithmetic/logical instructions, K-24
  characteristics, K-4
  constant extension, K-9
  data transfer instructions, K-23
  embedded instruction format, K-8
  instructions, K-14 to K-16
  multiply-accumulate, K-20
  RISC code size, A-23
  unique instructions, K-40 to K-42

MIPS32, vs. VAX sort, K-80
MIPS64
  addressing modes, K-5
  arithmetic/logical instructions, K-11
  conditional branches, K-17
  constant extension, K-9
  conventions, K-13
  data transfer instructions, K-10
  FP instructions, K-23
  instruction list, K-26 to K-27
  instruction set architecture formats, 14
  instruction subset, 13, A-40
  in MIPS R4000, C-61
  nonaligned data transfers, K-24 to K-26
  RISC instruction set, C-4

MIPS2000, instruction benchmarks, K-82
MIPS 3010, chip layout, J-59
MIPS core
  compare and conditional branch, K-9 to K-16
  equivalent RISC instructions
    arithmetic/logical, K-11
    arithmetic/logical instructions, K-15
    common extensions, K-19 to K-24
    control instructions, K-12, K-16
    conventions, K-16
    data transfers, K-10
    embedded RISC data transfers, K-14
    FP instructions, K-13
  instruction formats, K-9
MIPS M2000, L-21, L-21
MIPS MDMX
  characteristics, K-18
  multimedia support, K-18
MIPS R2000, L-20
MIPS R3000
  integer arithmetic, J-12
  integer overflow, J-11
MIPS R3010
  arithmetic functions, J-58 to J-61
  chip comparison, J-58
  floating-point exceptions, J-35
MIPS R4000
  early pipelined CPUs, L-27
  FP pipeline, C-65 to C-67, C-66
  integer pipeline, C-63
  pipeline overview, C-61 to C-65
  pipeline performance, C-67 to C-70
  pipeline structure, C-62 to C-63
MIPS R8000, precise exceptions, C-59
MIPS R10000, 81
  latency hiding, 397
  precise exceptions, C-59

Misalignment, memory address interpretation, A-7 to A-8, A-8

MISD, see Multiple Instruction Streams, Single Data Stream

Misprediction rate
  branch-prediction buffers, C-29
  predictors on SPEC89, 166
  profile-based predictor, C-27
  SPECCPU2006 benchmarks, 167
Mispredictions
  ARM Cortex-A8, 232, 235
  branch predictors, 164–167, 240, C-28
  branch-target buffers, 205
  hardware-based speculation, 190
  hardware vs. software speculation, 221
  integer vs. FP programs, 212
  Intel Core i7, 237
  prediction buffers, C-29
  static branch prediction, C-26 to C-27
Misses per instruction
  application/OS statistics, B-59
  cache performance, B-5 to B-6
  cache protocols, 359
  cache size effect, 126
  L3 cache block size, 371
  memory hierarchy basics, 75
  performance impact calculations, B-18
  shared-memory workloads, 372
  SPEC benchmarks, 127
  strided access-TLB interactions, 323
Miss penalty
  average memory access time, B-16 to B-17
  cache optimization, 79, B-35 to B-36
  cache performance, B-4, B-21
  compiler-controlled prefetching, 92–95
  critical word first, 86–87
  hardware prefetching, 91–92
  ILP speculative execution, 223
  memory hierarchy basics, 75–76
  nonblocking cache, 83
  out-of-order processors, B-20 to B-22
  processor performance calculations, 218–219
  reduction via multilevel caches, B-30 to B-35
  write buffer merging, 87
Miss rate
  AMD Opteron data cache, B-15
  ARM Cortex-A8, 116
  average memory access time, B-16 to B-17, B-29
  basic categories, B-23
  vs. block size, B-27
  cache optimization, 79
    and associativity, B-28 to B-30
    and block size, B-26 to B-28
    and cache size, B-28
  cache performance, B-4
  and cache size, B-24 to B-25
  compiler-controlled prefetching, 92–95
  compiler optimizations, 87–90
  early IBM computers, L-10 to L-11
  example calculations, B-6, B-31 to B-32
  hardware prefetching, 91–92
  Intel Core i7, 123, 125, 241
  memory hierarchy basics, 75–76
  multilevel caches, B-33
  processor performance calculations, 218–219
  scientific workloads
    distributed-memory multiprocessors, I-28 to I-30
    symmetric shared-memory multiprocessors, I-22, I-23 to I-25
  shared-memory multiprogramming workload, 376, 376–377
  shared-memory workload, 370–373
  single vs. multiple thread executions, 228
  Sun T1 multithreading unicore performance, 228
  vs. virtual addressed cache size, B-37


MIT Raw, characteristics, F-73
Mitsubishi M32R
  addressing modes, K-6
  arithmetic/logical instructions, K-24
  characteristics, K-4
  condition codes, K-14
  constant extension, K-9
  data transfer instructions, K-23
  embedded instruction format, K-8
  multiply-accumulate, K-20
  unique instructions, K-39 to K-40
MIU, see Mesh interface unit (MIU)
Mixed cache
  AMD Opteron example, B-15
  commercial workload, 373
Mixer, radio receiver, E-23
Miya, Eugene, L-65
M/M/1 model
  example, D-32, D-32 to D-33
  overview, D-30
  RAID performance prediction, D-57
  sample calculations, D-33

M/M/2 model, RAID performance prediction, D-57

MMX, see Multimedia Extensions (MMX)

Mobile clients
  data usage, 3
  GPU features, 324
  vs. server GPUs, 323–330

Modified-Exclusive-Shared-Invalid (MESI) protocol, characteristics, 362

Modified-Owned-Exclusive-Shared-Invalid (MOESI) protocol, characteristics, 362

Modified state
  coherence protocol, 362
  directory-based cache coherence protocol basics, 380
  large-scale multiprocessor cache coherence, I-35
  snooping coherence protocol, 358–359
Modula-3, integer division/remainder, J-12
Module availability, definition, 34
Module reliability, definition, 34
MOESI, see Modified-Owned-Exclusive-Shared-Invalid (MOESI) protocol
Moore's law
  DRAM, 100
  flawed architectures, A-45
  interconnection networks, F-70
  and microprocessor dominance, 3–4
  point-to-point links and switches, D-34
  RISC, A-3
  RISC history, L-22
  software importance, 55
  switch size, F-29
  technology trends, 17

Mortar shot graphs, multiprocessor performance measurement, 405–406

Motion JPEG encoder, Sanyo VPC-SX500 digital camera, E-19

Motorola 68000
  characteristics, K-42
  memory protection, L-10

Motorola 68882, floating-point precisions, J-33

Move address, VAX, K-70
MPEG
  Multimedia SIMD Extensions history, L-49
  multimedia support, K-17
  Sanyo VPC-SX500 digital camera, E-19
  Sony PlayStation 2 Emotion Engine, E-17
MPI, see Message Passing Interface (MPI)
MPPs, see Massively parallel processors (MPPs)
MSP, see Multi-Streaming Processor (MSP)
MTBF, see Mean time between failures (MTBF)
MTDL, see Mean time until data loss (MTDL)
MTTF, see Mean time to failure (MTTF)
MTTR, see Mean time to repair (MTTR)
Multibanked caches
  cache optimization, 85–86
  example, 86

Multichip modules, OCNs, F-3
Multicomputers
  cluster history, L-63
  definition, 345, L-59
  historical background, L-64 to L-65

Multicore processors
  architecture goals/requirements, 15
  cache coherence, 361–362
  centralized shared-memory multiprocessor structure, 347
  Cray X1E, G-24
  directory-based cache coherence, 380
  directory-based coherence, 381, 419
  DSM architecture, 348, 379
  multichip
    cache and memory states, 419
    with DSM, 419
  multiprocessors, 345
  OCN history, F-104
  performance, 400–401, 401
  performance gains, 398–400
  performance milestones, 20
  single-chip case study, 412–418
  and SMT, 404–405
  snooping cache coherence implementation, 365
  SPEC benchmarks, 402
  uniform memory access, 364
  write invalidate protocol implementation, 356–357

Multics protection software, L-9
Multicycle operations, MIPS pipeline
  basic considerations, C-51 to C-54
  hazards and forwarding, C-54 to C-58
  precise exceptions, C-58 to C-60
Multidimensional arrays
  dependences, 318
  in vector architectures, 278–279
Multiflow processor, L-30, L-32
Multigrid methods, Ocean application, I-9 to I-10
Multilevel caches
  cache optimizations, B-22
  centralized shared-memory architectures, 351
  memory hierarchy basics, 76
  memory hierarchy history, L-11
  miss penalty reduction, B-30 to B-35
  miss rate vs. cache size, B-33
  Multimedia SIMD vs. GPU, 312
  performance equations, B-22
  purpose, 397
  write process, B-11

Multilevel exclusion, definition, B-35
Multilevel inclusion
  definition, 397, B-34
  implementation, 397
  memory hierarchy history, L-11
Multimedia applications
  desktop processor support, E-11
  GPUs, 288
  ISA support, A-46
  MIPS FP operations, A-39
  vector architectures, 267
Multimedia Extensions (MMX)
  compiler support, A-31
  desktop RISCs, K-18
  desktop/server RISCs, K-16 to K-19
  SIMD history, 262, L-50
  vs. vector architectures, 282–283
Multimedia instructions
  ARM Cortex-A8, 236
  compiler support, A-31 to A-32
Multimedia SIMD Extensions
  basic considerations, 262, 282–284
  compiler support, A-31
  DLP, 322
  DSPs, E-11
  vs. GPUs, 312
  historical background, L-49 to L-50
  MIMD, vs. GPU, 324–330
  parallelism classes, 10
  programming, 285
  Roofline visual performance model, 285–288, 287
  256-bit-wide operations, 282
  vs. vector, 263–264

Multimedia user interfaces, PMDs, 6
Multimode fiber, interconnection networks, F-9
Multipass array multiplier, example, J-51
Multiple Instruction Streams, Multiple Data Streams, see MIMD (Multiple Instruction Streams, Multiple Data Streams)
Multiple Instruction Streams, Single Data Stream (MISD), definition, 10

Multiple-issue processors
  basic VLIW approach, 193–196
  with dynamic scheduling and speculation, 197–202
  early development, L-28 to L-30
  instruction fetch bandwidth, 202–203
  integrated instruction fetch units, 207
  loop unrolling, 162
  microarchitectural techniques case study, 247–254
  primary approaches, 194
  SMT, 224, 226
  with speculation, 198
  Tomasulo's algorithm, 183

Multiple lanes technique
  vector instruction set, 271–273
  vector performance, G-7 to G-9
  vector performance calculations, G-8
Multiple paths, ILP limitation studies, 220
Multiple-precision addition, J-13
Multiply-accumulate (MAC)
  DSP, E-5
  embedded RISCs, K-20
  TI TMS320C55 DSP, E-8
Multiply operations
  chip comparison, J-61
  floating point
    denormals, J-20 to J-21
    examples, J-19
    multiplication, J-17 to J-20
    precision, J-21
    rounding, J-18, J-19
  integer arithmetic
    array multiplier, J-50
    Booth recoding, J-49
    even/odd array, J-52
    issues, J-11
    with many adders, J-50 to J-54
    multipass array multiplier, J-51
    n-bit unsigned integers, J-4
    Radix-2, J-4 to J-7
    signed-digit addition table, J-54
    with single adder, J-47 to J-49, J-48
    Wallace tree, J-53
  integer shifting over zeros, J-45 to J-47
  PA-RISC instructions, K-34 to K-35
  unfinished instructions, 179

Multiprocessor basics
  architectural issues and approaches, 346–348
  architecture goals/requirements, 15
  architecture and software development, 407–409
  basic hardware primitives, 387–389
  cache coherence, 352–353
  coining of term, L-59
  communication calculations, 350
  computer categories, 10
  consistency models, 395
  definition, 345
  early machines, L-56
  embedded systems, E-14 to E-15
  fallacies, 55
  locks via coherence, 389–391
  low-to-high-end roles, 344–345
  parallel processing challenges, 349–351
  for performance gains, 398–400
  performance trends, 21
  point-to-point example, 413
  shared-memory, see Shared-memory multiprocessors
  SMP, 345, 350, 354–355, 363–364
  streaming Multiprocessor, 292, 307, 313–314
Multiprocessor history
  bus-based coherent multiprocessors, L-59 to L-60
  clusters, L-62 to L-64
  early computers, L-56
  large-scale multiprocessors, L-60 to L-61
  parallel processing debates, L-56 to L-58
  recent advances and developments, L-58 to L-60
  SIMD computers, L-55 to L-56
  synchronization and consistency models, L-64
  virtual memory, L-64


Multiprogramming
  definition, 345
  multithreading, 224
  performance, 36
  shared-memory workload performance, 375–378, 377
  shared-memory workloads, 374–375
  software optimization, 408
  virtual memory-based protection, 105–106, B-49
  workload execution time, 375

Multistage interconnection networks (MINs)
  bidirectional, F-33 to F-34
  crossbar switch calculations, F-31 to F-32
  vs. direct network costs, F-92
  example, F-31
  self-routing, F-48
  system area network history, F-100 to F-101
  topology, F-30 to F-31, F-38 to F-39
Multistage switch fabrics, topology, F-30
Multi-Streaming Processor (MSP)
  Cray X1, G-21 to G-23, G-22, G-23 to G-24
  Cray X1E, G-24
  first vector computers, L-46

Multithreaded SIMD Processor
  block diagram, 294
  definition, 292, 309, 313–314
  Fermi GPU architectural innovations, 305–308
  Fermi GPU block diagram, 307
  Fermi GTX 480 GPU floorplan, 295, 295–296
  GPU programming, 289–290
  GPUs vs. vector architectures, 310, 310–311
  Grid mapping, 293
  NVIDIA GPU computational structures, 291
  NVIDIA GPU Memory structures, 304, 304–305
  Roofline model, 326
Multithreaded vector processor
  definition, 292
  Fermi GPU comparison, 305

Multithreading
  coarse-grained, 224–226
  definition and types, 223–225
  fine-grained, 224–226
  GPU programming, 289
  historical background, L-34 to L-35
  ILP, 223–232
  memory hierarchy basics, 75–76
  parallel benchmarks, 231, 231–232
  for performance gains, 398–400
  SMT, see Simultaneous multithreading (SMT)
  Sun T1 effectiveness, 226–229
MVAPICH, F-77
MVL, see Maximum vector length (MVL)
MXP processor, components, E-14
Myrinet SAN, F-67
  characteristics, F-76
  cluster history, L-62 to L-63, L-73
  routing algorithms, F-48
  switch vs. NIC, F-86
  system area network history, F-100

N

NAK, see Negative acknowledge (NAK)
Name dependences
  ILP, 152–153
  locating dependences, 318–319
  loop-level parallelism, 315
  scoreboarding, C-79
  Tomasulo's algorithm, 171–172
Nameplate power rating, WSCs, 449
NaN (Not a Number), J-14, J-16, J-21, J-34
NAND Flash, definition, 103
NAS, see Network attached storage (NAS)
NAS Parallel Benchmarks
  InfiniBand, F-76
  vector processor history, G-28
National Science Foundation, WAN history, F-98
Natural parallelism
  embedded systems, E-15
  multiprocessor importance, 344
  multithreading, 223
n-bit adder, carry-lookahead, J-38
n-bit number representation, J-7 to J-10

n-bit unsigned integer division, J-4
N-body algorithms, Barnes application, I-8 to I-9
NBS DYSEAC, L-81
N-cube topology, characteristics, F-36
NEC Earth Simulator, peak performance, 58
NEC SX/2, L-45, L-47
NEC SX/5, L-46, L-48
NEC SX/6, L-46, L-48
NEC SX-8, L-46, L-48
NEC SX-9
  first vector computers, L-49
  Roofline model, 286–288, 287
NEC VR 4122, embedded benchmarks, E-13
Negative acknowledge (NAK)
  cache coherence, I-38 to I-39
  directory controller, I-40 to I-41
  DSM multiprocessor cache coherence, I-37
Negative condition code, MIPS core, K-9 to K-16
Negative-first routing, F-48
Nested page tables, 129
NetApp, see Network Appliance (NetApp)
Netflix, AWS, 460
Netscape, F-98
Network Appliance (NetApp)
  FAS6000 filer, D-41 to D-42
  NFS benchmarking, D-20
  RAID, D-9
  RAID row-diagonal parity, D-9

Network attached storage (NAS)
  block servers vs. filers, D-35
  WSCs, 442

Network bandwidth, interconnection network, F-18

Network-Based Computer Laboratory (Ohio State), F-76, F-77

Network buffers, network interfaces, F-7 to F-8

Network fabric, switched-media networks, F-24

Network File System (NFS)
  benchmarking, D-20, D-20
  block servers vs. filers, D-35
  interconnection networks, F-89
  server benchmarks, 40
  TCP/IP, F-81


Networking costs, WSC vs. datacenters, 455

Network injection bandwidth
  interconnection network, F-18
  multi-device interconnection networks, F-26
Network interface
  fault tolerance, F-67
  functions, F-6 to F-7
  message composition/processing, F-6 to F-9
Network interface card (NIC)
  functions, F-8
  Google WSC servers, 469
  vs. I/O subsystem, F-90 to F-91
  storage area network history, F-102
  vs. switches, F-85 to F-86, F-86
  zero-copy protocols, F-91
Network layer, definition, F-82
Network nodes
  direct network topology, F-37
  distributed switched networks, F-34 to F-36
Network on chip (NoC), characteristics, F-3
Network ports, interconnection network topology, F-29
Network protocol layer, interconnection networks, F-10
Network reception bandwidth, interconnection network, F-18
Network reconfiguration
  commercial interconnection networks, F-66
  fault tolerance, F-67
  switch vs. NIC, F-86
Network technology, see also Interconnection networks
  Google WSC, 469
  performance trends, 19–20
  personal computers, F-2
  trends, 18
  WSC bottleneck, 461
  WSC goals/requirements, 433

Network of Workstations, L-62, L-73
NEWS communication, see North-East-West-South communication
Newton's iteration, J-27 to J-30
NFS, see Network File System (NFS)
NIC, see Network interface card (NIC)
Nicely, Thomas, J-64
NMOS, DRAM, 99
NoC, see Network on chip (NoC)
Nodes
  coherence maintenance, 381
  communication bandwidth, I-3
  direct network topology, F-37
  directory-based cache coherence, 380
  distributed switched networks, F-34 to F-36
  IBM Blue Gene/L, I-42 to I-44
  IBM Blue Gene/L 3D torus network, F-73
  network topology performance and costs, F-40
  in parallel, 336
  points-to analysis, H-9

Nokia cell phone, circuit board, E-24
Nonaligned data transfers, MIPS64, K-24 to K-26
Nonatomic operations
  cache coherence, 361
  directory protocol, 386
Nonbinding prefetch, cache optimization, 93
Nonblocking caches
  cache optimization, 83–85, 131–133
  effectiveness, 84
  ILP speculative execution, 222–223
  Intel Core i7, 118
  memory hierarchy history, L-11
Nonblocking crossbar, centralized switched networks, F-32 to F-33
Nonfaulting prefetches, cache optimization, 92
Nonrestoring division, J-5, J-6
Nonuniform memory access (NUMA)
  DSM as, 348
  large-scale multiprocessor history, L-61
  snooping limitations, 363–364

Non-unit strides
  multidimensional arrays in vector architectures, 278–279
  vector processor, 310, 310–311, G-25
North-East-West-South communication, network topology calculations, F-41 to F-43
North-last routing, F-48
Not a Number (NaN), J-14, J-16, J-21, J-34
Notifications, interconnection networks, F-10
NOW project, L-73
No-write allocate
  definition, B-11
  example calculation, B-12
NSFNET, F-98
NTSC/PAL encoder, Sanyo VPC-SX500 digital camera, E-19
Nullification, PA-RISC instructions, K-33 to K-34
Nullifying branch, branch delay slots, C-24 to C-25
NUMA, see Nonuniform memory access (NUMA)

NVIDIA GeForce, L-51
NVIDIA systems
  fine-grained multithreading, 224
  GPU comparisons, 323–330, 325
  GPU computational structures, 291–297
  GPU computing history, L-52
  GPU ISA, 298–300
  GPU Memory structures, 304, 304–305
  GPU programming, 289
  graphics pipeline history, L-51
  scalable GPUs, L-51
  terminology, 313–315
N-way set associative
  block placement, B-7
  conflict misses, B-23
  memory hierarchy basics, 74
  TLBs, B-49
NYU Ultracomputer, L-60

O

Observed performance, fallacies, 57
Occupancy, communication bandwidth, I-3


Ocean application
  characteristics, I-9 to I-10
  distributed-memory multiprocessor, I-32
  distributed-memory multiprocessors, I-30
  example calculations, I-11 to I-12
  miss rates, I-28
  symmetric shared-memory multiprocessors, I-23
OCNs, see On-chip networks (OCNs)
Offline reconstruction, RAID, D-55
Offload engines
  network interfaces, F-8
  TCP/IP reliance, F-95

Offset
  addressing modes, 12
  AMD64 paged virtual memory, B-55
  block identification, B-7 to B-8
  cache optimization, B-38
  call gates, B-54
  control flow instructions, A-18
  directory-based cache coherence protocols, 381–382
  example, B-9
  gather-scatter, 280
  IA-32 segment, B-53
  instruction decode, C-5 to C-6
  main memory, B-44
  memory mapping, B-52
  MIPS, C-32
  MIPS control flow instructions, A-37 to A-38
  misaligned addresses, A-8
  Opteron data cache, B-13 to B-14
  pipelining, C-42
  PTX instructions, 300
  RISC, C-4 to C-6
  RISC instruction set, C-4
  TLB, B-46
  Tomasulo's approach, 176
  virtual memory, B-43 to B-44, B-49, B-55 to B-56
OLTP, see On-Line Transaction Processing (OLTP)
Omega
  example, F-31
  packet blocking, F-32
  topology, F-30

OMNETPP, Intel Core i7, 240–241
On-chip cache
  optimization, 79
  SRAM, 98–99
On-chip memory, embedded systems, E-4 to E-5
On-chip networks (OCNs)
  basic considerations, F-3
  commercial implementations, F-73
  commercial interconnection networks, F-63
  cross-company interoperability, F-64
  DOR, F-46
  effective bandwidth, F-18, F-28
  example system, F-70 to F-72
  historical overview, F-103 to F-104
  interconnection network domain relationship, F-4
  interconnection network speed, F-88
  latency and effective bandwidth, F-26 to F-28
  latency vs. nodes, F-27
  link bandwidth, F-89
  packet latency, F-13, F-14 to F-16
  switch microarchitecture, F-57
  time of flight, F-13
  topology, F-30
  wormhole switching, F-51

One's complement, J-7
One-way conflict misses, definition, B-23
Online reconstruction, RAID, D-55
On-Line Transaction Processing (OLTP)
  commercial workload, 369, 371
  server benchmarks, 41
  shared-memory workloads, 368–370, 373–374
  storage system benchmarks, D-18
OpenCL
  GPU programming, 289
  GPU terminology, 292, 313–315
  NVIDIA terminology, 291
  processor comparisons, 323
OpenGL, L-51
Open source software
  Amazon Web Services, 457
  WSCs, 437
  Xen VMM, see Xen virtual machine

Open Systems Interconnect (OSI)
  Ethernet, F-78 to F-79
  layer definitions, F-82
Operand addressing mode, Intel 80x86, K-59, K-59 to K-60
Operand delivery stage, Itanium 2, H-42
Operands
  DSP, E-6
  forwarding, C-19
  instruction set encoding, A-21 to A-22
  Intel 80x86, K-59
  ISA, 12
  ISA classification, A-3 to A-4
  MIPS data types, A-34
  MIPS pipeline, C-71
  MIPS pipeline FP operations, C-52 to C-53
  NVIDIA GPU ISA, 298
  per ALU instruction example, A-6
  TMS320C55 DSP, E-6
  type and size, A-13 to A-14
  VAX, K-66 to K-68, K-68
  vector execution time, 268–269

Operating systems (general)
  address translation, B-38
  and architecture development, 2
  communication performance, F-8
  disk access scheduling, D-44 to D-45, D-45
  memory protection performance, B-58
  miss statistics, B-59
  multiprocessor software development, 408
  and page size, B-58
  segmented virtual memory, B-54
  server benchmarks, 40
  shared-memory workloads, 374–378
  storage systems, D-35
Operational costs
  basic considerations, 33
  WSCs, 434, 438, 452, 456, 472
Operational expenditures (OPEX)
  WSC costs, 452–455, 454
  WSC TCO case study, 476–478
Operation faults, storage systems, D-11
Operator dependability, disks, D-13 to D-15


OPEX, see Operational expenditures (OPEX)

Optical media, interconnection networks, F-9

Oracle database
  commercial workload, 368
  miss statistics, B-59
  multithreading benchmarks, 232
  single-threaded benchmarks, 243
  WSC services, 441

Ordering, and deadlock, F-47
Organization
  buffer, switch microarchitecture, F-58 to F-60
  cache, performance impact, B-19
  cache blocks, B-7 to B-8
  cache optimization, B-19
  coherence extensions, 362
  computer architecture, 11, 15–16
  DRAM, 98
  MIPS pipeline, C-37
  multiple-issue processor, 197, 198
  Opteron data cache, B-12 to B-13, B-13
  pipelines, 152
  processor history, 2–3
  processor performance equation, 49
  shared-memory multiprocessors, 346
  Sony PlayStation Emotion Engine, E-18
  TLB, B-46

Orthogonality, compiler writing-architecture relationship, A-30

OSI, see Open Systems Interconnect (OSI)

Out-of-order completion
  data hazards, 169
  MIPS pipeline, C-71
  MIPS R10000 sequential consistency, 397
  precise exceptions, C-58
Out-of-order execution
  and cache miss, B-2 to B-3
  cache performance, B-21
  data hazards, 169–170
  hardware-based execution, 184
  ILP, 245
  memory hierarchy, B-2 to B-3
  microarchitectural techniques case study, 247–254
  MIPS pipeline, C-71
  miss penalty, B-20 to B-22
  performance milestones, 20
  power/DLP issues, 322
  processor comparisons, 323
  R10000, 397
  SMT, 246
  Tomasulo's algorithm, 183
Out-of-order processors
  DLP, 322
  Intel Core i7, 236
  memory hierarchy history, L-11
  multithreading, 226
  vector architecture, 267
Out-of-order write, dynamic scheduling, 171
Output buffered switch
  HOL blocking, F-60
  microarchitecture, F-57, F-57
  organizations, F-58 to F-59
  pipelined version, F-61
Output dependence
  compiler history, L-30 to L-31
  definition, 152–153
  dynamic scheduling, 169–171, C-72
  finding, H-7 to H-8
  loop-level parallelism calculations, 320
  MIPS scoreboarding, C-79
Overclocking
  microprocessors, 26
  processor performance equation, 52
Overflow, integer arithmetic, J-8, J-10 to J-11, J-11
Overflow condition code, MIPS core, K-9 to K-16
Overhead
  adaptive routing, F-93 to F-94
  Amdahl's law, F-91
  communication latency, I-4
  interconnection networks, F-88, F-91 to F-92
  OCNs vs. SANs, F-27
  vs. peak performance, 331
  shared-memory communication, I-5
  sorting case study, D-64 to D-67
  time of flight, F-14
  vector processor, G-4
Overlapping triplets
  historical background, J-63
  integer multiplication, J-49
Oversubscription
  array switch, 443
  Google WSC, 469
  WSC architecture, 441, 461

P

Packed decimal, definition, A-14
Packet discarding, congestion management, F-65
Packets
  ATM, F-79
  bidirectional rings, F-35 to F-36
  centralized switched networks, F-32
  effective bandwidth vs. packet size, F-19
  format example, F-7
  IBM Blue Gene/L 3D torus network, F-73
  InfiniBand, F-75, F-76
  interconnection networks, multi-device networks, F-25
  latency issues, F-12, F-13
  lossless vs. lossy networks, F-11 to F-12
  network interfaces, F-8 to F-9
  network routing, F-44
  routing/arbitration/switching impact, F-52
  switched network topology, F-40
  switching, F-51
  switch microarchitecture, F-57 to F-58
  switch microarchitecture pipelining, F-60 to F-61
  TI TMS320C6x DSP, E-10
  topology, F-21
  virtual channels and throughput, F-93
Packet transport, interconnection networks, F-9 to F-12
Page coloring, definition, B-38
Paged segments, characteristics, B-43 to B-44
Paged virtual memory
  Opteron example, B-54 to B-57
  protection, 106
  vs. segmented, B-43


Page faults
    cache optimization, A-46
    exceptions, C-43 to C-44
    hardware-based speculation, 188
    and memory hierarchy, B-3
    MIPS exceptions, C-48
    Multimedia SIMD Extensions, 284
    stopping/restarting execution, C-46
    virtual memory definition, B-42
    virtual memory miss, B-45
Page offset
    cache optimization, B-38
    main memory, B-44
    TLB, B-46
    virtual memory, B-43, B-49, B-55 to B-56
Pages
    definition, B-3
    vs. segments, B-43
    size selection, B-46 to B-47
    virtual memory definition, B-42 to B-43
    virtual memory fast address translation, B-46
Page size
    cache optimization, B-38
    definition, B-56
    memory hierarchy example, B-39, B-48
    and OS, B-58
    OS determination, B-58
    paged virtual memory, B-55
    selection, B-46 to B-47
    virtual memory, B-44
Page Table Entry (PTE)
    AMD64 paged virtual memory, B-56
    IA-32 equivalent, B-52
    Intel Core i7, 120
    main memory block, B-44 to B-45
    paged virtual memory, B-56
    TLB, B-47

Page tables
    address translation, B-46 to B-47
    AMD64 paged virtual memory, B-55 to B-56
    descriptor tables as, B-52
    IA-32 segment descriptors, B-53
    main memory block, B-44 to B-45
    multiprocessor software development, 407–409
    multithreading, 224
    protection process, B-50
    segmented virtual memory, B-51
    virtual memory block identification, B-44
    virtual-to-physical address mapping, B-45
Paired single operations, DSP media extensions, E-11
Palt, definition, B-3
Papadopolous, Greg, L-74
Parallelism
    cache optimization, 79
    challenges, 349–351
    classes, 9–10
    computer design principles, 44–45
    dependence analysis, H-8
    DLP, see Data-level parallelism (DLP)
    Ethernet, F-78
    exploitation statically, H-2
    exposing with hardware support, H-23 to H-27
    global code scheduling, H-15 to H-23, H-16
    IA-64 instruction format, H-34 to H-35
    ILP, see Instruction-level parallelism (ILP)
    loop-level, 149–150, 215, 217–218, 315–322
    MIPS scoreboarding, C-77 to C-78
    multiprocessors, 345
    natural, 223, 344
    request-level, 4–5, 9, 345, 434
    RISC development, 2
    software pipelining, H-12 to H-15
    for speedup, 263
    superblock scheduling, H-21 to H-23, H-22
    task-level, 9
    TLP, see Thread-level parallelism (TLP)
    trace scheduling, H-19 to H-21, H-20
    vs. window size, 217
    WSCs vs. servers, 433–434

Parallel processors
    areas of debate, L-56 to L-58
    bus-based coherent multiprocessor history, L-59 to L-60
    cluster history, L-62 to L-64
    early computers, L-56
    large-scale multiprocessor history, L-60 to L-61
    recent advances and developments, L-58 to L-60
    scientific applications, I-33 to I-34
    SIMD computer history, L-55 to L-56
    synchronization and consistency models, L-64
    virtual memory history, L-64

Parallel programming
    computation communication, I-10 to I-12
    with large-scale multiprocessors, I-2
Parallel Thread Execution (PTX)
    basic GPU thread instructions, 299
    GPU conditional branching, 300–303
    GPUs vs. vector architectures, 308
    NVIDIA GPU ISA, 298–300
    NVIDIA GPU Memory structures, 305
Parallel Thread Execution (PTX) Instruction
    CUDA Thread, 300
    definition, 292, 309, 313
    GPU conditional branching, 302–303
    GPU terms, 308
    NVIDIA GPU ISA, 298, 300

Paravirtualization
    system call performance, 141
    Xen VM, 111
Parity
    dirty bits, D-61 to D-64
    fault detection, 58
    memory dependability, 104–105
    WSC memory, 473–474
PARSEC benchmarks
    Intel Core i7, 401–405
    SMT on superscalar processors, 230–232, 231
    speedup without SMT, 403–404

Partial disk failure, dirty bits, D-61 to D-64

Partial store order, relaxed consistency models, 395

Partitioned add operation, DSP media extensions, E-10

Partitioning
    Multimedia SIMD Extensions, 282
    virtual memory protection, B-50
    WSC memory hierarchy, 445


Pascal programs
    compiler types and classes, A-28
    integer division/remainder, J-12
Pattern, disk array deconstruction, D-51
Payload
    messages, F-6
    packet format, F-7
p bits, J-21 to J-23, J-25, J-36 to J-37
PC, see Program counter (PC)
PCI bus, historical background, L-81
PCIe, see PCI-Express (PCIe)
PCI-Express (PCIe), F-29, F-63
    storage area network history, F-102 to F-103
PCI-X, F-29
    storage area network history, F-102
PCI-X 2.0, F-63
PCMCIA slot, Sony PlayStation 2 Emotion Engine case study, E-15

PC-relative addressing mode, VAX, K-67

PDP-11, L-10, L-17 to L-19, L-56
PDU, see Power distribution unit (PDU)
Peak performance
    Cray X1E, G-24
    DAXPY on VMIPS, G-21
    DLP, 322
    fallacies, 57–58
    multiple lanes, 273
    multiprocessor scaled programs, 58
    Roofline model, 287
    vector architectures, 331
    VMIPS on DAXPY, G-17
    WSC operational costs, 434

Peer-to-peer
    internetworking, F-81 to F-82
    wireless networks, E-22
Pegasus, L-16
PennySort competition, D-66
Perfect Club benchmarks
    vector architecture programming, 281, 281–282
    vector processor history, G-28
Perfect processor, ILP hardware model, 214–215, 215
Perfect-shuffle exchange, interconnection network topology, F-30 to F-31

Performability, RAID reconstruction, D-55 to D-57

Performance, see also Peak performance

    advanced directory protocol case study, 420–426
    ARM Cortex-A8, 233–236, 234
    ARM Cortex-A8 memory, 115–117
    bandwidth vs. latency, 18–19
    benchmarks, 37–41
    branch penalty reduction, C-22
    branch schemes, C-25 to C-26
    cache basics, B-3 to B-6
    cache performance
        average memory access time, B-16 to B-20
        basic considerations, B-3 to B-6, B-16
        basic equations, B-22
        basic optimizations, B-40
        example calculation, B-16 to B-17
        out-of-order processors, B-20 to B-22
    compiler optimization impact, A-27
    cost-performance
        extensive pipelining, C-80 to C-81
        WSC Flash memory, 474–475
        WSC goals/requirements, 433
        WSC hardware inactivity, 474
        WSC processors, 472–473
    CUDA, 290–291
    desktop benchmarks, 38–40
    directory-based coherence case study, 418–420
    dirty bits, D-61 to D-64
    disk array deconstruction, D-51 to D-55
    disk deconstruction, D-48 to D-51
    DRAM, 100–102
    embedded computers, 9, E-13 to E-14
    Google server benchmarks, 439–441
    hardware fallacies, 56
    high-performance computing, 432, 435–436, B-10
    historical milestones, 20
    ILP exploitation, 201
    ILP for realizable processors, 216–218
    Intel Core i7, 239–241, 240, 401–405
    Intel Core i7 memory, 122–124
    interconnection networks
        bandwidth considerations, F-89
        multi-device networks, F-25 to F-29
        routing/arbitration/switching impact, F-52 to F-55
        two-device networks, F-12 to F-20
    Internet Archive Cluster, D-38 to D-40
    interprocessor communication, I-3 to I-6
    I/O devices, D-15 to D-16
    I/O subsystem design, D-59 to D-61
    I/O system design/evaluation, D-36
    ISA, 241–243
    Itanium 2, H-43
    large-scale multiprocessors
        scientific applications
            distributed-memory multiprocessors, I-26 to I-32, I-28 to I-30, I-32
            parallel processors, I-33 to I-34
            symmetric shared-memory multiprocessor, I-21 to I-26, I-23 to I-25
        synchronization, I-12 to I-16
    MapReduce, 438
    measurement, reporting, summarization, 36–37
    memory consistency models, 393
    memory hierarchy design, 73
    memory hierarchy and OS, B-58
    memory threads, GPUs, 332
    MIPS FP pipeline, C-60 to C-61
    MIPS M2000 vs. VAX 8700, K-82
    MIPS R4000 pipeline, C-67 to C-70, C-68
    multicore processors, 400–401, 401
    multiprocessing/multithreading, 398–400
    multiprocessors, measurement issues, 405–406


    multiprocessor software development, 408–409
    network topologies, F-40, F-40 to F-44
    observed, 57
    peak
        DLP, 322
        fallacies, 57–58
        multiple lanes, 273
        Roofline model, 287
        vector architectures, 331
        WSC operational costs, 434
    pipelines with stalls, C-12 to C-13
    pipelining basics, C-10 to C-11
    processors, historical growth, 2–3, 3
    quantitative measures, L-6 to L-7
    real-time, PMDs, 6
    real-world server considerations, 52–55
    results reporting, 41
    results summarization, 41–43, 43
    RISC classic pipeline, C-7
    server benchmarks, 40–41
    as server characteristic, 7
    single-chip multicore processor case study, 412–418
    single-thread, 399
        processor benchmarks, 243
    software development, 4
    software overhead issues, F-91 to F-92
    sorting case study, D-64 to D-67
    speculation cost, 211
    Sun T1 multithreading unicore, 227–229
    superlinear, 406
    switch microarchitecture pipelining, F-60 to F-61
    symmetric shared-memory multiprocessors, 366–378
        scientific workloads, I-21 to I-26, I-23
    system call virtualization/paravirtualization, 141
    transistors, scaling, 19–21
    vector, and memory bandwidth, 332
    vector add instruction, 272
    vector kernel implementation, 334–336
    vector processor, G-2 to G-7
        DAXPY on VMIPS, G-19 to G-21
        sparse matrices, G-12 to G-14
        start-up and multiple lanes, G-7 to G-9
    vector processors
        chaining, G-11 to G-12
        chaining/unchaining, G-12
    vector vs. scalar, 331–332
    VMIPS on Linpack, G-17 to G-19
    wormhole switching, F-92 to F-93

Permanent failure, commercial interconnection networks, F-66

Permanent faults, storage systems, D-11

Personal computers
    LANs, F-4
    networks, F-2
    PCIe, F-29
Personal mobile device (PMD)
    characteristics, 6
    as computer class, 5
    embedded computers, 8–9
    Flash memory, 18
    integrated circuit cost trends, 28
    ISA performance and efficiency prediction, 241–243
    memory hierarchy basics, 78
    memory hierarchy design, 72
    power and energy, 25
    processor comparison, 242

PetaBox GB2000, Internet Archive Cluster, D-37

Phase-ordering problem, compiler structure, A-26

Phits, see Physical transfer units (phits)

Physical addresses
    address translation, B-46
    AMD Opteron data cache, B-12 to B-13
    ARM Cortex-A8, 115
    directory-based cache coherence protocol basics, 382
    main memory block, B-44
    memory hierarchy, B-48 to B-49
    memory hierarchy basics, 77–78
    memory mapping, B-52
    paged virtual memory, B-55 to B-56
    page table-based mapping, B-45
    safe calls, B-54
    segmented virtual memory, B-51
    sharing/protection, B-52
    translation, B-36 to B-39
    virtual memory definition, B-42
Physical cache, definition, B-36 to B-37
Physical channels, F-47
Physical layer, definition, F-82
Physical memory
    centralized shared-memory multiprocessors, 347
    directory-based cache coherence, 354
    future GPU features, 332
    GPU conditional branching, 303
    main memory block, B-44
    memory hierarchy basics, B-41 to B-42
    multiprocessors, 345
    paged virtual memory, B-56
    processor comparison, 323
    segmented virtual memory, B-51
    unified, 333
    Virtual Machines, 110
Physical transfer units (phits), F-60
Physical volumes, D-34
PID, see Process-identifier (PID) tags
Pin-out bandwidth, topology, F-39
Pipeline bubble, stall as, C-13
Pipeline cycles per instruction
    basic equation, 148
    ILP, 149
    processor performance calculations, 218–219
    R4000 performance, C-68 to C-69
Pipelined circuit switching, F-50
Pipelined CPUs, early versions, L-26 to L-27
Pipeline delays
    ARM Cortex-A8, 235
    definition, 228
    fine-grained multithreading, 227
    instruction set complications, C-50
    multiple branch speculation, 211
    Sun T1 multithreading unicore performance, 227–228
Pipeline interlock
    data dependences, 151
    data hazards requiring stalls, C-20
    MIPS R4000, C-65
    MIPS vs. VMIPS, 268


Pipeline latches
    ALU, C-40
    definition, C-35
    R4000, C-60
    stopping/restarting execution, C-47
Pipeline organization
    dependences, 152
    MIPS, C-37
Pipeline registers
    branch hazard stall, C-42
    data hazards, C-57
    data hazard stalls, C-17 to C-20
    definition, C-35
    example, C-9
    MIPS, C-36 to C-39
    MIPS extension, C-53
    PC as, C-35
    pipelining performance issues, C-10
    RISC processor, C-8, C-10
Pipeline scheduling
    basic considerations, 161–162
    vs. dynamic scheduling, 168–169
    ILP exploitation, 197
    ILP exposure, 157–161
    microarchitectural techniques case study, 247–254
    MIPS R4000, C-64
Pipeline stall cycles
    branch scheme performance, C-25
    pipeline performance, C-12 to C-13

Pipelining
    branch cost reduction, C-26
    branch hazards, C-21 to C-26
    branch issues, C-39 to C-42
    branch penalty reduction, C-22 to C-25
    branch-prediction buffers, C-27 to C-30, C-29
    branch scheme performance, C-25 to C-26
    cache access, 82
    case studies, C-82 to C-88
    classic stages for RISC, C-6 to C-10
    compiler scheduling, L-31
    concept, C-2 to C-3
    cost-performance, C-80 to C-81
    data hazards, C-16 to C-21
    definition, C-2
    dynamically scheduled pipelines, C-70 to C-80
    example, C-8
    exception stopping/restarting, C-46 to C-47
    exception types and requirements, C-43 to C-46
    execution sequences, C-80
    floating-point addition speedup, J-25
    graphics pipeline history, L-51
    hazard classes, C-11
    hazard detection, C-38
    implementation difficulties, C-43 to C-49
    independent FP operations, C-54
    instruction set complications, C-49 to C-51
    interconnection networks, F-12
    latencies, C-87
    MIPS, C-34 to C-36
    MIPS control, C-36 to C-39
    MIPS exceptions, C-48, C-48 to C-49
    MIPS FP performance, C-60 to C-61
    MIPS multicycle operations
        basic considerations, C-51 to C-54
        hazards and forwarding, C-54 to C-58
        precise exceptions, C-58 to C-60
    MIPS R4000
        FP pipeline, C-65 to C-67, C-67
        overview, C-61 to C-65
        pipeline performance, C-67 to C-70
        pipeline structure, C-62 to C-63
    multiple outstanding FP operations, C-54
    performance issues, C-10 to C-11
    performance with stalls, C-12 to C-13
    predicted-not-taken scheme, C-22
    RISC instruction set, C-4 to C-5, C-70
    simple implementation, C-30 to C-43, C-34
    simple RISC, C-5 to C-6, C-7
    static branch prediction, C-26 to C-27
    structural hazards, C-13 to C-16, C-15
    switch microarchitecture, F-60 to F-61
    unoptimized code, C-81
Pipe segment, definition, C-3
Pipe stage
    branch prediction, C-28
    data hazards, C-16
    definition, C-3
    dynamic scheduling, C-71
    FP pipeline, C-66
    integrated instruction fetch units, 207
    MIPS, C-34 to C-35, C-37, C-49
    MIPS extension, C-53
    MIPS R4000, C-62
    out-of-order execution, 170
    pipeline stalls, C-13
    pipelining performance issues, C-10
    register additions, C-35
    RISC processor, C-7
    stopping/restarting execution, C-46
    WAW, 153

pjbb2005 benchmark
    Intel Core i7, 402
    SMT on superscalar processors, 230–232, 231
PLA, early computer arithmetic, J-65
PMD, see Personal mobile device (PMD)
Points-to analysis, basic approach, H-9
Point-to-point links
    bus replacement, D-34
    Ethernet, F-79
    storage systems, D-34
    switched-media networks, F-24

Point-to-point multiprocessor, example, 413

Point-to-point networks
    directory-based coherence, 418
    directory protocol, 421–422
    SMP limitations, 363–364

Poison bits, compiler-based speculation, H-28, H-30

Poisson, Siméon, D-28
Poisson distribution
    basic equation, D-28
    random variables, D-26 to D-34
Polycyclic scheduling, L-30
Portable computers
    interconnection networks, F-85
    processor comparison, 242

Port number, network interfaces, F-7


Position independence, control flow instruction addressing modes, A-17

Power
    distribution for servers, 490
    distribution overview, 447
    and DLP, 322
    first-level caches, 79–80
    Google server benchmarks, 439–441
    Google WSC, 465–468
    PMDs, 6
    real-world server considerations, 52–55
    WSC infrastructure, 447
    WSC power modes, 472
    WSC resource allocation case study, 478–479
    WSC TCO case study, 476–478

Power consumption, see also Energy efficiency

    cache optimization, 96
    cache size and associativity, 81
    case study, 63–64
    computer components, 63
    DDR3 SDRAM, 103
    disks, D-5
    embedded benchmarks, E-13
    GPUs vs. vector architectures, 311
    interconnection networks, F-85
    ISA performance and efficiency prediction, 242–243
    microprocessor, 23–26
    SDRAMs, 102
    SMT on superscalar processors, 230–231
    speculation, 210–211
    system trends, 21–23
    TI TMS320C55 DSP, E-8
    WSCs, 450

Power distribution unit (PDU), WSC infrastructure, 447

Power failure
    exceptions, C-43 to C-44, C-46
    utilities, 435
    WSC storage, 442
Power gating, transistors, 26
Power modes, WSCs, 472
PowerPC
    addressing modes, K-5
    AltiVec multimedia instruction compiler support, A-31
    ALU, K-5
    arithmetic/logical instructions, K-11
    branches, K-21
    cluster history, L-63
    conditional branches, K-17
    conditional instructions, H-27
    condition codes, K-10 to K-11
    consistency model, 395
    constant extension, K-9
    conventions, K-13
    data transfer instructions, K-10
    features, K-44
    FP instructions, K-23
    IBM Blue Gene/L, I-41 to I-42
    multimedia compiler support, A-31, K-17
    precise exceptions, C-59
    RISC architecture, A-2
    RISC code size, A-23
    as RISC systems, K-4
    unique instructions, K-32 to K-33
PowerPC ActiveC
    characteristics, K-18
    multimedia support, K-19

PowerPC AltiVec, multimedia support, E-11

Power-performance
    low-power servers, 477
    servers, 54

Power Supply Units (PSUs), efficiency ratings, 462

Power utilization effectiveness (PUE)
    datacenter comparison, 451
    Google WSC, 468
    Google WSC containers, 464–465
    WSC, 450–452
    WSCs vs. datacenters, 456
    WSC server energy efficiency, 462
Precise exceptions
    definition, C-47
    dynamic scheduling, 170
    hardware-based speculation, 187–188, 221
    instruction set complications, C-49
    maintaining, C-58 to C-60
    MIPS exceptions, C-48

Precisions, floating-point arithmetic, J-33 to J-34

Predicated instructions
    exposing parallelism, H-23 to H-27
    IA-64, H-38 to H-40
Predicate Registers
    definition, 309
    GPU conditional branching, 300–301
    IA-64, H-34
    NVIDIA GPU ISA, 298
    vectors vs. GPUs, 311
Predication, TI TMS320C6x DSP, E-10
Predicted-not-taken scheme
    branch penalty reduction, C-22, C-22 to C-23
    MIPS R4000 pipeline, C-64
Predictions, see also Mispredictions
    address aliasing, 213–214, 216
    branch
        correlation, 162–164
        cost reduction, 162–167, C-26
        dynamic, C-27 to C-30
        ideal processor, 214
        ILP exploitation, 201
        instruction fetch bandwidth, 205
        integrated instruction fetch units, 207
        Intel Core i7, 166–167, 239–241
        static, C-26 to C-27
    branch-prediction buffers, C-27 to C-30, C-29
    jump prediction, 214
    PMDs, 6
    return address buffer, 207
    2-bit scheme, C-28
    value prediction, 202, 212–213
Prefetching
    integrated instruction fetch units, 208
    Intel Core i7, 122, 123–124
    Itanium 2, H-42
    MIPS core extensions, K-20
    NVIDIA GPU Memory structures, 305
    parallel processing challenges, 351

Prefix, Intel 80x86 integer operations, K-51

Presentation layer, definition, F-82
Present bit, IA-32 descriptor table, B-52
Price vs. cost, 32–33
Price-performance ratio
    cost trends, 28
    Dell PowerEdge servers, 53
    desktop computers, 6
    processor comparisons, 55
    WSCs, 8, 441


Primitives
    architect-compiler writer relationship, A-30
    basic hardware types, 387–389
    compiler writing-architecture relationship, A-30
    CUDA Thread, 289
    dependent computation elimination, 321
    GPU vs. MIMD, 329
    locks via coherence, 391
    operand types and sizes, A-14 to A-15
    PA-RISC instructions, K-34 to K-35
    synchronization, 394, L-64
Principle of locality
    bidirectional MINs, F-33 to F-34
    cache optimization, B-26
    cache performance, B-3 to B-4
    coining of term, L-11
    commercial workload, 373
    computer design principles, 45
    definition, 45, B-2
    lock accesses, 390
    LRU, B-9
    memory accesses, 332, B-46
    memory hierarchy design, 72
    multilevel application, B-2
    multiprogramming workload, 375
    scientific workloads on symmetric shared-memory multiprocessors, I-25
    stride, 278
    WSC bottleneck, 461
    WSC efficiency, 450

Private data
    cache protocols, 359
    centralized shared-memory multiprocessors, 351–352
Private Memory
    definition, 292, 314
    NVIDIA GPU Memory structures, 304
Private variables, NVIDIA GPU Memory, 304
Procedure calls
    compiler structure, A-25 to A-26
    control flow instructions, A-17, A-19 to A-21
    dependence analysis, 321
    high-level instruction set, A-42 to A-43
    IA-64 register model, H-33
    invocation options, A-19
    ISAs, 14
    MIPS control flow instructions, A-38
    return address predictors, 206
    VAX, B-73 to B-74, K-71 to K-72
    VAX vs. MIPS, K-75
    VAX swap, B-74 to B-75

Process concept
    definition, 106, B-49
    protection schemes, B-50

Process-identifier (PID) tags, cache addressing, B-37 to B-38

Process IDs, Virtual Machines, 110
Processor consistency
    latency hiding with speculation, 396–397
    relaxed consistency models, 395
Processor cycles
    cache performance, B-4
    definition, C-3
    memory banks, 277
    multithreading, 224
Processor-dependent optimizations
    compilers, A-26
    performance impact, A-27
    types, A-28

Processor-intensive benchmarks, desktop performance, 38

Processor performance
    and average memory access time, B-17 to B-20
    vs. cache performance, B-16
    clock rate trends, 24
    desktop benchmarks, 38, 40
    historical trends, 3, 3–4
    multiprocessors, 347
    uniprocessors, 344

Processor performance equation, computer design principles, 48–52

Processor speed
    and clock rate, 244
    and CPI, 244
    snooping cache coherence, 364
Process switch
    definition, 106, B-49
    miss rate vs. virtual addressing, B-37
    multithreading, 224
    PID, B-37
    virtual memory-based protection, B-49 to B-50
Producer-server model, response time and throughput, D-16
Productivity
    CUDA, 290–291
    NVIDIA programmers, 289
    software development, 4
    virtual memory and programming, B-41
    WSC, 450

Profile-based predictor, misprediction rate, C-27

Program counter (PC)
    addressing modes, A-10
    ARM Cortex-A8, 234
    branch hazards, C-21
    branch-target buffers, 203, 203–204, 206
    control flow instruction addressing modes, A-17
    dynamic branch prediction, C-27 to C-28
    exception stopping/restarting, C-46 to C-47
    GPU conditional branching, 303
    Intel Core i7, 120
    M32R instructions, K-39
    MIPS control flow instructions, A-38
    multithreading, 223–224
    pipeline branch issues, C-39 to C-41
    pipe stages, C-35
    precise exceptions, C-59 to C-60
    RISC classic pipeline, C-8
    RISC instruction set, C-5
    simple MIPS implementation, C-31 to C-33
    TLP, 344
    virtual memory protection, 106
Program counter-relative addressing
    control flow instructions, A-17 to A-18, A-21
    definition, A-10
    MIPS instruction format, A-35
Programming models
    CUDA, 300, 310, 315
    GPUs, 288–291
    latency in consistency models, 397


    memory consistency, 393
    Multimedia SIMD architectures, 285
    vector architectures, 280–282
    WSCs, 436–441

Programming primitive, CUDA Thread, 289

Program order
    cache coherence, 353
    control dependences, 154–155
    data hazards, 153
    dynamic scheduling, 168–169, 174
    hardware-based speculation, 192
    ILP exploitation, 200
    name dependences, 152–153
    Tomasulo's approach, 182
Protection schemes
    control dependence, 155
    development, L-9 to L-12
    and ISA, 112
    network interfaces, F-7
    network user access, F-86 to F-87
    Pentium vs. Opteron, B-57
    processes, B-50
    safe calls, B-54
    segmented virtual memory example, B-51 to B-54
    Virtual Machines, 107–108
    virtual memory, 105–107, B-41
Protocol deadlock, routing, F-44
Protocol stack
    example, F-83
    internetworking, F-83
Pseudo-least recently used (LRU)
    block replacement, B-9 to B-10
    Intel Core i7, 118
PSUs, see Power Supply Units (PSUs)
PTE, see Page Table Entry (PTE)
PTX, see Parallel Thread Execution (PTX)
PUE, see Power utilization effectiveness (PUE)
Python language, hardware impact on software development, 4

Q
QCDOC, L-64
QoS, see Quality of service (QoS)
QsNetII, F-63, F-76
Quadrics SAN, F-67, F-100 to F-101
Quality of service (QoS)
    dependability benchmarks, D-21
    WAN history, F-98
Quantitative performance measures, development, L-6 to L-7
Queue
    definition, D-24
    waiting time calculations, D-28 to D-29
Queue discipline, definition, D-26
Queuing locks, large-scale multiprocessor synchronization, I-18 to I-21
Queuing theory
    basic assumptions, D-30
    Little's law, D-24 to D-25
    M/M/1 model, D-31 to D-33, D-32
    overview, D-23 to D-26
    RAID performance prediction, D-57 to D-59
    single-server model, D-25

Quickpath (Intel Xeon), cache coherence, 361

R
Race-to-halt, definition, 26
Rack units (U), WSC architecture, 441
Radio frequency amplifier, radio receiver, E-23
Radio receiver, components, E-23
Radio waves, wireless networks, E-21
Radix-2 multiplication/division, J-4 to J-7, J-6, J-55
Radix-4 multiplication/division, J-48 to J-49, J-49, J-56 to J-57, J-60 to J-61

Radix-8 multiplication, J-49
RAID (Redundant array of inexpensive disks)
    data replication, 439
    dependability benchmarks, D-21, D-22
    disk array deconstruction case study, D-51, D-55
    disk deconstruction case study, D-48
    hardware dependability, D-15
    historical background, L-79 to L-80
    I/O subsystem design, D-59 to D-61
    logical units, D-35
    memory dependability, 104
    NetApp FAS6000 filer, D-41 to D-42
    overview, D-6 to D-8, D-7
    performance prediction, D-57 to D-59
    reconstruction case study, D-55 to D-57
    row-diagonal parity, D-9
    WSC storage, 442
RAID 0, definition, D-6
RAID 1
    definition, D-6
    historical background, L-79
RAID 2
    definition, D-6
    historical background, L-79
RAID 3
    definition, D-7
    historical background, L-79 to L-80
RAID 4
    definition, D-7
    historical background, L-79 to L-80
RAID 5
    definition, D-8
    historical background, L-79 to L-80
RAID 6
    characteristics, D-8 to D-9
    hardware dependability, D-15
RAID 10, D-8
RAM (random access memory), switch microarchitecture, F-57
RAMAC-350 (Random Access Method of Accounting Control), L-77 to L-78, L-80 to L-81
Random Access Method of Accounting Control, L-77 to L-78
Random replacement
    cache misses, B-10
    definition, B-9
Random variables, distribution, D-26 to D-34
RAR, see Read after read (RAR)
RAS, see Row access strobe (RAS)
RAW, see Read after write (RAW)
Ray casting (RC)
    GPU comparisons, 329
    throughput computing kernel, 327


RDMA, see Remote direct memory access (RDMA)

Read after read (RAR), absence of data hazard, 154

Read after write (RAW)
    data hazards, 153
    dynamic scheduling with Tomasulo's algorithm, 170–171
    first vector computers, L-45
    hazards, stalls, C-55
    hazards and forwarding, C-55 to C-57
    instruction set complications, C-50
    microarchitectural techniques case study, 253
    MIPS FP pipeline performance, C-60 to C-61
    MIPS pipeline control, C-37 to C-38
    MIPS pipeline FP operations, C-53
    MIPS scoreboarding, C-74
    ROB, 192
    TI TMS320C55 DSP, E-8
    Tomasulo's algorithm, 182
    unoptimized code, C-81
Read miss
    AMD Opteron data cache, B-14
    cache coherence, 357, 358, 359–361
    coherence extensions, 362
    directory-based cache coherence protocol example, 380, 382–386
    memory hierarchy basics, 76–77
    memory stall clock cycles, B-4
    miss penalty reduction, B-35 to B-36
    Opteron data cache, B-14
    vs. write-through, B-11
Read operands stage
    ID pipe stage, 170
    MIPS scoreboarding, C-74 to C-75
    out-of-order execution, C-71

Realizable processors, ILP limitations, 216–220

Real memory, Virtual Machines, 110
Real-time constraints, definition, E-2
Real-time performance, PMDs, 6
Real-time performance requirement, definition, E-3
Real-time processing, embedded systems, E-3 to E-5

Rearrangeably nonblocking, centralized switched networks, F-32 to F-33

Receiving overhead
    communication latency, I-3 to I-4
    interconnection networks, F-88
    OCNs vs. SANs, F-27
    time of flight, F-14

RECN, see Regional explicit congestion notification (RECN)

Reconfiguration deadlock, routing, F-44

Reconstruction, RAID, D-55 to D-57
Recovery time, vector processor, G-8
Recurrences
    basic approach, H-11
    loop-carried dependences, H-5

Red-black Gauss-Seidel, Ocean application, I-9 to I-10

Reduced Instruction Set Computer, see RISC (Reduced Instruction Set Computer)

Reductions
    commercial workloads, 371
    cost trends, 28
    loop-level parallelism dependences, 321
    multiprogramming workloads, 377
    T1 multithreading unicore performance, 227
    WSCs, 438
Redundancy
    Amdahl's law, 48
    chip fabrication cost case study, 61–62
    computer system power consumption case study, 63–64
    index checks, B-8
    integrated circuit cost, 32
    integrated circuit failure, 35
    simple MIPS implementation, C-33
    WSC, 433, 435, 439
    WSC bottleneck, 461
    WSC storage, 442

Redundant array of inexpensive disks, see RAID (Redundant array of inexpensive disks)

Redundant multiplication, integers, J-48

Redundant power supplies, example calculations, 35

Reference bit
    memory hierarchy, B-52
    virtual memory block replacement, B-45
Regional explicit congestion notification (RECN), congestion management, F-66
Register addressing mode
    MIPS, 12
    VAX, K-67
Register allocation
    compilers, 396, A-26 to A-29
    VAX sort, K-76
    VAX swap, K-72

Register deferred addressing, VAX, K-67

Register, definition, 314
Register fetch (RF)
    MIPS data path, C-34
    MIPS R4000, C-63
    pipeline branches, C-41
    simple MIPS implementation, C-31
    simple RISC implementation, C-5 to C-6
Register file
    data hazards, C-16, C-18, C-20
    dynamic scheduling, 172, 173, 175, 177–178
    Fermi GPU, 306
    field, 176
    hardware-based speculation, 184
    longer latency pipelines, C-55 to C-57
    MIPS exceptions, C-49
    MIPS implementation, C-31, C-33
    MIPS R4000, C-64
    MIPS scoreboarding, C-75
    Multimedia SIMD Extensions, 282, 285
    multiple lanes, 272, 273
    multithreading, 224
    OCNs, F-3
    precise exceptions, C-59
    RISC classic pipeline, C-7 to C-8
    RISC instruction set, C-5 to C-6
    scoreboarding, C-73, C-75


    speculation support, 208
    structural hazards, C-13
    Tomasulo's algorithm, 180, 182
    vector architecture, 264
    VMIPS, 265, 308

Register indirect addressing mode, Intel 80x86, K-47

Register management, software-pipelined loops, H-14

Register-memory instruction set architecture
    architect-compiler writer relationship, A-30
    dynamic scheduling, 171
    Intel 80x86, K-52
    ISA classification, 11, A-3 to A-6

Register prefetch, cache optimization, 92

Register renaming
    dynamic scheduling, 169–172
    hardware vs. software speculation, 222
    ideal processor, 214
    ILP hardware model, 214
    ILP limitations, 213, 216–217
    ILP for realizable processors, 216
    instruction delivery and speculation, 202
    microarchitectural techniques case study, 247–254
    name dependences, 153
    vs. ROB, 208–210
    ROB instruction, 186
    sample code, 250
    SMT, 225
    speculation, 208–210
    superscalar code, 251
    Tomasulo's algorithm, 183
    WAW/WAR hazards, 220

Register result status, MIPS scoreboard, C-76

Registers
    DSP examples, E-6
    IA-64, H-33 to H-34
    instructions and hazards, C-17
    Intel 80x86, K-47 to K-49, K-48
    network interface functions, F-7
    pipe stages, C-35
    PowerPC, K-10 to K-11
    VAX swap, B-74 to B-75

Register stack engine, IA-64, H-34

Register tag example, 177
Register windows, SPARC instructions, K-29 to K-30
Regularity
    bidirectional MINs, F-33 to F-34
    compiler writing-architecture relationship, A-30
Relative speedup, multiprocessor performance, 406
Relaxed consistency models
    basic considerations, 394–395
    compiler optimization, 396
    WSC storage software, 439

Release consistency, relaxed consistency models, 395

Reliability
    Amdahl's law calculations, 56
    commercial interconnection networks, F-66
    example calculations, 48
    I/O subsystem design, D-59 to D-61
    modules, SLAs, 34
    MTTF, 57
    redundant power supplies, 34–35
    storage systems, D-44
    transistor scaling, 21
Relocation, virtual memory, B-42
Remainder, floating point, J-31 to J-32
Remington-Rand, L-5
Remote direct memory access (RDMA), InfiniBand, F-76

Remote node, directory-based cache coherence protocol basics, 381–382

Reorder buffer (ROB)compiler-based speculation, H-31dependent instructions, 199dynamic scheduling, 175FP unit with Tomasulo’s

algorithm, 185hardware-based speculation,

184–192ILP exploitation, 199–200ILP limitations, 216Intel Core i7, 238vs. register renaming, 208–210

Repeat interval, MIPS pipeline FP operations, C-52 to C-53

Replicationcache coherent multiprocessors, 354centralized shared-memory

architectures, 351–352coherence enforcement, 354R4000 performance, C-70RAID storage servers, 439TLP, 344virtual memory, B-48 to B-49WSCs, 438

Reply, messages, F-6Reproducibility, performance results

reporting, 41Request

messages, F-6switch microarchitecture, F-58

Requested protection level, segmented virtual memory, B-54

Request-level parallelism (RLP)basic characteristics, 345definition, 9from ILP, 4–5MIMD, 10multicore processors, 400multiprocessors, 345parallelism advantages, 44server benchmarks, 40WSCs, 434, 436

Request phase, arbitration, F-49Request-reply deadlock, routing, F-44Reservation stations

dependent instructions, 199–200dynamic scheduling, 178example, 177fields, 176hardware-based speculation, 184,

186, 189–191ILP exploitation, 197, 199–200Intel Core i7, 238–240loop iteration example, 181microarchitectural techniques case

study, 253–254speculation, 208–209Tomasulo’s algorithm, 172, 173,

174–176, 179, 180, 180–182

Resource allocationcomputer design principles, 45WSC case study, 478–479

Resource sparing, commercial interconnection networks, F-66


Response time, see also Latency
  I/O benchmarks, D-18
  performance considerations, 36
  performance trends, 18–19
  producer-server model, D-16
  server benchmarks, 40–41
  storage systems, D-16 to D-18
  vs. throughput, D-17
  user experience, 4
  WSCs, 450
Responsiveness
  PMDs, 6
  as server characteristic, 7
Restartable pipeline
  definition, C-45
  exceptions, C-46 to C-47
Restorations, SLA states, 34
Restoring division, J-5, J-6
Resume events
  control dependences, 156
  exceptions, C-45 to C-46
  hardware-based speculation, 188
Return address predictors
  instruction fetch bandwidth, 206–207
  prediction accuracy, 207
Returns
  Amdahl's law, 47
  cache coherence, 352–353
  compiler technology and architectural decisions, A-28
  control flow instructions, 14, A-17, A-21
  hardware primitives, 388
  Intel 80x86 integer operations, K-51
  invocation options, A-19
  procedure invocation options, A-19
  return address predictors, 206
Reverse path, cell phones, E-24
RF, see Register fetch (RF)
Rings
  characteristics, F-73
  NEWS communication, F-42
  OCN history, F-104
  process protection, B-50
  topology, F-35 to F-36, F-36
Ripple-carry adder, J-3, J-3, J-42
  chip comparison, J-60
Ripple-carry addition, J-2 to J-3
RISC (Reduced Instruction Set Computer)
  addressing modes, K-5 to K-6
  Alpha-unique instructions, K-27 to K-29
  architecture flaws vs. success, A-45
  ARM-unique instructions, K-36 to K-37
  basic concept, C-4 to C-5
  basic systems, K-3 to K-5
  cache performance, B-6
  classic pipeline stages, C-6 to C-10
  code size, A-23 to A-24
  compiler history, L-31
  desktop/server systems, K-4
    instruction formats, K-7
    multimedia extensions, K-16 to K-19
  desktop systems
    addressing modes, K-5
    arithmetic/logical instructions, K-11, K-22
    conditional branches, K-17
    constant extension, K-9
    control instructions, K-12
    conventions, K-13
    data transfer instructions, K-10, K-21
    features, K-44
    FP instructions, K-13, K-23
    multimedia extensions, K-18
  development, 2
  early pipelined CPUs, L-26
  embedded systems, K-4
    addressing modes, K-6
    arithmetic/logical instructions, K-15, K-24
    conditional branches, K-17
    constant extension, K-9
    control instructions, K-16
    conventions, K-16
    data transfers, K-14, K-23
    DSP extensions, K-19
    instruction formats, K-8
    multiply-accumulate, K-20
  historical background, L-19 to L-21
  instruction formats, K-5 to K-6
  instruction set lineage, K-43
  ISA performance and efficiency prediction, 241
  M32R-unique instructions, K-39 to K-40
  MIPS16-unique instructions, K-40 to K-42
  MIPS64-unique instructions, K-24 to K-27
  MIPS core common extensions, K-19 to K-24
  MIPS M2000 vs. VAX 8700, L-21
  Multimedia SIMD Extensions history, L-49 to L-50
  operations, 12
  PA-RISC-unique, K-33 to K-35
  pipelining efficiency, C-70
  PowerPC-unique instructions, K-32 to K-33
  Sanyo VPC-SX500 digital camera, E-19
  simple implementation, C-5 to C-6
  simple pipeline, C-7
  SPARC-unique instructions, K-29 to K-32
  Sun T1 multithreading, 226–227
  SuperH-unique instructions, K-38 to K-39
  Thumb-unique instructions, K-37 to K-38
  vector processor history, G-26
  Virtual Machines ISA support, 109
  Virtual Machines and virtual memory and I/O, 110
RISC-I, L-19 to L-20
RISC-II, L-19 to L-20
RLP, see Request-level parallelism (RLP)
ROB, see Reorder buffer (ROB)
Roofline model
  GPU performance, 326
  memory bandwidth, 332
  Multimedia SIMD Extensions, 285–288, 287
Round digit, J-18
Rounding modes, J-14, J-17 to J-19, J-18, J-20
  FP precisions, J-34
  fused multiply-add, J-33
Round-robin (RR)
  arbitration, F-49
  IBM 360, K-85 to K-86
  InfiniBand, F-74
Routers
  BARRNet, F-80


  Ethernet, F-79
Routing algorithm
  commercial interconnection networks, F-56
  fault tolerance, F-67
  implementation, F-57
  Intel SCCC, F-70
  interconnection networks, F-21 to F-22, F-27, F-44 to F-48
  mesh network, F-46
  network impact, F-52 to F-55
  OCN history, F-104
  and overhead, F-93 to F-94
  SAN characteristics, F-76
  switched-media networks, F-24
  switch microarchitecture pipelining, F-61
  system area network history, F-100
Row access strobe (RAS), DRAM, 98
Row-diagonal parity
  example, D-9
  RAID, D-9
Row major order, blocking, 89
RR, see Round-robin (RR)
RS format instructions, IBM 360, K-87
Ruby on Rails, hardware impact on software development, 4
RX format instructions, IBM 360, K-86 to K-87

S

S3, see Amazon Simple Storage Service (S3)
SaaS, see Software as a Service (SaaS)
Sandy Bridge dies, wafer example, 31
SANs, see System/storage area networks (SANs)
Sanyo digital cameras, SOC, E-20
Sanyo VPC-SX500 digital camera, embedded system case study, E-19
SAS, see Serial Attach SCSI (SAS) drive
SASI, L-81
SATA (Serial Advanced Technology Attachment) disks
  Google WSC servers, 469
  NetApp FAS6000 filer, D-42
  power consumption, D-5
  RAID 6, D-8
  vs. SAS drives, D-5
  storage area network history, F-103
Saturating arithmetic, DSP media extensions, E-11
Saturating operations, definition, K-18 to K-19
SAXPY, GPU raw/relative performance, 328
Scalability
  cloud computing, 460
  coherence issues, 378–379
  Fermi GPU, 295
  Java benchmarks, 402
  multicore processors, 400
  multiprocessing, 344, 395
  parallelism, 44
  as server characteristic, 7
  transistor performance and wires, 19–21
  WSCs, 8, 438
  WSCs vs. servers, 434
Scalable GPUs, historical background, L-50 to L-51
Scalar expansion, loop-level parallelism dependences, 321
Scalar Processors, see also Superscalar processors
  definition, 292, 309
  early pipelined CPUs, L-26 to L-27
  lane considerations, 273
  Multimedia SIMD/GPU comparisons, 312
  NVIDIA GPU, 291
  prefetch units, 277
  vs. vector, 311, G-19
  vector performance, 331–332
Scalar registers
  Cray X1, G-21 to G-22
  GPUs vs. vector architectures, 311
  loop-level parallelism dependences, 321–322
  Multimedia SIMD vs. GPUs, 312
  sample renaming code, 251
  vector vs. GPU, 311
  vs. vector performance, 331–332
  VMIPS, 265–266
Scaled addressing, VAX, K-67
Scaled speedup, Amdahl's law and parallel computers, 406–407
Scaling
  Amdahl's law and parallel computers, 406–407
  cloud computing, 456
  computation-to-communication ratios, I-11
  DVFS, 25, 52, 467
  dynamic voltage-frequency, 25, 52, 467
  Intel Core i7, 404
  interconnection network speed, F-88
  multicore vs. single-core, 402
  processor performance trends, 3
  scientific applications on parallel processing, I-34
  shared- vs. switched-media networks, F-25
  transistor performance and wires, 19–21
  VMIPS, 267
Scan Line Interleave (SLI), scalable GPUs, L-51
SCCC, see Intel Single-Chip Cloud Computing (SCCC)
Schorr, Herb, L-28
Scientific applications
  Barnes, I-8 to I-9
  basic characteristics, I-6 to I-7
  cluster history, L-62
  distributed-memory multiprocessors, I-26 to I-32, I-28 to I-32
  FFT kernel, I-7
  LU kernel, I-8
  Ocean, I-9 to I-10
  parallel processors, I-33 to I-34
  parallel program computation/communication, I-10 to I-12, I-11
  parallel programming, I-2
  symmetric shared-memory multiprocessors, I-21 to I-26, I-23 to I-25
Scoreboarding
  ARM Cortex-A8, 233, 234
  components, C-76
  definition, 170
  dynamic scheduling, 171, 175
  and dynamic scheduling, C-71 to C-80
  example calculations, C-77
  MIPS structure, C-73
  NVIDIA GPU, 296
  results tables, C-78 to C-79
  SIMD thread scheduler, 296


Scripting languages, software development impact, 4
SCSI (Small Computer System Interface)
  Berkeley's Tertiary Disk project, D-12
  dependability benchmarks, D-21
  disk storage, D-4
  historical background, L-80 to L-81
  I/O subsystem design, D-59
  RAID reconstruction, D-56
  storage area network history, F-102
SDRAM, see Synchronous dynamic random-access memory (SDRAM)
SDRWAVE, J-62
Second-level caches, see also L2 caches
  ARM Cortex-A8, 114
  ILP, 245
  Intel Core i7, 121
  interconnection network, F-87
  Itanium 2, H-41
  memory hierarchy, B-48 to B-49
  miss penalty calculations, B-33 to B-34
  miss penalty reduction, B-30 to B-35
  miss rate calculations, B-31 to B-35
  and relative execution time, B-34
  speculation, 210
  SRAM, 99
Secure Virtual Machine (SVM), 129
Seek distance
  storage disks, D-46
  system comparison, D-47
Seek time, storage disks, D-46
Segment basics
  Intel 80x86, K-50
  vs. page, B-43
  virtual memory definition, B-42 to B-43
Segment descriptor, IA-32 processor, B-52, B-53
Segmented virtual memory
  bounds checking, B-52
  Intel Pentium protection, B-51 to B-54
  memory mapping, B-52
  vs. paged, B-43
  safe calls, B-54
  sharing and protection, B-52 to B-53
Self-correction, Newton's algorithm, J-28 to J-29
Self-draining pipelines, L-29
Self-routing, MINs, F-48
Semantic clash, high-level instruction set, A-41
Semantic gap, high-level instruction set, A-39
Semiconductors
  DRAM technology, 17
  Flash memory, 18
  GPU vs. MIMD, 325
  manufacturing, 3–4
Sending overhead
  communication latency, I-3 to I-4
  OCNs vs. SANs, F-27
  time of flight, F-14
Sense-reversing barrier
  code example, I-15, I-21
  large-scale multiprocessor synchronization, I-14
Sequence of SIMD Lane Operations, definition, 292, 313
Sequence number, packet header, F-8
Sequential consistency
  latency hiding with speculation, 396–397
  programmer's viewpoint, 394
  relaxed consistency models, 394–395
  requirements and implementation, 392–393
Sequential interleaving, multibanked caches, 86, 86
Sequent Symmetry, L-59
Serial Advanced Technology Attachment disks, see SATA (Serial Advanced Technology Attachment) disks
Serial Attach SCSI (SAS) drive
  historical background, L-81
  power consumption, D-5
  vs. SATA drives, D-5
Serialization
  barrier synchronization, I-16
  coherence enforcement, 354
  directory-based cache coherence, 382
  DSM multiprocessor cache coherence, I-37
  hardware primitives, 387
  multiprocessor cache coherency, 353
  page tables, 408
  snooping coherence protocols, 356
  write invalidate protocol implementation, 356
Serpentine recording, L-77
Serve-longest-queue (SLQ) scheme, arbitration, F-49
ServerNet interconnection network, fault tolerance, F-66 to F-67
Servers, see also Warehouse-scale computers (WSCs)
  as computer class, 5
  cost calculations, 454, 454–455
  definition, D-24
  energy savings, 25
  Google WSC, 440, 467, 468–469
  GPU features, 324
  memory hierarchy design, 72
  vs. mobile GPUs, 323–330
  multiprocessor importance, 344
  outage/anomaly statistics, 435
  performance benchmarks, 40–41
  power calculations, 463
  power distribution example, 490
  power-performance benchmarks, 54, 439–441
  power-performance modes, 477
  real-world examples, 52–55
  RISC systems
    addressing modes and instruction formats, K-5 to K-6
    examples, K-3, K-4
    instruction formats, K-7
    multimedia extensions, K-16 to K-19
  single-server model, D-25
  system characteristics, E-4
  workload demands, 439
  WSC vs. datacenters, 455–456
  WSC data transfer, 446
  WSC energy efficiency, 462–464
  vs. WSC facility costs, 472
  WSC memory hierarchy, 444
  WSC resource allocation case study, 478–479


  vs. WSCs, 432–434
  WSC TCO case study, 476–478
Server side Java operations per second (ssj_ops)
  example calculations, 439
  power-performance, 54
  real-world considerations, 52–55
Server utilization
  calculation, D-28 to D-29
  queuing theory, D-25
Service accomplishment, SLAs, 34
Service Health Dashboard, AWS, 457
Service interruption, SLAs, 34
Service level agreements (SLAs)
  Amazon Web Services, 457
  and dependability, 33
  WSC efficiency, 452
Service level objectives (SLOs)
  and dependability, 33
  WSC efficiency, 452
Session layer, definition, F-82
Set associativity
  and access time, 77
  address parts, B-9
  AMD Opteron data cache, B-12 to B-14
  ARM Cortex-A8, 114
  block placement, B-7 to B-8
  cache block, B-7
  cache misses, 83–84, B-10
  cache optimization, 79–80, B-33 to B-35, B-38 to B-40
  commercial workload, 371
  energy consumption, 81
  memory access times, 77
  memory hierarchy basics, 74, 76
  nonblocking cache, 84
  performance equations, B-22
  pipelined cache access, 82
  way prediction, 81
Set basics
  block replacement, B-9 to B-10
  definition, B-7
Set-on-less-than instructions (SLT)
  MIPS16, K-14 to K-15
  MIPS conditional branches, K-11 to K-12
Settle time, D-46
SFF, see Small form factor (SFF) disk
SFS benchmark, NFS, D-20
SGI, see Silicon Graphics systems (SGI)
Shadow page table, Virtual Machines, 110
Sharding, WSC memory hierarchy, 445
Shared-media networks
  effective bandwidth vs. nodes, F-28
  example, F-22
  latency and effective bandwidth, F-26 to F-28
  multiple device connections, F-22 to F-24
  vs. switched-media networks, F-24 to F-25
Shared Memory
  definition, 292, 314
  directory-based cache coherence, 418–420
  DSM, 347–348, 348, 354–355, 378–380
  invalidate protocols, 356–357
  SMP/DSM definition, 348
  terminology comparison, 315
Shared-memory communication, large-scale multiprocessors, I-5
Shared-memory multiprocessors
  basic considerations, 351–352
  basic structure, 346–347
  cache coherence, 352–353
  cache coherence enforcement, 354–355
  cache coherence example, 357–362
  cache coherence extensions, 362–363
  data caching, 351–352
  definition, L-63
  historical background, L-60 to L-61
  invalidate protocol implementation, 356–357
  limitations, 363–364
  performance, 366–378
  single-chip multicore case study, 412–418
  SMP and snooping limitations, 363–364
  snooping coherence implementation, 365–366
  snooping coherence protocols, 355–356
  WSCs, 435, 441
Shared-memory synchronization, MIPS core extensions, K-21
Shared state
  cache block, 357, 359
  cache coherence, 360
  cache miss calculations, 366–367
  coherence extensions, 362
  directory-based cache coherence protocol basics, 380, 385
  private cache, 358
Sharing addition, segmented virtual memory, B-52 to B-53
Shear algorithms, disk array deconstruction, D-51 to D-52, D-52 to D-54
Shifting over zeros, integer multiplication/division, J-45 to J-47
Short-circuiting, see Forwarding
SI format instructions, IBM 360, K-87
Signals, definition, E-2
Signal-to-noise ratio (SNR), wireless networks, E-21
Signed-digit representation
  example, J-54
  integer multiplication, J-53
Signed number arithmetic, J-7 to J-10
Sign-extended offset, RISC, C-4 to C-5
Significand, J-15
Sign magnitude, J-7
Silicon Graphics 4D/240, L-59
Silicon Graphics Altix, F-76, L-63
Silicon Graphics Challenge, L-60
Silicon Graphics Origin, L-61, L-63
Silicon Graphics systems (SGI)
  economies of scale, 456
  miss statistics, B-59
  multiprocessor software development, 407–409
  vector processor history, G-27
SIMD (Single Instruction Stream, Multiple Data Stream)
  definition, 10
  Fermi GPU architectural innovations, 305–308
  GPU conditional branching, 301


  GPU examples, 325
  GPU programming, 289–290
  GPUs vs. vector architectures, 308–309
  historical overview, L-55 to L-56
  loop-level parallelism, 150
  MapReduce, 438
  memory bandwidth, 332
  multimedia extensions, see Multimedia SIMD Extensions
  multiprocessor architecture, 346
  multithreaded, see Multithreaded SIMD Processor
  NVIDIA GPU computational structures, 291
  NVIDIA GPU ISA, 300
  power/DLP issues, 322
  speedup via parallelism, 263
  supercomputer development, L-43 to L-44
  system area network history, F-100
  Thread Block mapping, 293
  TI 320C6x DSP, E-9
SIMD Instruction
  CUDA Thread, 303
  definition, 292, 313
  DSP media extensions, E-10
  function, 150, 291
  GPU Memory structures, 304
  GPUs, 300, 305
  Grid mapping, 293
  IBM Blue Gene/L, I-42
  Intel AVX, 438
  multimedia architecture programming, 285
  multimedia extensions, 282–285, 312
  multimedia instruction compilers, A-31 to A-32
  Multithreaded SIMD Processor block diagram, 294
  PTX, 301
  Sony PlayStation 2, E-16
  Thread of SIMD Instructions, 295–296
  thread scheduling, 296–297, 297, 305
  vector architectures as superset, 263–264
  vector/GPU comparison, 308
  Vector Registers, 309
SIMD Lane Registers, definition, 309, 314
SIMD Lanes
  definition, 292, 296, 309
  DLP, 322
  Fermi GPU, 305, 307
  GPU, 296–297, 300, 324
  GPU conditional branching, 302–303
  GPUs vs. vector architectures, 308, 310, 311
  instruction scheduling, 297
  multimedia extensions, 285
  Multimedia SIMD vs. GPUs, 312, 315
  multithreaded processor, 294
  NVIDIA GPU Memory, 304
  synchronization marker, 301
  vector vs. GPU, 308, 311
SIMD Processors, see also Multithreaded SIMD Processor
  block diagram, 294
  definition, 292, 309, 313–314
  dependent computation elimination, 321
  design, 333
  Fermi GPU, 296, 305–308
  Fermi GTX 480 GPU floorplan, 295, 295–296
  GPU conditional branching, 302
  GPU vs. MIMD, 329
  GPU programming, 289–290
  GPUs vs. vector architectures, 310, 310–311
  Grid mapping, 293
  Multimedia SIMD vs. GPU, 312
  multiprocessor architecture, 346
  NVIDIA GPU computational structures, 291
  NVIDIA GPU Memory structures, 304–305
  processor comparisons, 324
  Roofline model, 287, 326
  system area network history, F-100
SIMD Thread
  GPU conditional branching, 301–302
  Grid mapping, 293
  Multithreaded SIMD processor, 294
  NVIDIA GPU, 296
  NVIDIA GPU ISA, 298
  NVIDIA GPU Memory structures, 305
  scheduling example, 297
  vector vs. GPU, 308
  vector processor, 310
SIMD Thread Scheduler
  definition, 292, 314
  example, 297
  Fermi GPU, 295, 305–307, 306
  GPU, 296
SIMT (Single Instruction, Multiple Thread)
  GPU programming, 289
  vs. SIMD, 314
  Warp, 313
Simultaneous multithreading (SMT)
  characteristics, 226
  definition, 224–225
  historical background, L-34 to L-35
  IBM eServer p5 575, 399
  ideal processors, 215
  Intel Core i7, 117–118, 239–241
  Java and PARSEC workloads, 403–404
  multicore performance/energy efficiency, 402–405
  multiprocessing/multithreading-based performance, 398–400
  multithreading history, L-35
  superscalar processors, 230–232
Single-extended precision floating-point arithmetic, J-33 to J-34
Single Instruction, Multiple Thread, see SIMT (Single Instruction, Multiple Thread)
Single Instruction Stream, Multiple Data Stream, see SIMD (Single Instruction Stream, Multiple Data Stream)
Single Instruction Stream, Single Data Stream, see SISD (Single Instruction Stream, Single Data Stream)


Single-level cache hierarchy, miss rates vs. cache size, B-33
Single-precision floating point
  arithmetic, J-33 to J-34
  GPU examples, 325
  GPU vs. MIMD, 328
  MIPS data types, A-34
  MIPS operations, A-36
  Multimedia SIMD Extensions, 283
  operand sizes/types, 12, A-13
  as operand type, A-13 to A-14
  representation, J-15 to J-16
Single-Streaming Processor (SSP)
  Cray X1, G-21 to G-24
  Cray X1E, G-24
Single-thread (ST) performance
  IBM eServer p5 575, 399, 399
  Intel Core i7, 239
  ISA, 242
  processor comparison, 243
SISD (Single Instruction Stream, Single Data Stream), 10
  SIMD computer history, L-55
Skippy algorithm
  disk deconstruction, D-49
  sample results, D-50
SLAs, see Service level agreements (SLAs)
SLI, see Scan Line Interleave (SLI)
SLOs, see Service level objectives (SLOs)
SLQ, see Serve-longest-queue (SLQ) scheme
SLT, see Set-on-less-than instructions (SLT)
SM, see Distributed shared memory (DSM)
Small Computer System Interface, see SCSI (Small Computer System Interface)
Small form factor (SFF) disk, L-79
Smalltalk, SPARC instructions, K-30
Smart interface cards, vs. smart switches, F-85 to F-86
Smartphones
  ARM Cortex-A8, 114
  mobile vs. server GPUs, 323–324
Smart switches, vs. smart interface cards, F-85 to F-86
SMP, see Symmetric multiprocessors (SMP)
SMT, see Simultaneous multithreading (SMT)
Snooping cache coherence
  basic considerations, 355–356
  controller transitions, 421
  definition, 354–355
  directory-based, 381, 386, 420–421
  example, 357–362
  implementation, 365–366
  large-scale multiprocessor history, L-61
  large-scale multiprocessors, I-34 to I-35
  latencies, 414
  limitations, 363–364
  sample types, L-59
  single-chip multicore processor case study, 412–418
  symmetric shared-memory machines, 366
SNR, see Signal-to-noise ratio (SNR)
SoC, see System-on-chip (SoC)
Soft errors, definition, 104
Soft real-time
  definition, E-3
  PMDs, 6
Software as a Service (SaaS)
  clusters/WSCs, 8
  software development, 4
  WSCs, 438
  WSCs vs. servers, 433–434
Software development
  multiprocessor architecture issues, 407–409
  performance vs. productivity, 4
  WSC efficiency, 450–452
Software pipelining
  example calculations, H-13 to H-14
  loops, execution pattern, H-15
  technique, H-12 to H-15, H-13
Software prefetching, cache optimization, 131–133
Software speculation
  definition, 156
  vs. hardware speculation, 221–222
  VLIW, 196
Software technology
  ILP approaches, 148
  large-scale multiprocessors, I-6
  large-scale multiprocessor synchronization, I-17 to I-18
  network interfaces, F-7
  vs. TCP/IP reliance, F-95
  Virtual Machines protection, 108
  WSC running service, 434–435
Solaris, RAID benchmarks, D-22, D-22 to D-23
Solid-state disks (SSDs)
  processor performance/price/power, 52
  server energy efficiency, 462
  WSC cost-performance, 474–475
Sonic Smart Interconnect, OCNs, F-3
Sony PlayStation 2
  block diagram, E-16
  embedded multiprocessors, E-14
  Emotion Engine case study, E-15 to E-18
  Emotion Engine organization, E-18
Sorting, case study, D-64 to D-67
Sort primitive, GPU vs. MIMD, 329
Sort procedure, VAX
  bubble sort, K-76
  example code, K-77 to K-79
  vs. MIPS32, K-80
  register allocation, K-76
Source routing, basic concept, F-48
SPARCLE processor, L-34
Sparse matrices
  loop-level parallelism dependences, 318–319
  vector architectures, 279–280, G-12 to G-14
  vector execution time, 271
  vector mask registers, 275
Spatial locality
  coining of term, L-11
  definition, 45, B-2
  memory hierarchy design, 72
SPEC benchmarks
  branch predictor correlation, 162–164
  desktop performance, 38–40
  early performance measures, L-7
  evolution, 39
  fallacies, 56
  operands, A-14
  performance, 38
  performance results reporting, 41


  processor performance growth, 3
  static branch prediction, C-26 to C-27
  storage systems, D-20 to D-21
  tournament predictors, 164
  two-bit predictors, 165
  vector processor history, G-28
SPEC89 benchmarks
  branch-prediction buffers, C-28 to C-30, C-30
  MIPS FP pipeline performance, C-61 to C-62
  misprediction rates, 166
  tournament predictors, 165–166
  VAX 8700 vs. MIPS M2000, K-82
SPEC92 benchmarks
  hardware vs. software speculation, 221
  ILP hardware model, 215
  MIPS R4000 performance, C-68 to C-69, C-69
  misprediction rate, C-27
SPEC95 benchmarks
  return address predictors, 206–207, 207
  way prediction, 82
SPEC2000 benchmarks
  ARM Cortex-A8 memory, 115–116
  cache performance prediction, 125–126
  cache size and misses per instruction, 126
  compiler optimizations, A-29
  compulsory miss rate, B-23
  data reference sizes, A-44
  hardware prefetching, 91
  instruction misses, 127
SPEC2006 benchmarks, evolution, 39
SPECCPU2000 benchmarks
  displacement addressing mode, A-12
  Intel Core i7, 122
  server benchmarks, 40
SPECCPU2006 benchmarks
  branch predictors, 167
  Intel Core i7, 123–124, 240, 240–241
  ISA performance and efficiency prediction, 241
  Virtual Machines protection, 108
SPECfp benchmarks
  hardware prefetching, 91
  interconnection network, F-87
  ISA performance and efficiency prediction, 241–242
  Itanium 2, H-43
  MIPS FP pipeline performance, C-60 to C-61
  nonblocking caches, 84
  tournament predictors, 164
SPECfp92 benchmarks
  Intel 80x86 vs. DLX, K-63
  Intel 80x86 instruction lengths, K-60
  Intel 80x86 instruction mix, K-61
  Intel 80x86 operand type distribution, K-59
  nonblocking cache, 83
SPECfp2000 benchmarks
  hardware prefetching, 92
  MIPS dynamic instruction mix, A-42
  Sun Ultra 5 execution times, 43
SPECfp2006 benchmarks
  Intel processor clock rates, 244
  nonblocking cache, 83
SPECfpRate benchmarks
  multicore processor performance, 400
  multiprocessor cost effectiveness, 407
  SMT, 398–400
  SMT on superscalar processors, 230
SPEChpc96 benchmark, vector processor history, G-28
Special-purpose machines
  historical background, L-4 to L-5
  SIMD computer history, L-56
Special-purpose register
  compiler writing-architecture relationship, A-30
  ISA classification, A-3
  VMIPS, 267
Special values
  floating point, J-14 to J-15
  representation, J-16
SPECINT benchmarks
  hardware prefetching, 92
  interconnection network, F-87
  ISA performance and efficiency prediction, 241–242
  Itanium 2, H-43
  nonblocking caches, 84
SPECInt92 benchmarks
  Intel 80x86 vs. DLX, K-63
  Intel 80x86 instruction lengths, K-60
  Intel 80x86 instruction mix, K-62
  Intel 80x86 operand type distribution, K-59
  nonblocking cache, 83
SPECint95 benchmarks, interconnection networks, F-88
SPECINT2000 benchmarks, MIPS dynamic instruction mix, A-41
SPECINT2006 benchmarks
  Intel processor clock rates, 244
  nonblocking cache, 83
SPECintRate benchmark
  multicore processor performance, 400
  multiprocessor cost effectiveness, 407
  SMT, 398–400
  SMT on superscalar processors, 230
SPEC Java Business Benchmark (JBB)
  multicore processor performance, 400
  multicore processors, 402
  multiprocessing/multithreading-based performance, 398
  server, 40
  Sun T1 multithreading unicore performance, 227–229, 229
SPECJVM98 benchmarks, ISA performance and efficiency prediction, 241
SPECMail benchmark, characteristics, D-20
SPEC-optimized processors, vs. density-optimized, F-85
SPECPower benchmarks
  Google server benchmarks, 439–440, 440
  multicore processor performance, 400


  real-world server considerations, 52–55
  WSCs, 463
  WSC server energy efficiency, 462–463
SPECRate benchmarks
  Intel Core i7, 402
  multicore processor performance, 400
  multiprocessor cost effectiveness, 407
  server benchmarks, 40
SPECRate2000 benchmarks, SMT, 398–400
SPECRatios
  execution time examples, 43
  geometric means calculations, 43–44
SPECSFS benchmarks
  example, D-20
  servers, 40
Speculation, see also Hardware-based speculation; Software speculation
  advantages/disadvantages, 210–211
  compilers, see Compiler speculation
  concept origins, L-29 to L-30
  and energy efficiency, 211–212
  FP unit with Tomasulo's algorithm, 185
  hardware vs. software, 221–222
  IA-64, H-38 to H-40
  ILP studies, L-32 to L-33
  Intel Core i7, 123–124
  latency hiding in consistency models, 396–397
  memory reference, hardware support, H-32
  and memory system, 222–223
  microarchitectural techniques case study, 247–254
  multiple branches, 211
  register renaming vs. ROB, 208–210
SPECvirt_Sc2010 benchmarks, server, 40
SPECWeb benchmarks
  characteristics, D-20
  dependability, D-21
  parallelism, 44
  server benchmarks, 40
SPECWeb99 benchmarks
  multiprocessing/multithreading-based performance, 398
  Sun T1 multithreading unicore performance, 227, 229
Speedup
  Amdahl's law, 46–47
  floating-point addition, J-25 to J-26
  integer addition
    carry-lookahead, J-37 to J-41
    carry-lookahead circuit, J-38
    carry-lookahead tree, J-40 to J-41
    carry-lookahead tree adder, J-41
    carry-select adder, J-43, J-43 to J-44, J-44
    carry-skip adder, J-41 to J-43, J-42
    overview, J-37
  integer division
    radix-2 division, J-55
    radix-4 division, J-56
    radix-4 SRT division, J-57
    with single adder, J-54 to J-58
  integer multiplication
    array multiplier, J-50
    Booth recoding, J-49
    even/odd array, J-52
    with many adders, J-50 to J-54
    multipass array multiplier, J-51
    signed-digit addition table, J-54
    with single adder, J-47 to J-49, J-48
    Wallace tree, J-53
  integer multiplication/division, shifting over zeros, J-45 to J-47
  integer SRT division, J-45 to J-46, J-46
  linear, 405–407
  via parallelism, 263
  pipeline with stalls, C-12 to C-13
  relative, 406
  scaled, 406–407
  switch buffer organizations, F-58 to F-59
  true, 406
Sperry-Rand, L-4 to L-5
Spin locks
  via coherence, 389–390
  large-scale multiprocessor synchronization
    barrier synchronization, I-16
    exponential back-off, I-17
SPLASH parallel benchmarks, SMT on superscalar processors, 230
Split, GPU vs. MIMD, 329
SPRAM, Sony PlayStation 2 Emotion Engine organization, E-18
Sprowl, Bob, F-99
Squared coefficient of variance, D-27
SRAM, see Static random-access memory (SRAM)
SRT division
  chip comparison, J-60 to J-61
  complications, J-45 to J-46
  early computer arithmetic, J-65
  example, J-46
  historical background, J-63
  integers, with adder, J-55 to J-57
  radix-4, J-56, J-57
SSDs, see Solid-state disks (SSDs)
SSE, see Intel Streaming SIMD Extension (SSE)
SS format instructions, IBM 360, K-85 to K-88
ssj_ops, see Server side Java operations per second (ssj_ops)
SSP, see Single-Streaming Processor (SSP)
Stack architecture
  and compiler technology, A-27
  flaws vs. success, A-44 to A-45
  historical background, L-16 to L-17
  Intel 80x86, K-48, K-52, K-54
  operands, A-3 to A-4
Stack frame, VAX, K-71
Stack pointer, VAX, K-71
Stack or Thread Local Storage, definition, 292
Stale copy, cache coherency, 112
Stall cycles
  advanced directory protocol case study, 424
  average memory access time, B-17
  branch hazards, C-21
  branch scheme performance, C-25


Stall cycles (continued)definition, B-4 to B-5example calculation, B-31loop unrolling, 161MIPS FP pipeline performance,

C-60miss rate calculation, B-6out-of-order processors, B-20 to

B-21performance equations, B-22pipeline performance, C-12 to

C-13single-chip multicore

multiprocessor case study, 414–418

structural hazards, C-15Stalls

AMD Opteron data cache, B-15ARM Cortex-A8, 235, 235–236branch hazards, C-42data hazard minimization, C-16 to

C-19, C-18data hazards requiring, C-19 to

C-21delayed branch, C-65Intel Core i7, 239–241microarchitectural techniques case

study, 252MIPS FP pipeline performance,

C-60 to C-61, C-61 to C-62

MIPS pipeline multicycle operations, C-51

MIPS R4000, C-64, C-67, C-67 to C-69, C-69

miss rate calculations, B-31 to B-32

necessity, C-21nonblocking cache, 84pipeline performance, C-12 to

C-13from RAW hazards, FP code, C-55structural hazard, C-15VLIW sample code, 252VMIPS, 268

Standardization, commercial interconnection networks, F-63 to F-64
Stardent-1500, Livermore Fortran kernels, 331
Start-up overhead, vs. peak performance, 331
Start-up time
  DAXPY on VMIPS, G-20
  memory banks, 276
  page size selection, B-47
  peak performance, 331
  vector architectures, 331, G-4, G-4, G-8
  vector convoys, G-4
  vector execution time, 270–271
  vector performance, G-2
  vector performance measures, G-16
  vector processor, G-7 to G-9, G-25
  VMIPS, G-5

State transition diagram
  directory vs. cache, 385
  directory-based cache coherence, 383
Statically based exploitation, ILP, H-2
Static power
  basic equation, 26
  SMT, 231
Static random-access memory (SRAM)
  characteristics, 97–98
  dependability, 104
  fault detection pitfalls, 58
  power, 26
  vector memory systems, G-9
  vector processor, G-25
  yield, 32
Static scheduling
  definition, C-71
  ILP, 192–196
  and unoptimized code, C-81

Sticky bit, J-18
Stop & Go, see Xon/Xoff
Storage area networks
  dependability benchmarks, D-21 to D-23, D-22
  historical overview, F-102 to F-103
  I/O system as black box, D-23
Storage systems

  asynchronous I/O and OSes, D-35
  Berkeley’s Tertiary Disk project, D-12
  block servers vs. filers, D-34 to D-35
  bus replacement, D-34
  component failure, D-43
  computer system availability, D-43 to D-44, D-44
  dependability benchmarks, D-21 to D-23
  dirty bits, D-61 to D-64
  disk array deconstruction case study, D-51 to D-55, D-52 to D-55
  disk arrays, D-6 to D-10
  disk deconstruction case study, D-48 to D-51, D-50
  disk power, D-5
  disk seeks, D-45 to D-47
  disk storage, D-2 to D-5
  file system benchmarking, D-20, D-20 to D-21
  Internet Archive Cluster, see Internet Archive Cluster
  I/O performance, D-15 to D-16
  I/O subsystem design, D-59 to D-61
  I/O system design/evaluation, D-36 to D-37
  mail server benchmarking, D-20 to D-21
  NetApp FAS6000 filer, D-41 to D-42
  operator dependability, D-13 to D-15
  OS-scheduled disk access, D-44 to D-45, D-45
  point-to-point links, D-34, D-34
  queue I/O request calculations, D-29
  queuing theory, D-23 to D-34
  RAID performance prediction, D-57 to D-59
  RAID reconstruction case study, D-55 to D-57
  real faults and failures, D-6 to D-10
  reliability, D-44
  response time restrictions for benchmarks, D-18
  seek distance comparison, D-47
  seek time vs. distance, D-46
  server utilization calculation, D-28 to D-29
  sorting case study, D-64 to D-67
  Tandem Computers, D-12 to D-13
  throughput vs. response time, D-16, D-16 to D-18, D-17
  TP benchmarks, D-18 to D-19



  transactions components, D-17
  web server benchmarking, D-20 to D-21
  WSC vs. datacenter costs, 455
  WSCs, 442–443
Store conditional
  locks via coherence, 391
  synchronization, 388–389

Store-and-forward packet switching, F-51
Store instructions, see also Load-store instruction set architecture
  definition, C-4
  instruction execution, 186
  ISA, 11, A-3
  MIPS, A-33, A-36
  NVIDIA GPU ISA, 298
  Opteron data cache, B-15
  RISC instruction set, C-4 to C-6, C-10
  vector architectures, 310
Streaming Multiprocessor
  definition, 292, 313–314
  Fermi GPU, 307

Strecker, William, K-65
Strided accesses
  Multimedia SIMD Extensions, 283
  Roofline model, 287
  TLB interaction, 323
Strided addressing, see also Unit stride addressing
  multimedia instruction compiler support, A-31 to A-32
Strides
  gather-scatter, 280
  highly parallel memory systems, 133
  multidimensional arrays in vector architectures, 278–279
  NVIDIA GPU ISA, 300
  vector memory systems, G-10 to G-11
  VMIPS, 266

String operations, Intel 80x86, K-51, K-53
Stripe, disk array deconstruction, D-51
Striping
  disk arrays, D-6
  RAID, D-9
Strip-Mined Vector Loop
  convoys, G-5
  DAXPY on VMIPS, G-20
  definition, 292
  multidimensional arrays, 278
  Thread Block comparison, 294
  vector-length registers, 274
Strip mining
  DAXPY on VMIPS, G-20
  GPU conditional branching, 303
  GPUs vs. vector architectures, 311
  NVIDIA GPU, 291
  vector, 275
  VLRs, 274–275

Strong scaling, Amdahl’s law and parallel computers, 407
Structural hazards
  basic considerations, C-13 to C-16
  definition, C-11
  MIPS pipeline, C-71
  MIPS scoreboarding, C-78 to C-79
  pipeline stall, C-15
  vector execution time, 268–269
Structural stalls, MIPS R4000 pipeline, C-68 to C-69
Subset property, and inclusion, 397
Summary overflow condition code, PowerPC, K-10 to K-11
Sun Microsystems
  cache optimization, B-38
  fault detection pitfalls, 58
  memory dependability, 104

Sun Microsystems Enterprise, L-60
Sun Microsystems Niagara (T1/T2) processors
  characteristics, 227
  CPI and IPC, 399
  fine-grained multithreading, 224, 225, 226–229
  manufacturing cost, 62
  multicore processor performance, 400–401
  multiprocessing/multithreading-based performance, 398–400
  multithreading history, L-34
  T1 multithreading unicore performance, 227–229
Sun Microsystems SPARC
  addressing modes, K-5
  ALU operands, A-6
  arithmetic/logical instructions, K-11, K-31
  branch conditions, A-19
  conditional branches, K-10, K-17
  conditional instructions, H-27
  constant extension, K-9
  conventions, K-13
  data transfer instructions, K-10
  fast traps, K-30
  features, K-44
  FP instructions, K-23
  instruction list, K-31 to K-32
  integer arithmetic, J-12
  integer overflow, J-11
  ISA, A-2
  LISP, K-30
  MIPS core extensions, K-22 to K-23
  overlapped integer/FP operations, K-31
  precise exceptions, C-60
  register windows, K-29 to K-30
  RISC history, L-20
  as RISC system, K-4
  Smalltalk, K-30
  synchronization history, L-64
  unique instructions, K-29 to K-32

Sun Microsystems SPARCCenter, L-60
Sun Microsystems SPARCstation-2, F-88
Sun Microsystems SPARCstation-20, F-88
Sun Microsystems SPARC V8, floating-point precisions, J-33
Sun Microsystems SPARC VIS
  characteristics, K-18
  multimedia support, E-11, K-18
Sun Microsystems Ultra 5, SPECfp2000 execution times, 43
Sun Microsystems UltraSPARC, L-62, L-73
Sun Microsystems UltraSPARC T1 processor, characteristics, F-73

Sun Modular Datacenter, L-74 to L-75
Superblock scheduling
  basic process, H-21 to H-23
  compiler history, L-31
  example, H-22
Supercomputers
  commercial interconnection networks, F-63
  direct network topology, F-37



Supercomputers (continued)
  low-dimensional topologies, F-100
  SAN characteristics, F-76
  SIMD, development, L-43 to L-44
  vs. WSCs, 8
Superlinear performance, multiprocessors, 406
Superpipelining
  definition, C-61
  performance histories, 20
Superscalar processors
  coining of term, L-29
  ideal processors, 214–215
  ILP, 192–197, 246
  studies, L-32
  microarchitectural techniques case study, 250–251
  multithreading support, 225
  recent advances, L-33 to L-34
  register renaming code, 251
  rename table and register substitution logic, 251
  SMT, 230–232
  VMIPS, 267
Superscalar registers, sample renaming code, 251
Supervisor process, virtual memory protection, 106

Sussenguth, Ed, L-28
Sutherland, Ivan, L-34
SVM, see Secure Virtual Machine (SVM)
Swap procedure, VAX
  code example, K-72, K-74
  full procedure, K-75 to K-76
  overview, K-72 to K-76
  register allocation, K-72
  register preservation, B-74 to B-75

Swim, data cache misses, B-10
Switched-media networks
  basic characteristics, F-24
  vs. buses, F-2
  effective bandwidth vs. nodes, F-28
  example, F-22
  latency and effective bandwidth, F-26 to F-28
  vs. shared-media networks, F-24 to F-25
Switched networks
  centralized, F-30 to F-34
  DOR, F-46
  OCN history, F-104
  topology, F-40

Switches
  array, WSCs, 443–444
  Benes networks, F-33
  context, 307, B-49
  early LANs and WANs, F-29
  Ethernet switches, 16, 20, 53, 441–444, 464–465, 469
  interconnecting node calculations, F-35
  vs. NIC, F-85 to F-86, F-86
  process switch, 224, B-37, B-49 to B-50
  storage systems, D-34
  switched-media networks, F-24
  WSC hierarchy, 441–442, 442
  WSC infrastructure, 446
  WSC network bottleneck, 461
Switch fabric, switched-media networks, F-24

Switching
  commercial interconnection networks, F-56
  interconnection networks, F-22, F-27, F-50 to F-52
  network impact, F-52 to F-55
  performance considerations, F-92 to F-93
  SAN characteristics, F-76
  switched-media networks, F-24
  system area network history, F-100
Switch microarchitecture
  basic microarchitecture, F-55 to F-58
  buffer organizations, F-58 to F-60
  enhancements, F-62
  HOL blocking, F-59
  input-output-buffered switch, F-57
  pipelining, F-60 to F-61, F-61
Switch ports
  centralized switched networks, F-30
  interconnection network topology, F-29
Switch statements

  control flow instruction addressing modes, A-18
  GPU, 301
Syllable, IA-64, H-35
Symbolic loop unrolling, software pipelining, H-12 to H-15, H-13
Symmetric multiprocessors (SMP)
  characteristics, I-45
  communication calculations, 350
  directory-based cache coherence, 354
  first vector computers, L-47, L-49
  limitations, 363–364
  snooping coherence protocols, 354–355
  system area network history, F-101
  TLP, 345

Symmetric shared-memory multiprocessors, see also Centralized shared-memory multiprocessors
  data caching, 351–352
  limitations, 363–364
  performance
    commercial workload, 367–369
    commercial workload measurement, 369–374
    multiprogramming and OS workload, 374–378
    overview, 366–367
    scientific workloads, I-21 to I-26, I-23 to I-25

Synapse N + 1, L-59
Synchronization
  AltaVista search, 369
  basic considerations, 386–387
  basic hardware primitives, 387–389
  consistency models, 395–396
  cost, 403
  Cray X1, G-23
  definition, 375
  GPU comparisons, 329
  GPU conditional branching, 300–303
  historical background, L-64
  large-scale multiprocessors
    barrier synchronization, I-13 to I-16, I-14, I-16
    challenges, I-12 to I-16
    hardware primitives, I-18 to I-21
    sense-reversing barrier, I-21
    software implementations, I-17 to I-18
    tree-based barriers, I-19
  locks via coherence, 389–391



  message-passing communication, I-5
  MIMD, 10
  MIPS core extensions, K-21
  programmer’s viewpoint, 393–394
  PTX instruction set, 298–299
  relaxed consistency models, 394–395
  single-chip multicore processor case study, 412–418
  vector vs. GPU, 311
  VLIW, 196
  WSCs, 434

Synchronous dynamic random-access memory (SDRAM)
  ARM Cortex-A8, 117
  DRAM, 99
  vs. Flash memory, 103
  IBM Blue Gene/L, I-42
  Intel Core i7, 121
  performance, 100
  power consumption, 102, 103
  SDRAM timing diagram, 139
Synchronous event, exception requirements, C-44 to C-45
Synchronous I/O, definition, D-35
Synonyms
  address translation, B-38
  dependability, 34
Synthetic benchmarks
  definition, 37
  typical program fallacy, A-43
System area networks, historical overview, F-100 to F-102

System calls
  CUDA Thread, 297
  multiprogrammed workload, 378
  virtualization/paravirtualization performance, 141
  virtual memory protection, 106
System interface controller (SIF), Intel SCCC, F-70

System-on-chip (SoC)
  cell phone, E-24
  cross-company interoperability, F-64
  embedded systems, E-3
  Sanyo digital cameras, E-20
  Sanyo VPC-SX500 digital camera, E-19
  shared-media networks, F-23
System Performance and Evaluation Cooperative (SPEC), see SPEC benchmarks
System Processor
  definition, 309
  DLP, 262, 322
  Fermi GPU, 306
  GPU issues, 330
  GPU programming, 288–289
  NVIDIA GPU ISA, 298
  NVIDIA GPU Memory, 305
  processor comparisons, 323–324
  synchronization, 329
  vector vs. GPU, 311–312

System response time, transactions, D-16, D-17
Systems on a chip (SOC), cost trends, 28
System/storage area networks (SANs)
  characteristics, F-3 to F-4
  communication protocols, F-8
  congestion management, F-65
  cross-company interoperability, F-64
  effective bandwidth, F-18
  example system, F-72 to F-74
  fat trees, F-34
  fault tolerance, F-67
  InfiniBand example, F-74 to F-77
  interconnection network domain relationship, F-4
  LAN history, F-99
  latency and effective bandwidth, F-26 to F-28
  latency vs. nodes, F-27
  packet latency, F-13, F-14 to F-16
  routing algorithms, F-48
  software overhead, F-91
  TCP/IP reliance, F-95
  time of flight, F-13
  topology, F-30
System Virtual Machines, definition, 107

T
Tag
  AMD Opteron data cache, B-12 to B-14
  ARM Cortex-A8, 115
  cache optimization, 79–80
  dynamic scheduling, 177
  invalidate protocols, 357
  memory hierarchy basics, 74, 77–78
  virtual memory fast address translation, B-46
  write strategy, B-10

Tag check (TC)
  MIPS R4000, C-63
  R4000 pipeline, B-62 to B-63
  R4000 pipeline structure, C-63
  write process, B-10
Tag fields
  block identification, B-8
  dynamic scheduling, 173, 175
Tail duplication, superblock scheduling, H-21
Tailgating, definition, G-20
Tandem Computers
  cluster history, L-62, L-72
  faults, D-14
  overview, D-12 to D-13

Target address
  branch hazards, C-21, C-42
  branch penalty reduction, C-22 to C-23
  branch-target buffer, 206
  control flow instructions, A-17 to A-18
  GPU conditional branching, 301
  Intel Core i7 branch predictor, 166
  MIPS control flow instructions, A-38
  MIPS implementation, C-32
  MIPS pipeline, C-36, C-37
  MIPS R4000, C-25
  pipeline branches, C-39
  RISC instruction set, C-5
Target channel adapters (TCAs), switch vs. NIC, F-86
Target instructions
  branch delay slot scheduling, C-24
  as branch-target buffer variation, 206
  GPU conditional branching, 301
Task-level parallelism (TLP), definition, 9

TB, see Translation buffer (TB)
TB-80 VME rack
  example, D-38
  MTTF calculation, D-40 to D-41
TC, see Tag check (TC)
TCAs, see Target channel adapters (TCAs)



TCO, see Total Cost of Ownership (TCO)

TCP, see Transmission Control Protocol (TCP)

TCP/IP, see Transmission Control Protocol/Internet Protocol (TCP/IP)

TDMA, see Time division multiple access (TDMA)

TDP, see Thermal design power (TDP)

Technology trends
  basic considerations, 17–18
  performance, 18–19
Teleconferencing, multimedia support, K-17
Temporal locality
  blocking, 89–90
  cache optimization, B-26
  coining of term, L-11
  definition, 45, B-2
  memory hierarchy design, 72
TERA processor, L-34
Terminate events
  exceptions, C-45 to C-46
  hardware-based speculation, 188
  loop unrolling, 161

Tertiary Disk project
  failure statistics, D-13
  overview, D-12
  system log, D-43
Test-and-set operation, synchronization, 388
Texas Instruments 8847
  arithmetic functions, J-58 to J-61
  chip comparison, J-58
  chip layout, J-59

Texas Instruments ASC
  first vector computers, L-44
  peak performance vs. start-up overhead, 331
TFLOPS, parallel processing debates, L-57 to L-58
TFT, see Thin-film transistor (TFT)
Thacker, Chuck, F-99
Thermal design power (TDP), power trends, 22
Thin-film transistor (TFT), Sanyo VPC-SX500 digital camera, E-19
Thinking Machines, L-44, L-56
Thinking Machines CM-5, L-60
Think time, transactions, D-16, D-17
Third-level caches, see also L3 caches
  ILP, 245
  interconnection network, F-87
  SRAM, 98–99

Thrash, memory hierarchy, B-25
Thread Block
  CUDA Threads, 297, 300, 303
  definition, 292, 313
  Fermi GTX 480 GPU floorplan, 295
  function, 294
  GPU hardware levels, 296
  GPU Memory performance, 332
  GPU programming, 289–290
  Grid mapping, 293
  mapping example, 293
  multithreaded SIMD Processor, 294
  NVIDIA GPU computational structures, 291
  NVIDIA GPU Memory structures, 304
  PTX Instructions, 298
Thread Block Scheduler
  definition, 292, 309, 313–314
  Fermi GTX 480 GPU floorplan, 295
  function, 294, 311
  GPU, 296
  Grid mapping, 293
  multithreaded SIMD Processor, 294

Thread-level parallelism (TLP)
  advanced directory protocol case study, 420–426
  Amdahl’s law and parallel computers, 406–407
  centralized shared-memory multiprocessors
    basic considerations, 351–352
    cache coherence, 352–353
    cache coherence enforcement, 354–355
    cache coherence example, 357–362
    cache coherence extensions, 362–363
    invalidate protocol implementation, 356–357
    SMP and snooping limitations, 363–364
    snooping coherence implementation, 365–366
    snooping coherence protocols, 355–356
  definition, 9
  directory-based cache coherence
    case study, 418–420
    protocol basics, 380–382
    protocol example, 382–386
  DSM and directory-based coherence, 378–380
  embedded systems, E-15
  IBM Power7, 215
  from ILP, 4–5
  inclusion, 397–398
  Intel Core i7 performance/energy efficiency, 401–405
  memory consistency models
    basic considerations, 392–393
    compiler optimization, 396
    programming viewpoint, 393–394
    relaxed consistency models, 394–395
    speculation to hide latency, 396–397
  MIMDs, 344–345
  multicore processor performance, 400–401
  multicore processors and SMT, 404–405
  multiprocessing/multithreading-based performance, 398–400
  multiprocessor architecture, 346–348
  multiprocessor cost effectiveness, 407
  multiprocessor performance, 405–406
  multiprocessor software development, 407–409
  vs. multithreading, 223–224
  multithreading history, L-34 to L-35
  parallel processing challenges, 349–351
  single-chip multicore processor case study, 412–418
  Sun T1 multithreading, 226–229
  symmetric shared-memory multiprocessor performance
    commercial workload, 367–369
    commercial workload measurement, 369–374



    multiprogramming and OS workload, 374–378
    overview, 366–367
  synchronization
    basic considerations, 386–387
    basic hardware primitives, 387–389
    locks via coherence, 389–391
Thread Processor
  definition, 292, 314
  GPU, 315
Thread Processor Registers, definition, 292
Thread Scheduler in a Multithreaded CPU, definition, 292
Thread of SIMD Instructions
  characteristics, 295–296
  CUDA Thread, 303
  definition, 292, 313
  Grid mapping, 293
  lane recognition, 300
  scheduling example, 297
  terminology comparison, 314
  vector/GPU comparison, 308–309
Thread of Vector Instructions, definition, 292

Three-dimensional space, direct networks, F-38
Three-level cache hierarchy
  commercial workloads, 368
  ILP, 245
  Intel Core i7, 118, 118
Throttling, packets, F-10
Throughput, see also Bandwidth
  definition, C-3, F-13
  disk storage, D-4
  Google WSC, 470
  ILP, 245
  instruction fetch bandwidth, 202
  Intel Core i7, 236–237
  kernel characteristics, 327
  memory banks, 276
  multiple lanes, 271
  parallelism, 44
  performance considerations, 36
  performance trends, 18–19
  pipelining basics, C-10
  precise exceptions, C-60
  producer-server model, D-16
  vs. response time, D-17
  routing comparison, F-54
  server benchmarks, 40–41
  servers, 7
  storage systems, D-16 to D-18
  uniprocessors, TLP
    basic considerations, 223–226
    fine-grained multithreading on Sun T1, 226–229
    superscalar SMT, 230–232
  and virtual channels, F-93
  WSCs, 434

Ticks
  cache coherence, 391
  processor performance equation, 48–49
Tilera TILE-Gx processors, OCNs, F-3
Time-cost relationship, components, 27–28
Time division multiple access (TDMA), cell phones, E-25
Time of flight
  communication latency, I-3 to I-4
  interconnection networks, F-13

Timing independent, L-17 to L-18
TI TMS320C6x DSP
  architecture, E-9
  characteristics, E-8 to E-10
  instruction packet, E-10
TI TMS320C55 DSP
  architecture, E-7
  characteristics, E-7 to E-8
  data operands, E-6
TLB, see Translation lookaside buffer (TLB)
TLP, see Task-level parallelism (TLP); Thread-level parallelism (TLP)
Tomasulo’s algorithm
  advantages, 177–178
  dynamic scheduling, 170–176
  FP unit, 185
  loop-based example, 179, 181–183
  MIPS FP unit, 173
  register renaming vs. ROB, 209
  step details, 178, 180

TOP500, L-58
Top Of Stack (TOS) register, ISA operands, A-4
Topology
  Benes networks, F-33
  centralized switched networks, F-30 to F-34, F-31
  definition, F-29
  direct networks, F-37
  distributed switched networks, F-34 to F-40
  interconnection networks, F-21 to F-22, F-44
    basic considerations, F-29 to F-30
    fault tolerance, F-67
    network performance and cost, F-40
    network performance effects, F-40 to F-44
  rings, F-36
  routing/arbitration/switching impact, F-52
  system area network history, F-100 to F-101
Torus networks

  characteristics, F-36
  commercial interconnection networks, F-63
  direct networks, F-37
  fault tolerance, F-67
  IBM Blue Gene/L, F-72 to F-74
  NEWS communication, F-43
  routing comparison, F-54
  system area network history, F-102

TOS, see Top Of Stack (TOS) register
Total Cost of Ownership (TCO), WSC case study, 476–479
Total store ordering, relaxed consistency models, 395
Tournament predictors
  early schemes, L-27 to L-28
  ILP for realizable processors, 216
  local/global predictor combinations, 164–166
Toy programs, performance benchmarks, 37
TP, see Transaction-processing (TP)
TPC, see Transaction Processing Council (TPC)
Trace compaction, basic process, H-19
Trace scheduling
  basic approach, H-19 to H-21
  overview, H-20
Trace selection, definition, H-19
Tradebeans benchmark, SMT on superscalar processors, 230
Traffic intensity, queuing theory, D-25



Trailer
  messages, F-6
  packet format, F-7
Transaction components, D-16, D-17, I-38 to I-39
Transaction-processing (TP)
  server benchmarks, 41
  storage system benchmarks, D-18 to D-19
Transaction Processing Council (TPC)

  benchmarks overview, D-18 to D-19, D-19
  parallelism, 44
  performance results reporting, 41
  server benchmarks, 41
  TPC-B, shared-memory workloads, 368
  TPC-C
    file system benchmarking, D-20
    IBM eServer p5 processor, 409
    multiprocessing/multithreading-based performance, 398
    multiprocessor cost effectiveness, 407
    single vs. multiple thread executions, 228
    Sun T1 multithreading unicore performance, 227–229, 229
    WSC services, 441
  TPC-D, shared-memory workloads, 368–369
  TPC-E, shared-memory workloads, 368–369
Transfers, see also Data transfers
  as early control flow instruction definition, A-16

Transforms, DSP, E-5
Transient failure, commercial interconnection networks, F-66
Transient faults, storage systems, D-11
Transistors
  clock rate considerations, 244
  dependability, 33–36
  energy and power, 23–26
  ILP, 245
  performance scaling, 19–21
  processor comparisons, 324
  processor trends, 2
  RISC instructions, A-3
  shrinking, 55
  static power, 26
  technology trends, 17–18
Translation buffer (TB)
  virtual memory block identification, B-45
  virtual memory fast address translation, B-46
Translation lookaside buffer (TLB)

  address translation, B-39
  AMD64 paged virtual memory, B-56 to B-57
  ARM Cortex-A8, 114–115
  cache optimization, 80, B-37
  coining of term, L-9
  Intel Core i7, 118, 120–121
  interconnection network protection, F-86
  memory hierarchy, B-48 to B-49
  memory hierarchy basics, 78
  MIPS64 instructions, K-27
  Opteron, B-47
  Opteron memory hierarchy, B-57
  RISC code size, A-23
  shared-memory workloads, 369–370
  speculation advantages/disadvantages, 210–211
  strided access interactions, 323
  Virtual Machines, 110
  virtual memory block identification, B-45
  virtual memory fast address translation, B-46
  virtual memory page size selection, B-47
  virtual memory protection, 106–107
Transmission Control Protocol (TCP), congestion management, F-65
Transmission Control Protocol/Internet Protocol (TCP/IP)
  ATM, F-79
  headers, F-84
  internetworking, F-81, F-83 to F-84, F-89
  reliance on, F-95
  WAN history, F-98

Transmission speed, interconnection network performance, F-13
Transmission time
  communication latency, I-3 to I-4
  time of flight, F-13 to F-14
Transport latency
  time of flight, F-14
  topology, F-35 to F-36
Transport layer, definition, F-82
Transputer, F-100
Tree-based barrier, large-scale multiprocessor synchronization, I-19
Tree height reduction, definition, H-11
Trees, MINs with nonblocking, F-34
Trellis codes, definition, E-7
TRIPS Edge processor, F-63
  characteristics, F-73
Trojan horses
  definition, B-51
  segmented virtual memory, B-53

True dependence
  finding, H-7 to H-8
  loop-level parallelism calculations, 320
  vs. name dependence, 153
True sharing misses
  commercial workloads, 371, 373
  definition, 366–367
  multiprogramming workloads, 377
True speedup, multiprocessor performance, 406
TSMC, Stratton, F-3
TSS operating system, L-9
Turbo mode
  hardware enhancements, 56
  microprocessors, 26

Turing, Alan, L-4, L-19
Turn Model routing algorithm, example calculations, F-47 to F-48
Two-level branch predictors
  branch costs, 163
  Intel Core i7, 166
  tournament predictors, 165
Two-level cache hierarchy
  cache optimization, B-31
  ILP, 245
Two’s complement, J-7 to J-8
Two-way conflict misses, definition, B-23



Two-way set associativity
  ARM Cortex-A8, 233
  cache block placement, B-7, B-8
  cache miss rates, B-24
  cache miss rates vs. size, B-33
  cache optimization, B-38
  cache organization calculations, B-19 to B-20
  commercial workload, 370–373, 371
  multiprogramming workload, 374–375
  nonblocking cache, 84
  Opteron data cache, B-13 to B-14
  2:1 cache rule of thumb, B-29
  virtual to cache access scenario, B-39
TX-2, L-34, L-49
“Typical” program, instruction set considerations, A-43

U
U, see Rack units (U)
Ultrix, DECstation 5000 reboots, F-69
UMA, see Uniform memory access (UMA)
Unbiased exponent, J-15
Uncached state, directory-based cache coherence protocol basics, 380, 384–386

Unconditional branches
  branch folding, 206
  branch-prediction schemes, C-25 to C-26
  VAX, K-71
Underflow
  floating-point arithmetic, J-36 to J-37, J-62
  gradual, J-15
Unicasting, shared-media networks, F-24
Unicode character
  MIPS data types, A-34
  operand sizes/types, 12
  popularity, A-14
Unified cache
  AMD Opteron example, B-15
  performance, B-16 to B-17

Uniform memory access (UMA)
  multicore single-chip multiprocessor, 364
  SMP, 346–348
Uninterruptible instruction
  hardware primitives, 388
  synchronization, 386
Uninterruptible power supply (UPS)
  Google WSC, 467
  WSC calculations, 435
  WSC infrastructure, 447

Uniprocessors
  cache protocols, 359
  development views, 344
  linear speedups, 407
  memory hierarchy design, 73
  memory system coherency, 353, 358
  misses, 371, 373
  multiprogramming workload, 376–377
  multithreading
    basic considerations, 223–226
    fine-grained on T1, 226–229
    simultaneous, on superscalars, 230–232
  parallel vs. sequential programs, 405–406
  processor performance trends, 3–4, 344
  SISD, 10
  software development, 407–408

Unit stride addressing
  gather-scatter, 280
  GPU vs. MIMD with Multimedia SIMD, 327
  GPUs vs. vector architectures, 310
  multimedia instruction compiler support, A-31
  NVIDIA GPU ISA, 300
  Roofline model, 287
UNIVAC I, L-5
UNIX systems
  architecture costs, 2
  block servers vs. filers, D-35
  cache optimization, B-38
  floating point remainder, J-32
  miss statistics, B-59
  multiprocessor software development, 408
  multiprogramming workload, 374
  seek distance comparison, D-47
  vector processor history, G-26
Unpacked decimal, A-14, J-16
Unshielded twisted pair (UTP), LAN history, F-99

Up*/down* routing
  definition, F-48
  fault tolerance, F-67
UPS, see Uninterruptible power supply (UPS)
USB, Sony PlayStation 2 Emotion Engine case study, E-15
Use bit
  address translation, B-46
  segmented virtual memory, B-52
  virtual memory block replacement, B-45
User-level communication, definition, F-8
User maskable events, definition, C-45 to C-46
User nonmaskable events, definition, C-45
User-requested events, exception requirements, C-45
Utility computing, 455–461, L-73 to L-74
Utilization
  I/O system calculations, D-26
  queuing theory, D-25
UTP, see Unshielded twisted pair (UTP)

V
Valid bit
  address translation, B-46
  block identification, B-7
  Opteron data cache, B-14
  paged virtual memory, B-56
  segmented virtual memory, B-52
  snooping, 357
  symmetric shared-memory multiprocessors, 366
Value prediction
  definition, 202
  hardware-based speculation, 192
  ILP, 212–213, 220
  speculation, 208

VAPI, InfiniBand, F-77
Variable length encoding
  control flow instruction branches, A-18
  instruction sets, A-22
  ISAs, 14
Variables
  and compiler technology, A-27 to A-29



Variables (continued)
  CUDA, 289
  Fermi GPU, 306
  ISA, A-5, A-12
  locks via coherence, 389
  loop-level parallelism, 316
  memory consistency, 392
  NVIDIA GPU Memory, 304–305
  procedure invocation options, A-19
  random, distribution, D-26 to D-34
  register allocation, A-26 to A-27
  in registers, A-5
  synchronization, 375
  TLP programmer’s viewpoint, 394

VCs, see Virtual channels (VCs)
Vector architectures
  computer development, L-44 to L-49
  definition, 9
  DLP
    basic considerations, 264
    definition terms, 309
    gather/scatter operations, 279–280
    multidimensional arrays, 278–279
    multiple lanes, 271–273
    programming, 280–282
    vector execution time, 268–271
    vector-length registers, 274–275
    vector load/store unit bandwidth, 276–277
    vector-mask registers, 275–276
    vector processor example, 267–268
    VMIPS, 264–267
  GPU conditional branching, 303
  vs. GPUs, 308–312
  mapping examples, 293
  memory systems, G-9 to G-11
  multimedia instruction compiler support, A-31
  vs. Multimedia SIMD Extensions, 282
  peak performance vs. start-up overhead, 331
  power/DLP issues, 322
  vs. scalar performance, 331–332
  start-up latency and dead time, G-8
  strided access-TLB interactions, 323

  vector-register characteristics, G-3
Vector Functional Unit
  vector add instruction, 272–273
  vector execution time, 269
  vector sequence chimes, 270
  VMIPS, 264
Vector Instruction
  definition, 292, 309
  DLP, 322
  Fermi GPU, 305
  gather-scatter, 280
  instruction-level parallelism, 150
  mask registers, 275–276
  Multimedia SIMD Extensions, 282
  multiple lanes, 271–273
  Thread of Vector Instructions, 292
  vector execution time, 269
  vector vs. GPU, 308, 311
  vector processor example, 268
  VMIPS, 265–267, 266

Vectorizable Loop
  characteristics, 268
  definition, 268, 292, 313
  Grid mapping, 293
  Livermore Fortran kernel performance, 331
  mapping example, 293
  NVIDIA GPU computational structures, 291
Vectorized code
  multimedia compiler support, A-31
  vector architecture programming, 280–282
  vector execution time, 271
  VMIPS, 268
Vectorized Loop, see also Body of Vectorized Loop
  definition, 309
  GPU Memory structure, 304
  vs. Grid, 291, 308
  mask registers, 275
  NVIDIA GPU, 295
  vector vs. GPU, 308
Vectorizing compilers
  effectiveness, G-14 to G-15
  FORTRAN test kernels, G-15
  sparse matrices, G-12 to G-13

Vector Lane Registers, definition, 292
Vector Lanes
  control processor, 311
  definition, 292, 309
  SIMD Processor, 296–297, 297
Vector-length register (VLR)
  basic operation, 274–275
  performance, G-5
  VMIPS, 267
Vector load/store unit
  memory banks, 276–277
  VMIPS, 265
Vector loops
  NVIDIA GPU, 294
  processor example, 267
  strip-mining, 303
  vector vs. GPU, 311
  vector-length registers, 274–275
  vector-mask registers, 275–276
Vector-mask control, characteristics, 275–276
Vector-mask registers
  basic operation, 275–276
  Cray X1, G-21 to G-22
  VMIPS, 267

Vector Processor
  caches, 305
  compiler vectorization, 281
  Cray X1
    MSP modules, G-22
    overview, G-21 to G-23
  Cray X1E, G-24
  definition, 292, 309
  DLP processors, 322
  DSP media extensions, E-10
  example, 267–268
  execution time, G-7
  functional units, 272
  gather-scatter, 280
  vs. GPUs, 276
  historical background, G-26
  loop-level parallelism, 150
  loop unrolling, 196
  measures, G-15 to G-16
  memory banks, 277
  and multiple lanes, 273, 310
  multiprocessor architecture, 346
  NVIDIA GPU computational structures, 291
  overview, G-25 to G-26
  peak performance focus, 331
  performance, G-2 to G-7
    start-up and multiple lanes, G-7 to G-9
  performance comparison, 58
  performance enhancement
    chaining, G-11 to G-12



DAXPY on VMIPS, G-19 to G-21

sparse matrices, G-12 to G-14PTX, 301Roofline model, 286–287, 287vs. scalar processor, 311, 331, 333,

G-19vs. SIMD Processor, 294–296Sony PlayStation 2 Emotion

Engine, E-17 to E-18start-up overhead, G-4stride, 278strip mining, 275vector execution time, 269–271vector/GPU comparison, 308vector kernel implementation,

334–336VMIPS, 264–265VMIPS on DAXPY, G-17VMIPS on Linpack, G-17 to G-19

Vector Registers
  definition, 309
  execution time, 269, 271
  gather-scatter, 280
  multimedia compiler support, A-31
  Multimedia SIMD Extensions, 282
  multiple lanes, 271–273
  NVIDIA GPU, 297
  NVIDIA GPU ISA, 298
  performance/bandwidth trade-offs, 332
  processor example, 267
  strides, 278–279
  vector vs. GPU, 308, 311
  VMIPS, 264–267, 266

Very-large-scale integration (VLSI)
  early computer arithmetic, J-63
  interconnection network topology, F-29
  RISC history, L-20
  Wallace tree, J-53

Very Long Instruction Word (VLIW)
  clock rates, 244
  compiler scheduling, L-31
  EPIC, L-32
  IA-64, H-33 to H-34
  ILP, 193–196
  loop-level parallelism, 315
  M32R, K-39 to K-40
  multiple-issue processors, 194, L-28 to L-30
  multithreading history, L-34
  sample code, 252
  TI 320C6x DSP, E-8 to E-10

VGA controller, L-51

Video
  Amazon Web Services, 460
  application trends, 4
  PMDs, 6
  WSCs, 8, 432, 437, 439

Video games, multimedia support, K-17

VI interface, L-73

Virtual address
  address translation, B-46
  AMD64 paged virtual memory, B-55
  AMD Opteron data cache, B-12 to B-13
  ARM Cortex-A8, 115
  cache optimization, B-36 to B-39
  GPU conditional branching, 303
  Intel Core i7, 120
  mapping to physical, B-45
  memory hierarchy, B-39, B-48, B-48 to B-49
  memory hierarchy basics, 77–78
  miss rate vs. cache size, B-37
  Opteron mapping, B-55
  Opteron memory management, B-55 to B-56
  and page size, B-58
  page table-based mapping, B-45
  translation, B-36 to B-39
  virtual memory, B-42, B-49

Virtual address space
  example, B-41
  main memory block, B-44

Virtual caches
  definition, B-36 to B-37
  issues with, B-38

Virtual channels (VCs), F-47
  HOL blocking, F-59
  Intel SCCC, F-70
  routing comparison, F-54
  switching, F-51 to F-52
  switch microarchitecture pipelining, F-61
  system area network history, F-101
  and throughput, F-93

Virtual cut-through switching, F-51

Virtual functions, control flow instructions, A-18

Virtualizable architecture
  Intel 80x86 issues, 128
  system call performance, 141
  Virtual Machines support, 109
  VMM implementation, 128–129

Virtualizable GPUs, future technology, 333

Virtual machine monitor (VMM)
  characteristics, 108
  nonvirtualizable ISA, 126, 128–129
  requirements, 108–109
  Virtual Machines ISA support, 109–110
  Xen VM, 111

Virtual Machines (VMs)
  Amazon Web Services, 456–457
  cloud computing costs, 471
  early IBM work, L-10
  ISA support, 109–110
  protection, 107–108
  protection and ISA, 112
  server benchmarks, 40
  and virtual memory and I/O, 110–111
  WSCs, 436
  Xen VM, 111

Virtual memory
  basic considerations, B-40 to B-44, B-48 to B-49
  basic questions, B-44 to B-46
  block identification, B-44 to B-45
  block placement, B-44
  block replacement, B-45
  vs. caches, B-42 to B-43
  classes, B-43
  definition, B-3
  fast address translation, B-46
  Multimedia SIMD Extensions, 284
  multithreading, 224
  paged example, B-54 to B-57
  page size selection, B-46 to B-47
  parameter ranges, B-42
  Pentium vs. Opteron protection, B-57
  protection, 105–107
  segmented example, B-51 to B-54
  strided access-TLB interactions, 323
  terminology, B-42
  Virtual Machines impact, 110–111
  writes, B-45 to B-46

Virtual methods, control flow instructions, A-18


Virtual output queues (VOQs), switch microarchitecture, F-60

VLIW, see Very Long Instruction Word (VLIW)

VLR, see Vector-length register (VLR)

VLSI, see Very-large-scale integration (VLSI)

VMCS, see Virtual Machine Control State (VMCS)

VME rack
  example, D-38
  Internet Archive Cluster, D-37

VMIPS
  basic structure, 265
  DAXPY, G-18 to G-20
  DLP, 265–267
  double-precision FP operations, 266
  enhanced, DAXPY performance, G-19 to G-21
  gather/scatter operations, 280
  ISA components, 264–265
  multidimensional arrays, 278–279
  Multimedia SIMD Extensions, 282
  multiple lanes, 271–272
  peak performance on DAXPY, G-17
  performance, G-4
  performance on Linpack, G-17 to G-19
  sparse matrices, G-13
  start-up penalties, G-5
  vector execution time, 269–270, G-6 to G-7
  vector vs. GPU, 308
  vector-length registers, 274
  vector load/store unit bandwidth, 276
  vector performance measures, G-16
  vector processor example, 267–268
  VLR, 274

VMM, see Virtual machine monitor (VMM)

VMs, see Virtual Machines (VMs)

Voltage regulator controller (VRC), Intel SCCC, F-70

Voltage regulator modules (VRMs), WSC server energy efficiency, 462

Volume-cost relationship, components, 27–28

Von Neumann, John, L-2 to L-6

Von Neumann computer, L-3

Voodoo2, L-51

VOQs, see Virtual output queues (VOQs)

VRC, see Voltage regulator controller (VRC)

VRMs, see Voltage regulator modules (VRMs)

W

Wafers
  example, 31
  integrated circuit cost trends, 28–32

Wafer yield
  chip costs, 32
  definition, 30

Waiting line, definition, D-24

Wait time, shared-media networks, F-23

Wallace tree
  example, J-53, J-53
  historical background, J-63

Wall-clock time
  execution time, 36
  scientific applications on parallel processors, I-33

WANs, see Wide area networks (WANs)

WAR, see Write after read (WAR)

Warehouse-scale computers (WSCs)
  Amazon Web Services, 456–461
  basic concept, 432
  characteristics, 8
  cloud computing, 455–461
  cloud computing providers, 471–472
  cluster history, L-72 to L-73
  computer architecture
    array switch, 443
    basic considerations, 441–442
    memory hierarchy, 443, 443–446, 444
    storage, 442–443
  as computer class, 5
  computer cluster forerunners, 435–436
  cost-performance, 472–473
  costs, 452–455, 453–454
  definition, 345
  and ECC memory, 473–474
  efficiency measurement, 450–452
  facility capital costs, 472
  Flash memory, 474–475
  Google
    containers, 464–465
    cooling and power, 465–468
    monitoring and repairing, 469–470
    PUE, 468
    server, 467
    servers, 468–469
  MapReduce, 437–438
  network as bottleneck, 461
  physical infrastructure and costs, 446–450
  power modes, 472
  programming models and workloads, 436–441
  query response-time curve, 482
  relaxed consistency, 439
  resource allocation, 478–479
  server energy efficiency, 462–464
  vs. servers, 432–434
  SPECPower benchmarks, 463
  switch hierarchy, 441–442, 442
  TCO case study, 476–478

Warp, L-31
  definition, 292, 313
  terminology comparison, 314

Warp Scheduler
  definition, 292, 314
  Multithreaded SIMD Processor, 294

Wavelength division multiplexing (WDM), WAN history, F-98

WAW, see Write after write (WAW)

Way prediction, cache optimization, 81–82

Way selection, 82

WB, see Write-back cycle (WB)

WCET, see Worst-case execution time (WCET)

WDM, see Wavelength division multiplexing (WDM)

Weak ordering, relaxed consistency models, 395

Weak scaling, Amdahl’s law and parallel computers, 406–407


Web index search, shared-memory workloads, 369

Web servers
  benchmarking, D-20 to D-21
  dependability benchmarks, D-21
  ILP for realizable processors, 218
  performance benchmarks, 40
  WAN history, F-98

Weighted arithmetic mean time, D-27

Weitek 3364
  arithmetic functions, J-58 to J-61
  chip comparison, J-58
  chip layout, J-60

West-first routing, F-47 to F-48

Wet-bulb temperature
  Google WSC, 466
  WSC cooling systems, 449

Whirlwind project, L-4

Wide area networks (WANs)
  ATM, F-79
  characteristics, F-4
  cross-company interoperability, F-64
  effective bandwidth, F-18
  fault tolerance, F-68
  historical overview, F-97 to F-99
  InfiniBand, F-74
  interconnection network domain relationship, F-4
  latency and effective bandwidth, F-26 to F-28
  offload engines, F-8
  packet latency, F-13, F-14 to F-16
  routers/gateways, F-79
  switches, F-29
  switching, F-51
  time of flight, F-13
  topology, F-30

Wilkes, Maurice, L-3

Winchester, L-78

Window
  latency, B-21
  processor performance calculations, 218
  scoreboarding definition, C-78
  TCP/IP headers, F-84

Windowing, congestion management, F-65

Window size
  ILP limitations, 221
  ILP for realizable processors, 216–217
  vs. parallelism, 217

Windows operating systems, see Microsoft Windows

Wireless networks
  basic challenges, E-21
  and cell phones, E-21 to E-22

Wires
  energy and power, 23
  scaling, 19–21

Within instruction exceptions
  definition, C-45
  instruction set complications, C-50
  stopping/restarting execution, C-46

Word count, definition, B-53

Word displacement addressing, VAX, K-67

Word offset, MIPS, C-32

Words
  aligned/misaligned addresses, A-8
  AMD Opteron data cache, B-15
  DSP, E-6
  Intel 80x86, K-50
  memory address interpretation, A-7 to A-8
  MIPS data transfers, A-34
  MIPS data types, A-34
  MIPS unaligned reads, K-26
  operand sizes/types, 12
  as operand type, A-13 to A-14
  VAX, K-70

Working set effect, definition, I-24

Workloads
  execution time, 37
  Google search, 439
  Java and PARSEC without SMT, 403–404
  RAID performance prediction, D-57 to D-59
  symmetric shared-memory multiprocessor performance, 367–374, I-21 to I-26
  WSC goals/requirements, 433
  WSC resource allocation case study, 478–479
  WSCs, 436–441

Wormhole switching, F-51, F-88
  performance issues, F-92 to F-93
  system area network history, F-101

Worst-case execution time (WCET), definition, E-4

Write after read (WAR)
  data hazards, 153–154, 169
  dynamic scheduling with Tomasulo’s algorithm, 170–171
  hazards and forwarding, C-55
  ILP limitation studies, 220
  MIPS scoreboarding, C-72, C-74 to C-75, C-79
  multiple-issue processors, L-28
  register renaming vs. ROB, 208
  ROB, 192
  TI TMS320C55 DSP, E-8
  Tomasulo’s advantages, 177–178
  Tomasulo’s algorithm, 182–183

Write after write (WAW)
  data hazards, 153, 169
  dynamic scheduling with Tomasulo’s algorithm, 170–171
  execution sequences, C-80
  hazards and forwarding, C-55 to C-58
  ILP limitation studies, 220
  microarchitectural techniques case study, 253
  MIPS FP pipeline performance, C-60 to C-61
  MIPS scoreboarding, C-74, C-79
  multiple-issue processors, L-28
  register renaming vs. ROB, 208
  ROB, 192
  Tomasulo’s advantages, 177–178

Write allocate
  AMD Opteron data cache, B-12
  definition, B-11
  example calculation, B-12

Write-back cache
  AMD Opteron example, B-12, B-14
  coherence maintenance, 381
  coherency, 359
  definition, B-11
  directory-based cache coherence, 383, 386
  Flash memory, 474
  FP register file, C-56
  invalidate protocols, 355–357, 360
  memory hierarchy basics, 75
  snooping coherence, 355, 356–357, 359

Write-back cycle (WB)
  basic MIPS pipeline, C-36
  data hazard stall minimization, C-17
  execution sequences, C-80
  hazards and forwarding, C-55 to C-56
  MIPS exceptions, C-49
  MIPS pipeline, C-52
  MIPS pipeline control, C-39
  MIPS R4000, C-63, C-65
  MIPS scoreboarding, C-74
  pipeline branch issues, C-40
  RISC classic pipeline, C-7 to C-8, C-10
  simple MIPS implementation, C-33
  simple RISC implementation, C-6

Write broadcast protocol, definition, 356

Write buffer
  AMD Opteron data cache, B-14
  Intel Core i7, 118, 121
  invalidate protocol, 356
  memory consistency, 393
  memory hierarchy basics, 75
  miss penalty reduction, 87, B-32, B-35 to B-36
  write merging example, 88
  write strategy, B-11

Write hit
  cache coherence, 358
  directory-based coherence, 424
  single-chip multicore multiprocessor, 414
  snooping coherence, 359
  write process, B-11

Write invalidate protocol
  directory-based cache coherence protocol example, 382–383
  example, 359, 360
  implementation, 356–357
  snooping coherence, 355–356

Write merging
  example, 88
  miss penalty reduction, 87

Write miss
  AMD Opteron data cache, B-12, B-14
  cache coherence, 358, 359, 360, 361
  definition, 385
  directory-based cache coherence, 380–383, 385–386
  example calculation, B-12
  locks via coherence, 390
  memory hierarchy basics, 76–77
  memory stall clock cycles, B-4
  Opteron data cache, B-12, B-14
  snooping cache coherence, 365
  write process, B-11 to B-12
  write speed calculations, 393

Write result stage
  data hazards, 154
  dynamic scheduling, 174–175
  hardware-based speculation, 192
  instruction steps, 175
  ROB instruction, 186
  scoreboarding, C-74 to C-75, C-78 to C-80
  status table examples, C-77
  Tomasulo’s algorithm, 178, 180, 190

Write serialization
  hardware primitives, 387
  multiprocessor cache coherency, 353
  snooping coherence, 356

Write stall, definition, B-11

Write strategy
  memory hierarchy considerations, B-6, B-10 to B-12
  virtual memory, B-45 to B-46

Write-through cache
  average memory access time, B-16
  coherency, 352
  invalidate protocol, 356
  memory hierarchy basics, 74–75
  miss penalties, B-32
  optimization, B-35
  snooping coherence, 359
  write process, B-11 to B-12

Write update protocol, definition, 356

WSCs, see Warehouse-scale computers (WSCs)

X

XBox, L-51

Xen Virtual Machine
  Amazon Web Services, 456–457
  characteristics, 111

Xerox Palo Alto Research Center, LAN history, F-99

XIMD architecture, L-34

Xon/Xoff, interconnection networks, F-10, F-17

Y

Yahoo!, WSCs, 465

Yield
  chip fabrication, 61–62
  cost trends, 27–32
  Fermi GTX 480, 324

Z

Z-80 microcontroller, cell phones, E-24

Zero condition code, MIPS core, K-9 to K-16

Zero-copy protocols
  definition, F-8
  message copying issues, F-91

Zero-load latency, Intel SCCC, F-70

Zuse, Konrad, L-4 to L-5

Zynga, FarmVille, 460