Instructor: S. Masoud Sadjadi (http://www.cs.fiu.edu/~sadjadi/Teaching/)
1
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/
sadjadi at cs dot fiu dot edu
Concurrent Computers
2
Acknowledgements
The content of many of the slides in these lecture notes has been adapted from online resources prepared previously by the people listed below. Many thanks!
Henri Casanova, Principles of High Performance Computing, http://navet.ics.hawaii.edu/~casanova, [email protected]
Kai Wang, Department of Computer Science, University of South Dakota, http://www.usd.edu/~Kai.Wang
Andrew Tanenbaum
3
Concurrency and Computers
We will see computer systems designed to allow concurrency (for performance benefits)
Concurrency occurs at many levels in computer systems:
Within a CPU: for example, on-chip parallelism
Within a "box": for example, coprocessors and multiprocessors
Across boxes: for example, multicomputers, clusters, and grids
4
Parallel Computer Architectures
(a) On-chip parallelism. (b) A coprocessor. (c) A multiprocessor.
(d) A multicomputer. (e) A grid.
5
Concurrency and Computers
We will see computer systems designed to allow concurrency (for performance benefits)
Concurrency occurs at many levels in computer systems: within a CPU, within a "box", and across boxes
6
Concurrency within a CPU
[Diagram: a CPU (registers, ALUs, and hardware to decode instructions and do all types of useful things), caches, RAM, controllers, I/O devices (displays, keyboards), network adapters, and the busses connecting them.]
7
Concurrency within a CPU
Several techniques allow concurrency within a single CPU:
Pipelining (RISC architectures, pipelined functional units)
ILP
Vector units
On-chip multithreading
Let's look at them briefly.
9
Pipelining
If one has a sequence of tasks to do, if each task consists of the same n steps or stages, and if different steps can be done simultaneously, then one can have a pipelined execution of the tasks (e.g., an assembly line).
Goal: higher throughput (i.e., number of tasks per time unit).
Time to do 1 task = 9; 2 tasks = 13; 3 tasks = 17; 4 tasks = 21; 10 tasks = 45; 100 tasks = 409.
Pays off if there are many tasks.
10
Pipelining
Each step goes only as fast as the slowest stage.
Therefore, the asymptotic throughput (i.e., the throughput as the number of tasks tends to infinity) is equal to:
1 / (duration of the slowest stage)
Therefore, in an ideal pipeline, all stages would be identical (a balanced pipeline).
Question: can we make all computer instructions consist of the same number of stages, where all stages take the same number of clock cycles?
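To make this concrete, here is a minimal C sketch (not from the slides) of the usual pipeline timing model: the first task pays the sum of all stage durations, and each additional task finishes one "slowest stage" later. With hypothetical stage durations summing to 9 and a slowest stage of 4, it reproduces the 9/13/17/21 numbers from the previous slide.

#include <stdio.h>

/* Total time to push ntasks tasks through a pipeline:
 * first task pays the full sum of stage durations, every
 * additional task finishes one "slowest stage" later.      */
double pipeline_time(const double *stage, int nstages, int ntasks) {
    double sum = 0.0, slowest = 0.0;
    for (int i = 0; i < nstages; i++) {
        sum += stage[i];
        if (stage[i] > slowest) slowest = stage[i];
    }
    return sum + (ntasks - 1) * slowest;
}

int main(void) {
    /* Hypothetical stage durations; the slowest stage (4) limits throughput. */
    double stage[] = {2, 4, 1, 2};   /* sums to 9 */
    for (int k = 1; k <= 4; k++)
        printf("time for %d task(s) = %.0f\n", k, pipeline_time(stage, 4, k));
    printf("asymptotic throughput = %.2f tasks per time unit\n", 1.0 / 4.0);
    return 0;
}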
11
RISC
Having all instructions doable in the same number of stages of the same duration is the RISC idea.
Example: the MIPS architecture (see THE architecture book by Patterson and Hennessy), with 5 stages:
Instruction Fetch (IF)
Instruction Decode (ID)
Instruction Execute (EX)
Memory access (MEM)
Register Write Back (WB)
Each stage takes one clock cycle.
[Diagram: two instructions, LD R2, 12(R3) and DADD R3, R5, R6, each flowing through IF ID EX MEM WB with their stages overlapped: concurrent execution of two instructions.]
12
Pipelined Functional Units
Although the RISC idea is attractive, some operations are just too expensive to be done in one clock cycle (during the EX stage).
Common example: floating-point operations.
Solution: implement them as a sequence of stages, so that they can be pipelined.
[Diagram: the IF, ID, MEM, and WB stages surround several execution units: a single-cycle integer EX unit, a 7-stage FP/integer multiply unit (M1-M7), and a 4-stage FP/integer add unit (A1-A4).]
13
Pipelining Today
Pipelined functional units are common.
Fallacy: all computers today are RISC.
RISC was of course one of the most fundamental "new" ideas in computer architecture.
x86: the most commonly used Instruction Set Architecture today.
Kept around for backwards-compatibility reasons, because it's easy to implement (not to program for).
BUT: modern x86 processors decode instructions into "micro-ops", which are then executed in a RISC manner.
Bottom line: pipelining is a pervasive (and conveniently hidden) form of concurrency in computers today.
Take a computer architecture course to learn all about it.
14
Concurrency within a CPU
Several techniques allow concurrency within a single CPU: pipelining, ILP, vector units, and on-chip multithreading
15
Instruction Level Parallelism
Instruction Level Parallelism (ILP) is the set of techniques by which the performance of a pipelined processor can be pushed even further.
ILP can be done by the hardware:
Dynamic instruction scheduling
Dynamic branch prediction
Multi-issue superscalar processors
ILP can be done by the compiler:
Static instruction scheduling
Multi-issue VLIW (Very Long Instruction Word) processors with multiple functional units
Broad concept: more than one instruction is issued per clock cycle (e.g., an 8-way multi-issue processor).
16
Concurrency within a CPU
Several techniques allow concurrency within a single CPU: pipelining, ILP, vector units, and on-chip multithreading
17
Vector Units
A functional unit that can do element-wise operations on entire vectors with a single instruction, called a vector instruction.
These are specified as operations on vector registers.
A "vector processor" comes with some number of such registers.
Example: the MMX extension on x86 architectures.
[Diagram: two vector registers of #elts elements feeding an adder array that performs #elts additions in parallel, producing a #elts-element result.]
18
Vector Units
Typically, a vector register holds ~32-64 elements.
But the number of elements is always larger than the amount of parallel hardware, called vector pipes or lanes (say 2-4).
[Diagram: two #elts-element registers processed #pipes elements at a time, i.e., #elts / #pipes additions in parallel per step.]
19
MMX Extension
Many techniques that are initially implemented in the "supercomputer" market find their way to the mainstream.
Vector units were pioneered in supercomputers.
Supercomputers are mostly used for scientific computing.
Scientific computing uses tons of arrays (to represent mathematical vectors) and often does regular computation with these arrays.
Therefore, scientific code is easy to "vectorize", i.e., to generate assembly that uses the vector registers and the vector instructions.
Examples: Intel's MMX or PowerPC's AltiVec.
MMX vector registers hold eight 8-bit elements, four 16-bit elements, or two 32-bit elements.
AltiVec: twice the lengths.
Used for "multimedia" applications: image processing, rendering, ...
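As an illustration of this style of packed arithmetic, here is a small sketch using SSE2 intrinsics, the successor of MMX available on any modern x86 compiler (the 128-bit registers hold eight 16-bit elements; the arrays and values below are made up for illustration).

#include <stdio.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

int main(void) {
    /* Two arrays of eight 16-bit elements each. */
    short a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    short b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    short c[8];

    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* load 8 shorts */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_add_epi16(va, vb);   /* eight 16-bit adds in ONE instruction */
    _mm_storeu_si128((__m128i *)c, vc);

    for (int i = 0; i < 8; i++)
        printf("%d ", c[i]);
    printf("\n");
    return 0;
}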
20
Vectorization Example
Conversion from RGB to YUV:
Y = (9798*R + 19235*G + 3736*B) / 32768;
U = (-4784*R - 9437*G + 4221*B) / 32768 + 128;
V = (20218*R - 16941*G - 3277*B) / 32768 + 128;
This kind of code is perfectly parallel, as all pixels can be computed independently.
It can be done easily with MMX vector capabilities:
Load 8 R values into an MMX vector register
Load 8 G values into an MMX vector register
Load 8 B values into an MMX vector register
Do the *, +, and / in parallel
Repeat
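For reference, a plain scalar C version of the conversion is sketched below (the coefficients are the ones on the slide; the function name and pixel layout are made up for illustration). A vectorizing compiler, or hand-written MMX/SSE intrinsics, can apply the same arithmetic to 8 pixels at a time.

/* Scalar RGB -> YUV conversion, one pixel per iteration (illustrative sketch).
 * Each pixel is independent, so the loop is trivially vectorizable: with
 * MMX/SSE one would load 8 R, 8 G, and 8 B values into vector registers
 * and perform the same *, +, and / operations in parallel.               */
void rgb_to_yuv(const int *R, const int *G, const int *B,
                int *Y, int *U, int *V, int n)
{
    for (int i = 0; i < n; i++) {
        Y[i] = ( 9798 * R[i] + 19235 * G[i] +  3736 * B[i]) / 32768;
        U[i] = (-4784 * R[i] -  9437 * G[i] +  4221 * B[i]) / 32768 + 128;
        V[i] = (20218 * R[i] - 16941 * G[i] -  3277 * B[i]) / 32768 + 128;
    }
}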
21
Concurrency within a CPU
Several techniques allow concurrency within a single CPU: pipelining, ILP, vector units, and on-chip multithreading
22
Multi-threaded Architectures
Computer architecture is a difficult field to make innovations in.
Who's going to spend money to manufacture your new idea?
Who's going to be convinced that a new compiler can/should be written?
Who's going to be convinced of a new approach to computing?
One of the "cool" innovations in the last decade has been the concept of a "multi-threaded architecture".
23
On-Chip Multithreading
Multithreading has been around for years, so what's new about this?
Here we're talking about hardware support for threads:
Simultaneous Multi-Threading (SMT)
Super-threading
Hyper-threading
Let's try to understand what all of these mean before looking at multi-threaded supercomputers.
24
Single-threaded Processor
[Diagram: a CPU consisting of a front-end (fetching/decoding/reordering) feeding an execution core (actual execution).]
Multiple programs are in memory, but only one executes at a time: time-slicing via context switching.
[Diagram: a 4-issue CPU and a 7-unit CPU, both showing pipeline "bubbles" (unused issue slots).]
25
Single-threaded SMP?
Two threads execute at once, so threads spend less time waiting.
The number of "bubbles" is also doubled: twice as much speed and twice as much waste.
26
Super-threading
Principle: the processor can execute more than one thread at a time.
Also called time-slice multithreading.
The processor is then called a multithreaded processor.
Requires more hardware cleverness: logic switches at each cycle.
Leads to less waste: a thread can run during a cycle while another thread is waiting for memory.
Just a finer grain of interleaving.
But there is a restriction: each stage of the front end or the execution core only runs instructions from ONE thread!
Does not help with poor instruction parallelism within one thread.
Does not reduce bubbles within a row.
27
Hyper-threading
Principle: the processor can execute more than one thread at a time, even within a single clock cycle!
Requires even more hardware cleverness: logic switches within each cycle.
On the diagram, only two threads execute simultaneously.
Intel's hyper-threading only adds 5% to the die area.
Some people argue that "two" is not "hyper".
Finest level of interleaving.
From the OS perspective, there are two "logical" processors.
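One visible consequence: a hyper-threaded CPU reports two logical processors to the operating system. A quick way to see this on Linux and most Unix systems is sketched below (sysconf with _SC_NPROCESSORS_ONLN is widely supported, though not strictly required by POSIX).

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Number of logical processors the OS can schedule threads on;
     * a single-core hyper-threaded CPU reports 2.                   */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical processors online: %ld\n", n);
    return 0;
}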
28
Concurrency and Computers
We will see computer systems designed to allow concurrency (for performance benefits)
Concurrency occurs at many levels in computer systems: within a CPU, within a "box", and across boxes
29
Concurrency within a “Box”
Two main techniques: SMP and multi-core.
Let's look at both of them.
30
SMPs
Symmetric Multi-Processors (often mislabeled as "Shared-Memory Processors", which has now become tolerated).
Processors are all connected to a single memory.
Symmetric: each memory cell is equally close to all processors.
Many dual-proc and quad-proc systems, e.g., for servers.
[Diagram: processors P1, P2, ..., Pn, each with a cache ($), connected over a network/bus to a single memory.]
31
Distributed Caches
The problem with distributed caches is that of memory consistency.
Intuitive memory model: reading an address should return the last value written to that address.
Easy to do in uniprocessors, although there may be some I/O issues.
But difficult in multi-processor / multi-core systems.
Memory consistency: "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." [Lamport, 1979]
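A classic two-thread example makes the definition concrete. In the illustrative pthreads sketch below (not from the slides, and deliberately unsynchronized), each thread writes one shared variable and then reads the other; under sequential consistency at least one thread must observe a 1, but on real multiprocessors without proper synchronization both may read 0 because of caching and reordering.

#include <pthread.h>
#include <stdio.h>

int x = 0, y = 0;     /* shared data      */
int r1, r2;           /* values each thread reads */

void *t1(void *arg) { (void)arg; x = 1; r1 = y; return NULL; }
void *t2(void *arg) { (void)arg; y = 1; r2 = x; return NULL; }

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* Sequential consistency forbids r1 == 0 && r2 == 0,
     * yet this unsynchronized code may observe exactly that. */
    printf("r1 = %d, r2 = %d\n", r1, r2);
    return 0;
}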
32
Cache Coherency
Memory consistency is jeopardized by having multiple caches:
P1 and P2 both have a cached copy of a data item.
P1 writes to it, possibly with a write-through to memory.
At this point P2 owns a stale copy.
When designing a multi-processor system, one must ensure that this cannot happen, by defining protocols for cache coherence.
33
Snoopy Cache-Coherence
The memory bus is a broadcast medium.
Caches contain information on which addresses they store.
Each cache controller "snoops" all transactions on the bus.
A transaction is relevant if it involves a cache block currently contained in this cache.
The controller takes action to ensure coherence: invalidate, update, or supply the value.
[Diagram: processors P0 ... Pn, each with a cache holding (state, address, data) entries, and memory modules attached to a shared memory bus; each cache snoops the memory operations issued by the other processors.]
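A rough sketch of what such a snooping controller does, assuming a simple MSI-style invalidation protocol (a pedagogical simplification, not the protocol of any particular machine; write-back of dirty data is omitted):

/* Pedagogical sketch of an invalidation-based (MSI-style) snoop action. */
typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

typedef struct { unsigned long addr; line_state_t state; } cache_line_t;

/* Called for every bus transaction this cache observes ("snoops"). */
void snoop(cache_line_t *line, unsigned long bus_addr, int is_remote_write) {
    if (line->state == INVALID || line->addr != bus_addr)
        return;                    /* transaction is not relevant here      */
    if (is_remote_write)
        line->state = INVALID;     /* another cache is writing: invalidate  */
    else if (line->state == MODIFIED)
        line->state = SHARED;      /* remote read: supply value, downgrade  */
}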
34
Limits of Snoopy Coherence
[Diagram: several processors, each with its own cache, sharing a single bus to the memory modules.]
Assume a 4 GHz processor:
=> 16 GB/s instruction bandwidth per processor (32-bit instructions)
=> 9.6 GB/s data bandwidth per processor, at a 30% load-store fraction with 8-byte elements
(25.6 GB/s total demand per processor before caching)
Suppose a 98% instruction hit rate and a 90% data hit rate:
=> 320 MB/s instruction bandwidth per processor on the bus
=> 960 MB/s data bandwidth per processor on the bus
=> 1.28 GB/s combined bus bandwidth per processor
Assuming 10 GB/s of bus bandwidth, 8 processors will saturate the bus.
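The slide's arithmetic can be reproduced in a few lines of C (a sketch; the clock rate, hit rates, and bus bandwidth are the values assumed on the slide):

#include <stdio.h>

int main(void) {
    double clock   = 4e9;                     /* 4 GHz                              */
    double inst_bw = clock * 4;               /* 32-bit instructions: 16 GB/s       */
    double data_bw = clock * 0.30 * 8;        /* 30% load-stores of 8 bytes: 9.6 GB/s */
    double bus_per_proc = inst_bw * (1 - 0.98)    /* 2% instruction misses  */
                        + data_bw * (1 - 0.90);   /* 10% data misses        */
    double bus_bw = 10e9;                     /* assumed bus bandwidth              */
    printf("bus traffic per processor: %.2f GB/s\n", bus_per_proc / 1e9);
    printf("processors needed to saturate the bus: %.1f\n", bus_bw / bus_per_proc);
    return 0;
}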
35
Sample Machines
Intel Pentium Pro Quad: coherent, 4 processors.
[Diagram: four P-Pro modules (CPU, 256-KB L2 cache, bus interface, interrupt controller) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), with a memory controller (MIU) driving 1-, 2-, or 4-way interleaved DRAM, and PCI bridges to PCI buses with I/O cards.]
Sun Enterprise server: coherent, up to 16 processor and/or memory-I/O cards.
[Diagram: CPU/memory cards (each with two processors, their L2 caches, and a memory controller) and I/O cards, all attached through bus interfaces/switches to the Gigaplane bus (256-bit data, 41-bit address, 83 MHz); the I/O cards provide SBus, Fibre Channel, 100bT, and SCSI interfaces.]
40
Concurrency within a “Box”
Two main techniques: SMP and multi-core.
41
Moore’s Law
Moore's Law describes an important trend in the history of computer hardware: the number of transistors that can be inexpensively placed on an integrated circuit increases exponentially, doubling approximately every two years.
The observation was first made by Intel co-founder Gordon E. Moore in a 1965 paper.
The trend has continued for more than half a century and is not expected to stop for another decade at least.
42
Moore's Law!
Many people interpret Moore's law as "computers get twice as fast every 18 months", which is not technically true: it's all about transistor density.
And the "twice as fast" reading is no longer borne out anyway: we should have 10 GHz processors right now, and we don't!
43
No more Moore?
We are used to getting faster CPUs all the time.
We are used to them keeping up with ever more demanding software.
Known as "Andy giveth, and Bill taketh away" (Andy Grove, Bill Gates).
It's a nice way to force people to buy computers often.
But basically, our computers get better, do more things, and it just happens automatically.
Some people call this the "performance free lunch".
Conventional wisdom: "Not to worry, tomorrow's processors will have even more throughput, and anyway today's applications are increasingly throttled by factors other than CPU throughput and memory speed (e.g., they're often I/O-bound, network-bound, database-bound)."
44
Commodity Improvements
There are three main ways in which commodity processors keep improving:
Higher clock rates
More aggressive instruction reordering and more concurrent units
Bigger/faster caches
All applications can easily benefit from these improvements, at the cost of perhaps a recompilation.
Unfortunately, the first two are hitting their limits:
Higher clock rates lead to high heat and power consumption.
No more instruction reordering is possible without compromising correctness.
45
Is Moore's Law Not True?
Ironically, Moore's law is still true: the density indeed still doubles.
But its wrong interpretation is not: clock rates no longer double.
Yet we can't let this happen: computers have to get more powerful.
Therefore, the industry has thought of a new way to improve them: multi-core, i.e., multiple CPUs on a single chip.
Multi-core adds another level of concurrency.
But unlike, say, multiple functional units, it is hard for a compiler to exploit automatically.
Therefore, applications must be rewritten to benefit from the (nowadays expected) performance increase.
"Concurrency is the next major revolution in how we write software" (Dr. Dobb's Journal, 30(3), March 2005)
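To make "applications must be rewritten" concrete, here is a minimal OpenMP sketch (not from the slides) of parallelizing an independent loop across cores; OpenMP is one standard way to do this on an SMP/multi-core box, and the arrays and sizes are made up for illustration.

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* The iterations are independent, so OpenMP can split them across cores. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("ran on up to %d threads, c[N-1] = %f\n",
           omp_get_max_threads(), c[N - 1]);
    return 0;
}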
46
Multi-core Processors
In addition to putting concurrency in the public's eye, multi-core architectures will have a deep impact.
Languages will be forced to deal well with concurrency: new language designs? new language extensions? new compilers?
Efficiency and performance optimization will become more important: write code that is fast on one core with a limited clock rate.
The CPU may very well become a bottleneck (again) for single-core programs: other factors will improve, but not the clock rate.
Prediction: many companies will be hiring people to (re)write concurrent applications.
47
Multi-Core
Quote from PC World Magazine, Summer 2005:
"Don't expect dual-core to be the top performer today for games and other demanding single-threaded applications. But that will change as applications are rewritten. For example, by year's end, Unreal Tournament should have released a new game engine that takes advantage of dual-core processing."
48
Concurrency and Computers
We will see computer systems designed to allow concurrency (for performance benefits)
Concurrency occurs at many levels in computer systems: within a CPU, within a "box", and across boxes
49
Multiple Boxes Together
Example:
Take four "boxes", e.g., four Intel Itaniums bought at Dell.
Hook them up to a network, e.g., a switch bought from Cisco, Myricom, etc.
Install software that allows you to write/run applications that can utilize these four boxes concurrently.
This is a simple way to achieve concurrency across computer systems.
Everybody has heard of "clusters" by now: they are basically like the above example and can be purchased already built from vendors.
We will talk about this kind of concurrent platform at length during this class.
50
Multiple Boxes Together
Why do we use multiple boxes?
Every programmer would rather have an SMP/multi-core architecture that provides all the power/memory she/he needs.
The problem is that single boxes do not scale to meet the needs of many scientific applications:
Can't have enough processors or powerful enough cores
Can't have enough memory
But if you can live with a single box, do it! We will see that single-box programming is much easier than multi-box programming.
51
Where does this leave us?
So far we have seen many ways in which concurrency can be achieved/implemented in computer systems: within a box and across boxes.
So we could look at a system and just list all the ways in which it does concurrency.
It would be nice to have a great taxonomy of parallel platforms in which we can pigeon-hole all (past and present) systems.
It would provide simple names that everybody can use and understand quickly.
52
Taxonomy of parallel machines?
It's not going to happen.
Recently, Gordon Bell and Jim Gray published an article in the Communications of the ACM discussing what the taxonomy should be.
Dongarra, Sterling, etc. answered, telling them they were wrong, saying what the taxonomy should be, and proposing a new multi-dimensional scheme!
Both papers agree that most terms are conflated, misused, etc. (e.g., MPP).
The situation is complicated by the fact that concurrency appears at so many levels.
Example: a 16-node cluster, where each node is a 4-way multi-processor, where each processor is hyper-threaded, has vector units, and is fully pipelined with multiple, pipelined functional units.
53
Taxonomy of Platforms?
We'll look at one traditional taxonomy.
We'll look at current categorizations from the Top500.
We'll look at examples of platforms.
We'll look at interesting/noteworthy architectural features that one should know as part of one's parallel computing culture.
54
The Flynn Taxonomy
Proposed in 1966!
A functional taxonomy based on the notion of streams of information: data and instructions.
Platforms are classified according to whether they have a single (S) or multiple (M) stream of each of the above.
Four possibilities:
SISD (sequential machine)
SIMD
MIMD
MISD (rare, no commercial system... systolic arrays)
55
Taxonomy of Parallel Computers
Flynn’s taxonomy of parallel computers.
56
SIMD
PEs (processing elements) can be deactivated and activated on the fly.
Vector processing (e.g., vector add) is easy to implement on SIMD machines.
Debate: is a vector processor an SIMD machine?
The two are often confused; strictly speaking it is not true according to the taxonomy (a vector processor is really SISD with pipelined operations), but it's convenient to think of the two as equivalent.
[Diagram: a control unit fetches, decodes, and broadcasts a single stream of instructions to many processing elements.]
57
MIMD
The most general category.
Pretty much every supercomputer in existence today is an MIMD machine at some level.
This limits the usefulness of the taxonomy, but you have to have heard of it at least once because people keep referring to it, somehow...
Other taxonomies have been proposed, none very satisfying.
Shared- vs. distributed-memory is a common distinction among machines, but these days many are hybrid anyway.
58
Taxonomy of Parallel Computers
A taxonomy of parallel computers.
59
A host of parallel machines
There are (and have been) many kinds of parallel machines.
For the last 12 years their performance has been measured and recorded with the LINPACK benchmark, as part of the Top500 list (http://www.top500.org).
It is a good source of information about what machines are (and were) and how they have evolved.
Note that it's really about "supercomputers".
61
What can we find on the Top500?
62
Pies
63
Top Ten Computers (http://www.top500.org)
64
Top 500 Computers--Countries (http://www.top500.org)
65
Top 500 Computers--Manufacturers (http://www.top500.org)
66
Top 500 Computers—Manufacturers Trend (http://www.top500.org)
67
Top 500 Computers--Operating Systems (http://www.top500.org)
68
Top 500 Computers—Operating Systems Trend (http://www.top500.org)
69
Top 500 Computers--Processors (http://www.top500.org)
70
Top 500 Computers—Processors Trend (http://www.top500.org)
71
Top 500 Computers--Customers (http://www.top500.org)
72
Top 500 Computers—Customers Trend (http://www.top500.org)
73
Top 500 Computers--Applications (http://www.top500.org)
74
Top 500 Computers--Applications Trend (http://www.top500.org)
75
Top 500 Computers—Architecture (http://www.top500.org)
76
Top 500 Computers—Architecture Trend (http://www.top500.org)
78
SMPs
"Symmetric Multi-Processors" (often mislabeled as "Shared-Memory Processors", which has now become tolerated).
Processors are all connected to a (large) memory.
UMA: Uniform Memory Access, which makes it easy to program.
Symmetric: all memory is equally close to all processors.
Difficult to scale to many processors (< 32, typically).
Cache coherence via "snoopy caches" or "directories".
[Diagram: processors P1, P2, ..., Pn, each with a cache ($), connected over a network/bus to memory.]
79
Distributed Shared Memory
Memory is logically shared, but physically distributed in banks.
Any processor can access any address in memory.
Cache lines (or pages) are passed around the machine.
Cache coherence: distributed directories.
NUMA: Non-Uniform Memory Access (some processors may be closer to some banks).
The SGI Origin2000 is a canonical example: it scales to 100s of processors and uses a hypercube topology for the memory (more on this later).
[Diagram: processors P1, P2, ..., Pn, each with a cache ($), connected via a network to multiple memory banks.]
80
Clusters, Constellations, MPPs
These are the only 3 categories in today's Top500.
They all belong to the distributed-memory model (MIMD), with many twists.
Each processor/node has its own memory and cache but cannot directly access another processor's memory.
Nodes may be SMPs.
Each "node" has a network interface (NI) for all communication and synchronization.
So what are these 3 categories?
[Diagram: nodes P0, P1, ..., Pn, each with its own memory and NI, attached to an interconnect.]
81
Clusters
58.2% of the Top500 machines are labeled as "clusters".
Definition: a parallel computer system comprising an integrated collection of independent "nodes", each of which is a system in its own right, capable of independent operation, and derived from products developed and marketed for other standalone purposes.
A commodity cluster is one in which both the network and the compute nodes are available in the market.
In the Top500, "cluster" means "commodity cluster".
A well-known type of commodity cluster is the "Beowulf-class PC cluster", or "Beowulf".
82
What is Beowulf?
An experiment in parallel computing systems.
It established a vision of low-cost, high-end computing with public-domain software (and led to software development).
Tutorials and a book document best practices on how to build such platforms.
Today, by "Beowulf cluster" one means a commodity cluster that runs Linux and GNU-type software.
The project was initiated by T. Sterling and D. Becker at NASA in 1994.
83
Constellations???
Commodity clusters that differ from the previous ones by the dominant level of parallelism.
Clusters consist of nodes, and nodes are typically SMPs.
If there are more processors in a node than nodes in the cluster, then we have a constellation.
Typically, constellations are space-shared among users, with each user running OpenMP on a node, although an app could run on the whole machine using MPI/OpenMP.
To be honest, this term is not very useful and not much used.
84
MPP????????
Probably the most imprecise term for describing a machine (isn't a 256-node cluster of 4-way SMPs massively parallel?).
MPPs may use proprietary networks and vector processors, as opposed to commodity components.
The Cray T3E, Cray X1, and Earth Simulator are distributed-memory machines, but their nodes are SMPs.
Basically, everything that's fast and not commodity is an MPP, in terms of today's Top500.
Let's look at these "non-commodity" things (people's definition of "commodity" varies).
85
Vector Processors
Vector architectures were based on a single processor with multiple functional units, all performing the same operation.
Instructions may specify large amounts of parallelism (e.g., 64-way), but the hardware executes only a subset in parallel.
Historically important, but overtaken by MPPs in the 90s, as seen in the Top500.
Re-emerging in recent years:
At a large scale in the Earth Simulator (NEC SX6) and Cray X1
At a small scale in SIMD media extensions to microprocessors: SSE, SSE2 (Intel: Pentium/IA64), AltiVec (IBM/Motorola/Apple: PowerPC), VIS (Sun: Sparc)
Key idea: the compiler does some of the difficult work of finding parallelism, so the hardware doesn't have to.
86
Vector Processors
Advantages:
Quick fetch and decode of a single instruction for multiple operations.
The instruction provides the processor with a regular source of data, which can arrive at each cycle and be processed in a pipelined fashion.
The compiler does the work for you, of course.
Memory-to-memory vector machines:
No registers; can process very long vectors, but start-up time is large.
Appeared in the 70s and died in the 80s.
Vendors: Cray, Fujitsu, Hitachi, NEC.
87
Global Address Space
Examples: Cray T3D, T3E, X1, and HP AlphaServer clusters.
The network interface supports "Remote Direct Memory Access": the NI can directly access memory without interrupting the CPU.
One processor can read/write remote memory with one-sided operations (put/get).
Not just a load/store as on a shared-memory machine.
Remote data is typically not cached locally.
[Diagram: nodes P0, P1, ..., Pn, each with its own memory and NI, attached to an interconnect.]
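The same one-sided put/get style is also exposed in software, for instance by MPI-2's remote memory access interface. A rough sketch (illustrative values; run with at least 2 processes): the origin process writes directly into the target's exposed memory without the target issuing a matching receive.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int buf = -1;                       /* memory exposed for remote access */
    MPI_Win win;
    MPI_Win_create(&buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {                    /* process 0 writes into process 1's buf */
        int value = 42;
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1) printf("rank 1 received %d via one-sided put\n", buf);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}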
95
Blue Gene/L
65,536 processors.
Relatively modest clock rates, so that power consumption is low, cooling is easy, and space is small (1024 nodes in the same rack).
Besides, processor speed is on par with memory speed, so a faster clock rate would not help.
2-way SMP nodes (really different from the X1).
Several networks:
A 64x32x32 3-D torus for point-to-point communication
A tree for collective operations and for I/O
Plus Ethernet and others
96
BlueGene
The BlueGene/L custom processor chip.
97
BlueGene
The BlueGene/L. (a) Chip. (b) Card. (c) Board. (d) Cabinet. (e) System.
100
If You Like Dead Supercomputers
Lots of old supercomputers with pictures: http://www.geocities.com/Athens/6270/superp.html
Dead Supercomputers: http://www.paralogos.com/DeadSuper/Projects.html
On e-Bay: a Cray Y-MP/C90 (1993) sold for $45,100.70, from the Pittsburgh Supercomputer Center, who wanted to get rid of it to make space in their machine room.
Original cost: $35,000,000. Weight: 30 tons.
It cost $400,000 to make it work at the buyer's ranch in Northern California.
101
Network Topologies
People have experimented with different topologies for distributed-memory machines, or to arrange memory banks in NUMA shared-memory machines.
Examples include:
Ring: KSR (1991)
2-D grid: Intel Paragon (1992)
Torus
Hypercube: nCube, Intel iPSC/860; used in the SGI Origin 2000 for memory
Fat-tree: IBM Colony and Federation interconnects (SP-x)
Arrangements of switches: pioneered with "butterfly networks" like the BBN TC2000 in the early 1990s (200 MHz processors in a multi-stage network of switches, virtually shared distributed memory (NUMA); I actually worked with that one!)
102
Hypercube
A hypercube is defined by its dimension, d.
[Diagram: 1D, 2D, 3D, and 4D hypercubes.]
103
Hypercube
Properties:
Has 2^d nodes.
The number of hops between two nodes is at most d.
The diameter of the network grows logarithmically with the number of nodes, which was the key reason for interest in hypercubes.
But each node needs d neighbors, which is a problem.
Routing and addressing:
[Diagram: a 4-D hypercube with nodes labeled 0000 through 1111.]
Each node has a d-bit address.
Routing from xxxx to yyyy: just keep going to a neighbor that has a smaller Hamming distance to the destination.
Reminiscent of some P2P schemes.
TONS of hypercube research (even today!)
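The routing rule is simple to state in code: at each node, flip one address bit in which the current node still differs from the destination; each flip moves to a neighbor and reduces the Hamming distance by one. A small sketch (not from the slides), assuming d-bit node addresses stored in unsigned ints:

#include <stdio.h>

/* Next hop on a d-dimensional hypercube: flip the lowest-order bit in
 * which the current node's address differs from the destination's.
 * Each hop reduces the Hamming distance by exactly one.               */
unsigned next_hop(unsigned current, unsigned dest) {
    unsigned diff = current ^ dest;
    if (diff == 0) return current;       /* already at the destination   */
    return current ^ (diff & -diff);     /* flip the lowest differing bit */
}

int main(void) {
    unsigned node = 0x3 /* 0011 */, dest = 0xC /* 1100 */;
    while (node != dest) {
        unsigned next = next_hop(node, dest);
        printf("%x -> %x\n", node, next);
        node = next;
    }
    return 0;
}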
104
Conclusion
Concurrency appears at all levels, both in "commodity systems" and in "supercomputers" (the distinction is rather annoying).
When needing performance, one has to exploit concurrency to the best of its capabilities:
e.g., as a developer of a geophysics application to run on a 10,000-processor heavy-iron supercomputer at the Sandia national lab
e.g., as a game developer on an 8-way multi-core hyper-threaded desktop system sold by Dell
In this course we'll gain hands-on understanding of how to write concurrent/parallel software:
Using the GCB and MIND clusters
Using the LA Grid and Open Science Grid